GPT-4o Transcribe Diarize

Authentication

authorization `string` required

All APIs require authentication via Bearer Token.

Get API Key:

Visit the API Key Management page to get your API key.

Usage:

Add to request header:

Authorization: Bearer YOUR_API_KEY
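As a minimal sketch, the header above could be built in Python as follows. The helper name `build_auth_headers` is hypothetical and not part of the API; only the `Authorization: Bearer ...` format comes from this page.

```python
def build_auth_headers(api_key: str) -> dict[str, str]:
    """Return the Authorization header described above.

    Hypothetical helper for illustration; not part of the API itself.
    """
    if not api_key or not api_key.strip():
        raise ValueError("API key must be a non-empty string")
    # Bearer scheme: the literal word "Bearer", one space, then the key.
    return {"Authorization": f"Bearer {api_key.strip()}"}


headers = build_auth_headers("YOUR_API_KEY")
```

The resulting dictionary can be passed as the headers argument of whichever HTTP client you use.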

Parameters

file `file` required

Audio file to transcribe with speaker diarization.

Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm

File size limit: 25 MB
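The format and size constraints above can be checked before uploading. The sketch below is illustrative only: `check_audio_file` is a hypothetical helper, it judges the format by file extension alone, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Formats listed for the `file` parameter.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
# Assumption: the 25 MB limit is interpreted as 25 MiB.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_audio_file(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(name).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    return problems
```

For a file on disk, the size can be obtained with `Path(path).stat().st_size`.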

model `string` required

Model ID to use for the request.

Value: vtrix-gpt-4o-transcribe-diarize

response_format `string`

Format of the output transcript.

Options: json, text, diarized_json

Default: json

The diarized_json format includes speaker segments with speaker, start, and end metadata.

chunking_strategy `string`

Audio segmentation strategy. Required when audio is longer than 30 seconds.

Options: auto (recommended), or a custom Voice Activity Detection (VAD) configuration

Default: null
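Because chunking_strategy defaults to null but is required for audio longer than 30 seconds, a client may want to set it conditionally. The following is a sketch under assumptions: `build_form_fields` is a hypothetical helper, the caller already knows the audio duration, and the audio file itself is attached separately to the multipart request.

```python
MODEL_ID = "vtrix-gpt-4o-transcribe-diarize"
RESPONSE_FORMATS = {"json", "text", "diarized_json"}


def build_form_fields(duration_seconds: float,
                      response_format: str = "json") -> dict[str, str]:
    """Assemble the non-file form fields for a transcription request."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(RESPONSE_FORMATS)}")
    fields = {"model": MODEL_ID, "response_format": response_format}
    # chunking_strategy is required once the audio exceeds 30 seconds;
    # "auto" is the recommended option.
    if duration_seconds > 30:
        fields["chunking_strategy"] = "auto"
    return fields
```

Sending chunking_strategy as auto for every request would also be a reasonable choice; this sketch only adds it when the documented threshold is crossed.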

known_speaker_names `array`

Array of known speaker names to map segments onto. Maximum 4 speakers supported.

Use with known_speaker_references to provide reference audio clips for each speaker.

Example: ["agent", "customer"]

known_speaker_references `array`

Array of reference audio clips encoded as data URLs. Each clip should be 2-10 seconds long.

Each entry must correspond to an entry in the known_speaker_names array. Supports the same audio formats as the main file.

Example: ["data:audio/wav;base64,AAA...", "data:audio/wav;base64,BBB..."]
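A reference clip can be turned into a data URL with the standard library. This is a minimal sketch: `to_data_url` and `build_known_speakers` are hypothetical helpers, the MIME type defaults to audio/wav for illustration, and how the two arrays are encoded in the multipart request is not specified on this page.

```python
import base64

MAX_KNOWN_SPEAKERS = 4  # limit stated for known_speaker_names


def to_data_url(clip_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(clip_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_known_speakers(clips: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Return parallel lists of speaker names and reference data URLs."""
    if len(clips) > MAX_KNOWN_SPEAKERS:
        raise ValueError(f"at most {MAX_KNOWN_SPEAKERS} known speakers are supported")
    names = list(clips)  # dicts preserve insertion order
    references = [to_data_url(clips[name]) for name in names]
    return names, references
```

Building both lists from one mapping keeps each name aligned with its clip. The sketch does not check that each clip is 2-10 seconds long.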

stream `boolean`

Whether to stream the transcription incrementally. When enabled, emits transcript.text.segment events for each completed segment.

Default: false

language `string`

Language of the input audio in ISO-639-1 or ISO-639-3 format. Providing the input language improves accuracy and latency.

Examples: en (English), zh (Chinese), ja (Japanese), es (Spanish)

Response Format

text `string`

The complete transcribed text from the audio file without speaker labels.

segments `array`

Array of speaker-labeled segments. Only present when response_format is diarized_json.

speaker `string`

Speaker identifier. Format: speaker_1, speaker_2, etc., or the known speaker name if one was provided.

text `string`

Transcribed text for this segment.

start `number`

Segment start time in seconds.

end `number`

Segment end time in seconds.
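To illustrate how the segment fields fit together, the sketch below formats a diarized_json response as a readable transcript and totals speaking time per speaker. The sample response is invented for illustration and the function names are hypothetical; only the field names (text, segments, speaker, start, end) come from this page.

```python
from collections import defaultdict


def format_transcript(response: dict) -> str:
    """Render one line per segment: [start-end] speaker: text."""
    return "\n".join(
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in response.get("segments", [])
    )


def speaking_time(response: dict) -> dict[str, float]:
    """Sum segment durations (in seconds) for each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Invented sample data in the documented shape.
sample = {
    "text": "Hello. Hi, I need help. Sure.",
    "segments": [
        {"speaker": "agent", "text": "Hello.", "start": 0.0, "end": 2.0},
        {"speaker": "customer", "text": "Hi, I need help.", "start": 2.0, "end": 5.0},
        {"speaker": "agent", "text": "Sure.", "start": 5.0, "end": 6.0},
    ],
}
```

Both functions return empty results when segments is absent, which is the case for response formats other than diarized_json.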

Supported Languages

Supports 98 languages including: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

Notes

vtrix-gpt-4o-transcribe-diarize does not support the following parameters:

prompt - Not available for diarized transcription
logprobs - Not available for diarized transcription
timestamp_granularities[] - Not available for diarized transcription
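If request parameters are shared with other transcription models, the unsupported ones can be filtered out before sending. This is a sketch only: `drop_unsupported` is a hypothetical helper, and how the API itself responds to these parameters is not stated here.

```python
# Parameters listed above as unavailable for diarized transcription.
UNSUPPORTED_PARAMS = {"prompt", "logprobs", "timestamp_granularities[]"}


def drop_unsupported(params: dict) -> tuple[dict, list[str]]:
    """Return (params without unsupported keys, sorted list of removed keys)."""
    dropped = sorted(key for key in params if key in UNSUPPORTED_PARAMS)
    kept = {key: value for key, value in params.items()
            if key not in UNSUPPORTED_PARAMS}
    return kept, dropped
```

Returning the removed keys lets the caller log or warn about them instead of discarding them silently.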
