Authentication
authorization string required
All API endpoints require authentication with a Bearer token.
Get API Key:
Visit the API Key Management page to get your API key.
Usage:
Add to request header:
Authorization: Bearer YOUR_API_KEY
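As a minimal sketch of the header described above, the helper below builds the Authorization header as a plain dictionary. The function name auth_headers is hypothetical and chosen only for illustration. The dictionary can be passed to whichever HTTP client you use.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Return the Authorization header for a Bearer-token request.

    auth_headers is a hypothetical helper name, not part of the API.
    """
    # Reject empty keys and keys with stray whitespace, which commonly
    # slip in when a key is copied from a web page.
    if not api_key or api_key != api_key.strip():
        raise ValueError("API key must be non-empty with no surrounding whitespace")
    return {"Authorization": f"Bearer {api_key}"}


headers = auth_headers("YOUR_API_KEY")
```

Keeping the key out of source code (for example, reading it from an environment variable) is generally the safer choice.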
Parameters
file file required
Audio file to transcribe with speaker diarization.
Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
File size limit: 25 MB
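The format and size constraints above can be checked locally before uploading. The sketch below is illustrative only: check_audio is a hypothetical helper, the file extension is used as a rough proxy for the actual audio format, and "25 MB" is assumed to mean 25 * 1024 * 1024 bytes, which may not match how the service counts.

```python
import os

SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

# Assumption: 25 MB is interpreted as 25 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_audio(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Extension check only; the file contents are not inspected.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    return problems
```

For a file on disk, the size can be obtained with os.path.getsize(path) and passed as size_bytes.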
model string required
Model ID to use for the request.
Value: vtrix-gpt-4o-transcribe-diarize
response_format string
Format of the output transcript.
Options: json, text, diarized_json
Default: json
diarized_json format includes speaker segments with speaker, start, and end metadata.
chunking_strategy string
Audio segmentation strategy. Required when audio is longer than 30 seconds.
Options: auto (recommended), or custom Voice Activity Detection configuration
Default: null
known_speaker_names array
Array of known speaker names to map segments onto. Maximum 4 speakers supported.
Use with known_speaker_references to provide reference audio clips for each speaker.
Example: ["agent", "customer"]
known_speaker_references array
Array of reference audio clips encoded as data URLs. Each clip should be 2-10 seconds long.
Entries must correspond to the entries in the known_speaker_names array. Supports the same audio formats as the main file.
Example: ["data:audio/wav;base64,AAA...", "data:audio/wav;base64,BBB..."]
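One way to prepare the two speaker arrays is sketched below. It encodes raw clip bytes as base64 data URLs and keeps names and references in the same order. The helper names to_data_url and build_known_speakers are hypothetical, and the sketch assumes that "must correspond" means matching by position. It does not verify the 2-10 second clip length, since that would require decoding the audio.

```python
import base64

MAX_KNOWN_SPEAKERS = 4  # limit stated for known_speaker_names


def to_data_url(audio_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_known_speakers(clips: dict[str, bytes], mime_type: str = "audio/wav"):
    """Return (names, references) as two lists aligned by position.

    clips maps a speaker name to the raw bytes of that speaker's reference clip.
    """
    if len(clips) > MAX_KNOWN_SPEAKERS:
        raise ValueError(f"at most {MAX_KNOWN_SPEAKERS} known speakers are supported")
    names = list(clips)  # dicts preserve insertion order
    references = [to_data_url(clips[name], mime_type) for name in names]
    return names, references
```

In practice the clip bytes would come from reading each reference file in binary mode, with mime_type set to match the clip's format.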
stream boolean
Whether to stream the transcription incrementally. When enabled, emits transcript.text.segment events for each completed segment.
Default: false
language string
Language of the input audio in ISO-639-1 or ISO-639-3 format. Providing the input language improves accuracy and latency.
Examples: en (English), zh (Chinese), ja (Japanese), es (Spanish)
Response Format
text string
The complete transcribed text from the audio file without speaker labels.
segments array
Array of speaker-labeled segments. Only present when response_format is diarized_json.
speaker string
Speaker identifier. Format: speaker_1, speaker_2, etc., or the known speaker name if provided.
text string
Transcribed text for this segment.
start number
Segment start time in seconds.
end number
Segment end time in seconds.
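To show how the segment fields above might be consumed, the sketch below formats a diarized_json response as a readable transcript and totals the talk time per speaker. The sample payload is hand-written for illustration and is not real API output; the function names are hypothetical.

```python
from collections import defaultdict

# Hand-written sample in the documented shape; not real API output.
sample = {
    "text": "Hello, how can I help? My order is late. Let me check.",
    "segments": [
        {"speaker": "agent", "text": "Hello, how can I help?", "start": 0.0, "end": 2.5},
        {"speaker": "customer", "text": "My order is late.", "start": 2.5, "end": 4.5},
        {"speaker": "agent", "text": "Let me check.", "start": 4.5, "end": 5.5},
    ],
}


def format_transcript(response: dict) -> list[str]:
    """Render each segment as '[start-end] speaker: text'."""
    return [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in response.get("segments", [])
    ]


def talk_time_by_speaker(response: dict) -> dict[str, float]:
    """Sum segment durations (end - start, in seconds) for each speaker."""
    totals = defaultdict(float)
    # segments is only present for diarized_json, so default to an empty list.
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

Both functions return empty results when segments is absent, which is the case for the json and text response formats.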
Supported Languages
Supports 98 languages including: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
Notes
vtrix-gpt-4o-transcribe-diarize does not support the following parameters:
prompt - Not available for diarized transcription
logprobs - Not available for diarized transcription
timestamp_granularities[] - Not available for diarized transcription
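If request parameters are shared with other transcription models, the unsupported ones listed above can be filtered out before sending. The sketch below is one possible approach; split_unsupported is a hypothetical helper, and how the API responds when these parameters are sent anyway is not stated in this document.

```python
UNSUPPORTED_PARAMS = {"prompt", "logprobs", "timestamp_granularities[]"}


def split_unsupported(params: dict) -> tuple[dict, list[str]]:
    """Return (kept, dropped): params safe to send, and the names removed."""
    kept = {key: value for key, value in params.items() if key not in UNSUPPORTED_PARAMS}
    dropped = sorted(key for key in params if key in UNSUPPORTED_PARAMS)
    return kept, dropped
```

Returning the dropped names lets the caller log a warning rather than discarding them silently.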