
GPT-4o Transcribe Diarize

vtrix-gpt-4o-transcribe-diarize

Speaker-aware speech-to-text model powered by GPT-4o. Produces transcripts with speaker labels, ideal for meetings, interviews, and multi-speaker conversations. Supports optional known speaker references for improved accuracy.

Authentication

authorization string required

All API requests require authentication with a Bearer token.

Get API Key:

Visit the API Key Management page to get your API key.

Usage:

Add the following header to each request:

Authorization: Bearer YOUR_API_KEY
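
As a minimal sketch of the header described above, the helper below builds the Authorization value from an API key. The function name auth_headers is hypothetical and is not part of any SDK; the resulting dictionary can be passed to whichever HTTP client you use.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Build the Bearer-token header shown above.

    `auth_headers` is a hypothetical helper name, not part of the API.
    """
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {"Authorization": f"Bearer {key}"}


if __name__ == "__main__":
    print(auth_headers("YOUR_API_KEY"))
```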

Parameters

file file required

Audio file to transcribe with speaker diarization.

Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm

File size limit: 25 MB
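
The format and size limits above can be checked locally before uploading. The sketch below is one way to do that; the helper name check_audio_file is hypothetical, the check looks only at the file extension rather than the actual audio content, and it assumes "25 MB" means 25 × 1024 × 1024 bytes, which this page does not specify.

```python
from pathlib import Path

# Formats listed in this reference.
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
# Assumption: "25 MB" is interpreted as 25 MiB; the page does not say which.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_audio_file(path: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(path).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


if __name__ == "__main__":
    print(check_audio_file("meeting.wav", 3_000_000))
    print(check_audio_file("meeting.flac", 40 * 1024 * 1024))
```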


model string required

Model ID to use for the request.

Value: vtrix-gpt-4o-transcribe-diarize


response_format string

Format of the output transcript.

Options: json, text, diarized_json

Default: json

The diarized_json format includes speaker segments with speaker, start, and end metadata. Use it when you need speaker labels in the response.


chunking_strategy string or object

Audio segmentation strategy. Required when the audio is longer than 30 seconds.

Options: auto (recommended), or a custom Voice Activity Detection (VAD) configuration

Default: null
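
Because the default is null and the parameter becomes required past 30 seconds, a client may want to fill it in automatically. The sketch below does this for a dictionary of form fields; the helper name with_chunking is hypothetical, and the audio duration is assumed to be known by the caller.

```python
def with_chunking(fields: dict, duration_seconds: float) -> dict:
    """Return a copy of `fields` with chunking_strategy set when it is required.

    `with_chunking` is a hypothetical helper. It chooses "auto", the
    recommended option above, and never overrides an explicit value.
    """
    out = dict(fields)
    if duration_seconds > 30 and out.get("chunking_strategy") is None:
        out["chunking_strategy"] = "auto"
    return out


if __name__ == "__main__":
    base = {"model": "vtrix-gpt-4o-transcribe-diarize"}
    print(with_chunking(base, 12.0))
    print(with_chunking(base, 95.0))
```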


known_speaker_names array

Array of known speaker names to map segments onto. Maximum 4 speakers supported.

Use with known_speaker_references to provide reference audio clips for each speaker.

Example: ["agent", "customer"]


known_speaker_references array

Array of reference audio clips encoded as data URLs. Each clip should be 2-10 seconds long.

Each entry must correspond to the entry at the same position in the known_speaker_names array. Supports the same audio formats as the main file.

Example: ["data:audio/wav;base64,AAA...", "data:audio/wav;base64,BBB..."]
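
A reference clip can be turned into a data URL with base64 encoding. The sketch below builds both known-speaker arrays from a mapping of name to raw audio bytes, keeping the two arrays in the same order. The helper names are hypothetical, and the audio/wav MIME type is an assumption that only holds for WAV clips.

```python
import base64


def to_data_url(audio_bytes: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def known_speaker_fields(clips: dict[str, bytes]) -> dict[str, list[str]]:
    """Build matching known_speaker_names / known_speaker_references arrays.

    Hypothetical helper: `clips` maps a speaker name to that speaker's
    reference audio. Clip length (2-10 seconds) is not checked here.
    """
    if len(clips) > 4:
        raise ValueError("at most 4 known speakers are supported")
    names = list(clips)
    references = [to_data_url(clips[name]) for name in names]
    return {"known_speaker_names": names, "known_speaker_references": references}


if __name__ == "__main__":
    fields = known_speaker_fields({"agent": b"\x00\x00\x00", "customer": b"\x01\x02\x03"})
    print(fields["known_speaker_names"])
    print(fields["known_speaker_references"][0])
```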


stream boolean

Whether to stream the transcription incrementally. When enabled, emits transcript.text.segment events for each completed segment.

Default: false
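
This page does not describe the wire format of the stream. The sketch below assumes server-sent events in which each "data:" line carries a JSON object with a "type" field, and that a "[DONE]" sentinel may appear; both are assumptions to verify against a real response. The function name segment_events is hypothetical.

```python
import json
from typing import Iterable, Iterator


def segment_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded transcript.text.segment events from streamed text lines.

    Assumptions (not confirmed by this page): the stream is SSE-style,
    each payload line starts with "data:", and the JSON has a "type" key.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and "event:" lines
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        event = json.loads(payload)
        if event.get("type") == "transcript.text.segment":
            yield event


if __name__ == "__main__":
    sample = [
        'data: {"type": "transcript.text.segment", "text": "Hello"}',
        "",
        "data: [DONE]",
    ]
    for event in segment_events(sample):
        print(event)
```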


language string

Language of the input audio as an ISO 639-1 or ISO 639-3 code. Providing the input language improves accuracy and latency.

Examples: en (English), zh (Chinese), ja (Japanese), es (Spanish)


Response Format

text string

The complete transcribed text from the audio file without speaker labels.


segments array

Array of speaker-labeled segments. Only present when response_format is diarized_json.

speaker string

Speaker identifier. Format: speaker_1, speaker_2, etc., or the known speaker name if one was provided.

text string

Transcribed text for this segment.

start number

Segment start time in seconds.

end number

Segment end time in seconds.
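
To illustrate how the segments array might be consumed, the sketch below renders a diarized_json response as a readable transcript and totals speaking time per speaker. It relies only on the speaker, text, start, and end fields documented above; the function names and the sample response are hypothetical.

```python
def format_transcript(response: dict) -> str:
    """Render each segment as '[start-end] speaker: text', one per line."""
    lines = []
    for seg in response.get("segments", []):
        lines.append(
            f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['speaker']}: {seg['text'].strip()}"
        )
    return "\n".join(lines)


def talk_time(response: dict) -> dict[str, float]:
    """Sum segment durations (in seconds) per speaker."""
    totals: dict[str, float] = {}
    for seg in response.get("segments", []):
        duration = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + duration
    return totals


if __name__ == "__main__":
    # Hypothetical response shaped like the fields described above.
    sample = {
        "text": "Hello. Hi there. How can I help?",
        "segments": [
            {"speaker": "speaker_1", "text": "Hello.", "start": 0.0, "end": 2.5},
            {"speaker": "speaker_2", "text": "Hi there.", "start": 2.5, "end": 4.0},
            {"speaker": "speaker_1", "text": "How can I help?", "start": 4.0, "end": 5.0},
        ],
    }
    print(format_transcript(sample))
    print(talk_time(sample))
```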


Supported Languages

Supports 98 languages including: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

Notes

vtrix-gpt-4o-transcribe-diarize does not support the following parameters:

  • prompt - Not available for diarized transcription
  • logprobs - Not available for diarized transcription
  • timestamp_granularities[] - Not available for diarized transcription
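
If request parameters are shared with other transcription models, the unsupported ones can be filtered out before sending. The sketch below is one way to do that; drop_unsupported is a hypothetical helper, and how the API reacts when these parameters are sent anyway is not stated on this page.

```python
# Parameter names exactly as listed in the notes above.
UNSUPPORTED = ("prompt", "logprobs", "timestamp_granularities[]")


def drop_unsupported(params: dict) -> tuple[dict, list[str]]:
    """Split params into (kept, dropped_names) for the diarize model."""
    dropped = [name for name in params if name in UNSUPPORTED]
    kept = {name: value for name, value in params.items() if name not in UNSUPPORTED}
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = drop_unsupported(
        {
            "model": "vtrix-gpt-4o-transcribe-diarize",
            "response_format": "diarized_json",
            "prompt": "Meeting about Q3 planning",
        }
    )
    print(kept)
    print(dropped)
```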