Minimax Speech 2.8 Hd | Vtrix API Docs

Authentication

authorization `string` required

All APIs require authentication via Bearer Token.

Get API Key:

Visit the API Key Management Page to get your API key

Usage:

Add to request header:

Authorization: Bearer YOUR_API_KEY
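
As a minimal sketch of the header described above, the helper below builds the request headers in Python. The function name `build_headers` and the `Content-Type` entry are illustrative assumptions and are not taken from this page.

```python
def build_headers(api_key: str) -> dict:
    """Return HTTP headers carrying the Bearer token."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        "Authorization": f"Bearer {api_key}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }


headers = build_headers("YOUR_API_KEY")
```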

Parameters

model `string` required

Model ID to use for the request

Value: minimax_speech_28_hd

text `string` required

Text to convert to speech. Must be fewer than 10,000 characters

For text longer than 3,000 characters, streaming output is recommended

Paragraph switching: Use line breaks
Pause control: Add <#x#> markers in text where x is pause duration in seconds. Range [0.01, 99.99], maximum 2 decimal places. Pause markers must be placed between readable text segments and cannot be used consecutively

Sound effect tags: Insert any of the following tags in the text to produce the corresponding sound effect.
(laughs): laughter
(chuckle): light laugh
(coughs): cough
(clear-throat): throat clearing
(groans): groan
(breath): normal breathing
(pant): panting
(inhale): inhale
(exhale): exhale
(gasps): gasp
(sniffs): sniff
(sighs): sigh
(snorts): snort
(burps): burp
(lip-smacking): lip smacking
(humming): humming
(hissing): hissing
(emm): um
(sneezes): sneeze
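
The pause-marker rules above can be checked on the client before a request is sent. The following is a sketch only: the helper names `pause` and `has_consecutive_pauses` are hypothetical, and treating markers separated only by whitespace as "consecutive" is an assumption about how the service interprets the rule.

```python
import re

# Matches two pause markers with nothing but whitespace between them.
_CONSECUTIVE = re.compile(r"<#[\d.]+#>\s*<#[\d.]+#>")


def pause(seconds: float) -> str:
    """Format a pause marker such as <#0.50#>."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be within [0.01, 99.99] seconds")
    if round(seconds, 2) != seconds:
        raise ValueError("pause allows at most 2 decimal places")
    return f"<#{seconds:.2f}#>"


def has_consecutive_pauses(text: str) -> bool:
    """Return True if two pause markers appear back to back."""
    return _CONSECUTIVE.search(text) is not None


text = f"Hello{pause(0.5)}world (laughs)"
```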

stream `boolean`

Whether to enable streaming output

Default: false

stream_options `object`

Streaming configuration

exclude_aggregated_audio boolean

Whether to exclude the concatenated audio hex data from the last chunk

Default: false (the last chunk contains the complete concatenated audio hex data)

output_format `string`

Audio output format

Options: url (return audio file URL), hex (return hexadecimal audio data)

Default: url
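
The top-level parameters described so far can be assembled into a request body as sketched below. The function name `build_payload` is hypothetical, switching to streaming automatically above 3,000 characters is one way to follow the recommendation rather than documented behavior, and counting characters with Python's `len` is an assumption. No endpoint URL is shown because this page does not state one.

```python
import json


def build_payload(text, stream=None, output_format="url",
                  exclude_aggregated_audio=False):
    """Build a request body from the documented top-level parameters."""
    if not text or len(text) >= 10_000:
        raise ValueError("text must contain 1 to 9,999 characters")
    if output_format not in ("url", "hex"):
        raise ValueError("output_format must be 'url' or 'hex'")
    if stream is None:
        # Assumption: follow the streaming recommendation for long text.
        stream = len(text) > 3000
    payload = {
        "model": "minimax_speech_28_hd",
        "text": text,
        "stream": stream,
        "output_format": output_format,
    }
    if stream:
        payload["stream_options"] = {
            "exclude_aggregated_audio": exclude_aggregated_audio
        }
    return payload


body = json.dumps(build_payload("Hello, world"))
```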

voice_setting `object`

Voice settings including voice ID, speed, volume, pitch

voice_id string

Voice ID for synthesis. If using mixed voices, set timbre_weights parameter and leave this empty

Supports system voices, cloned voices, and text-generated voices. See System Voice List

speed number

Speech rate. Higher value means faster speech

Range: 0.5 - 2

Default: 1.0

vol number

Audio volume. Higher value means louder volume

Range: 0 - 10

Default: 1.0

pitch integer

Audio pitch

Range: -12 - 12

Default: 0 (original voice output)

emotion string

Emotion control for synthesized speech. The model automatically matches an appropriate emotion to the input text, so manual specification is generally not needed

Options: happy, sad, angry, fearful, disgusted, surprised, calm (neutral), fluent (vivid)
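
A client-side check of the ranges above might look like the following sketch. The function name `build_voice_setting` is hypothetical, and treating every range as inclusive at both ends is an assumption.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful",
            "disgusted", "surprised", "calm", "fluent"}


def build_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0, emotion=None):
    """Validate the documented ranges and return a voice_setting object."""
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be within 0.5 - 2")
    if not 0 <= vol <= 10:
        raise ValueError("vol must be within 0 - 10")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer within -12 - 12")
    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unsupported emotion: {emotion}")
        setting["emotion"] = emotion
    return setting


voice_setting = build_voice_setting("female-chengshu", speed=1.2, emotion="calm")
```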

text_normalization boolean

Whether to enable Chinese and English text normalization. Enabling it improves how numbers are read, at the cost of slightly higher latency

Default: false

latex_read boolean

Whether to read LaTeX formulas

Default: false

Note:
Only Chinese is supported. When enabled, the language_boost parameter is set to Chinese
Formulas in the request must begin and end with $$
Each \ in a formula must be escaped as \\. Example: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$
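
If the escaping requirement refers to ordinary JSON string escaping (an assumption, since this page does not say so explicitly), a JSON serializer handles it automatically. The sketch below also assumes that latex_read sits inside voice_setting, as the parameter listing suggests.

```python
import json

# Raw string: each backslash here is a single character.
formula = r"$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$"

payload = {
    "model": "minimax_speech_28_hd",
    "text": formula,
    "voice_setting": {"latex_read": True},
}

# json.dumps writes every backslash as \\ in the serialized body.
body = json.dumps(payload)
```

Building the body by hand instead of using a serializer would require writing the doubled backslashes manually.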

audio_setting `object`

Audio configuration

sample_rate integer

Sample rate for generated audio

Options: 8000, 16000, 22050, 24000, 32000, 44100

Default: 32000

bitrate string

Bitrate for generated audio. Only effective for mp3 format

Options: 32000, 64000, 128000, 256000

Default: 128000

format string

Format for generated audio. wav only supported in non-streaming output

Options: mp3, pcm, flac, wav

channel integer

Number of audio channels

Options: 1 (mono), 2 (stereo)

Default: 1

force_cbr boolean

Constant bitrate (CBR) control. When set to true, audio will be encoded at constant bitrate

Default: false

Note: This parameter only takes effect for streaming output with format set to mp3
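
The interactions between the audio_setting fields can be checked before sending a request. This is a sketch under assumptions: the function name `build_audio_setting` is hypothetical, and because bitrate is typed as a string here while its options are numeric, the value is passed through unchanged in whichever form the caller supplies.

```python
FORMATS = {"mp3", "pcm", "flac", "wav"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}


def build_audio_setting(fmt, stream=False, sample_rate=32000,
                        bitrate=128000, channel=1, force_cbr=False):
    """Validate audio options and drop fields that would have no effect."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if fmt == "wav" and stream:
        raise ValueError("wav is only supported in non-streaming output")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")
    setting = {"format": fmt, "sample_rate": sample_rate, "channel": channel}
    if fmt == "mp3":
        # bitrate is only effective for mp3.
        if int(bitrate) not in BITRATES:
            raise ValueError(f"unsupported bitrate: {bitrate}")
        setting["bitrate"] = bitrate
        # force_cbr only takes effect for streaming mp3 output.
        if stream and force_cbr:
            setting["force_cbr"] = True
    return setting
```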

pronunciation_dict `object`

Pronunciation dictionary for custom word pronunciation

tone array

Define pronunciation rules for specific characters or symbols that need special annotation

For Chinese text, tones are represented by numbers: 1 for first tone, 2 for second tone, 3 for third tone, 4 for fourth tone, 5 for neutral tone

Examples: ["燕少飞/(yan4)(shao3)(fei1)", "omg/oh my god"]
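
Entries in the tone array follow a "text/pronunciation" pattern, which can be generated rather than typed by hand. The helper names `pinyin` and `tone_entry` below are hypothetical, and rejecting a slash in the source text is an assumption based on the separator used in the examples.

```python
def pinyin(*syllables):
    """Render (syllable, tone) pairs as (yan4)(shao3)(fei1)."""
    parts = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


def tone_entry(text, pronunciation):
    """Join source text and its pronunciation with the / separator."""
    if "/" in text:
        raise ValueError("source text must not contain '/'")
    return f"{text}/{pronunciation}"


pronunciation_dict = {
    "tone": [
        tone_entry("燕少飞", pinyin(("yan", 4), ("shao", 3), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
```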

timbre_weights `array`

Mixed voice configuration. Maximum 4 voices supported

Each item contains:

voice_id string

Voice ID for synthesis. Must be filled together with weight parameter

Supports system voices, cloned voices, and text-generated voices

weight integer

Weight for each voice. Must be filled together with voice_id

Range: 1 - 100

Higher weight means higher similarity to that voice

Example:
"timbre_weights": [
  {
    "voice_id": "female-chengshu",
    "weight": 30
  },
  {
    "voice_id": "female-tianmei",
    "weight": 70
  }
]
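
The mixed-voice array can be generated from a simple mapping while enforcing the limits above. The function name `build_voice_mix` is hypothetical, and requiring at least one entry is an assumption.

```python
def build_voice_mix(weights):
    """Turn {voice_id: weight} into the documented list of objects."""
    if not 1 <= len(weights) <= 4:
        raise ValueError("between 1 and 4 voices are supported")
    mix = []
    for voice_id, weight in weights.items():
        if not voice_id:
            raise ValueError("voice_id must be filled together with weight")
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError("weight must be an integer within 1 - 100")
        mix.append({"voice_id": voice_id, "weight": weight})
    return mix


mix = build_voice_mix({"female-chengshu": 30, "female-tianmei": 70})
```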

language_boost `string`

Language or dialect whose recognition should be enhanced. Set to auto for automatic detection

Default: null

Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto

voice_modify `object`

Voice effect processor settings. Supported audio formats: non-streaming: mp3, wav, flac; streaming: mp3

pitch integer

Pitch adjustment (deep/bright)

Range: -100 - 100

Values closer to -100 sound deeper, closer to 100 sound brighter

intensity integer

Intensity adjustment (power/softness)

Range: -100 - 100

Values closer to -100 sound more powerful, closer to 100 sound softer

timbre integer

Timbre adjustment (mellow/crisp)

Range: -100 - 100

Values closer to -100 sound more mellow, closer to 100 sound crisper

sound_effects string

Sound effects setting. Only one can be selected per request

Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic)
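
The voice_modify constraints, including the format restriction for streaming and non-streaming output, can be validated as sketched below. The function name `build_voice_modify` is hypothetical, and omitting unset fields is a design choice because no defaults are documented here.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo",
                 "lofi_telephone", "robotic"}


def build_voice_modify(audio_format, stream=False, pitch=None,
                       intensity=None, timbre=None, sound_effects=None):
    """Validate voice effect settings against the documented limits."""
    allowed = {"mp3"} if stream else {"mp3", "wav", "flac"}
    if audio_format not in allowed:
        raise ValueError(f"voice_modify does not support {audio_format} here")
    modify = {}
    for name, value in (("pitch", pitch), ("intensity", intensity),
                        ("timbre", timbre)):
        if value is None:
            continue  # leave unset fields out of the request
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be within -100 - 100")
        modify[name] = value
    if sound_effects is not None:
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects}")
        modify["sound_effects"] = sound_effects
    return modify
```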

subtitle_enable `boolean`

Whether to enable subtitle service. Only effective in non-streaming output scenarios

Default: false

aigc_watermark `boolean`

Whether to add audio rhythm identifier at the end of synthesized audio. Only effective for non-streaming synthesis

Default: false

Polling

Since audio generation takes time, you need to poll the task status after creation

The initial response returns the task ID and initial status. The actual generation results must be obtained through polling the task status endpoint
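
A polling loop along these lines is sketched below. The status endpoint is not given on this page, so the sketch takes a caller-supplied `fetch_status` function that returns the task as a dict. The name `poll_until_done`, the `status` key, the interval, and the timeout are all assumptions; only the `completed` and `failed` status values come from this page.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=120.0,
                    sleep=time.sleep):
    """Call fetch_status() until the task is completed or failed."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":
            err = task.get("error") or {}
            raise RuntimeError(
                f"task failed: {err.get('code')} {err.get('error_message')}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.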

Response Format

error `object`

Error information. Only present when status is failed.

code integer

Error code.

error_message string

Detailed error message.

output `array`

Generation results. Only present when status is completed.

content array

List of generated audio content

type string

Resource type, fixed as audio

url string

Audio file URL (when output_format is url)

data string

Audio hexadecimal data (when output_format is hex)

format string

Data format (used in streaming output)

index integer

Data chunk index (used in streaming output)

size integer

Data chunk size (used in streaming output)
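
Once a task is completed, the audio can be pulled out of the response as sketched below. The nesting assumed here (each item of output holding a content list) is inferred from the field listing, and the function name `extract_audio` is hypothetical. Hex data is decoded with the standard `bytes.fromhex`.

```python
def extract_audio(task):
    """Return a list of audio URLs (str) and decoded hex payloads (bytes)."""
    results = []
    for item in task.get("output", []):
        for part in item.get("content", []):
            if part.get("type") != "audio":
                continue
            if part.get("url"):
                results.append(part["url"])
            elif part.get("data"):
                # output_format "hex": convert the hex string to raw bytes.
                results.append(bytes.fromhex(part["data"]))
    return results
```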

usage `object`

Usage statistics. Only present when status is completed

metadata `object`

Metadata information

Error Codes

Error Code	Description
010008095	Internal generation error
010008096	Result parsing error
010008097	HTTP error response
010008099	Synchronous generation error
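
For logging or user-facing messages, the table above can be turned into a lookup. The codes are listed with a leading zero while the code field is typed as an integer, so this sketch pads the value back to nine digits before the lookup. That normalization and the function name `describe_error` are assumptions.

```python
ERROR_CODES = {
    "010008095": "Internal generation error",
    "010008096": "Result parsing error",
    "010008097": "HTTP error response",
    "010008099": "Synchronous generation error",
}


def describe_error(code):
    """Map an error code (int or str) to its documented description."""
    key = str(code).zfill(9)  # restore a leading zero lost in an integer
    return ERROR_CODES.get(key, f"Unknown error code {key}")
```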
