Authentication
authorization string required
All APIs require authentication via Bearer Token.
Get API Key:
Visit the API Key Management page to obtain your API key
Usage:
Add to request header:
Authorization: Bearer YOUR_API_KEY
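As a minimal sketch of the header described above, the helper below builds the request headers in Python. The function name `build_auth_headers` is hypothetical, and the `Content-Type` value assumes a JSON request body, which this section does not state.

```python
def build_auth_headers(api_key: str) -> dict[str, str]:
    """Return request headers carrying the API key as a Bearer token."""
    if not api_key or not api_key.strip():
        raise ValueError("api_key must be a non-empty string")
    return {
        "Authorization": f"Bearer {api_key.strip()}",
        # Assumption: the request body is JSON; the content type is not stated here.
        "Content-Type": "application/json",
    }
```

Pass the returned dictionary as the headers of whichever HTTP client you use.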
Parameters
model string required
Model ID to use for the request
Value: minimax_speech_28_hd
text string required
Text to convert to speech. Must be fewer than 10,000 characters
For text longer than 3,000 characters, streaming output is recommended
Paragraph switching: Use line breaks
Pause control: Add <#x#> markers in text where x is pause duration in seconds. Range [0.01, 99.99], maximum 2 decimal places. Pause markers must be placed between readable text segments and cannot be used consecutively
Sound effect tags: Supports inserting sound effect tags in text. Supported tags and their meanings: (laughs) laughter; (chuckle) light laugh; (coughs) cough; (clear-throat) throat clearing; (groans) groan; (breath) normal breathing; (pant) panting; (inhale) inhale; (exhale) exhale; (gasps) gasp; (sniffs) sniff; (sighs) sigh; (snorts) snort; (burps) burp; (lip-smacking) lip smacking; (humming) humming; (hissing) hissing; (emm) um; (sneezes) sneeze
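The pause marker rules above (range [0.01, 99.99], at most 2 decimal places, placed between readable segments, never consecutive) can be enforced on the client side. The sketch below is one possible way to do so. `pause_marker` and `join_with_pause` are hypothetical helper names, and rounding to two decimals (rather than rejecting extra precision) is a choice made here, not something the API prescribes.

```python
from decimal import Decimal, ROUND_HALF_UP

MIN_PAUSE = Decimal("0.01")
MAX_PAUSE = Decimal("99.99")


def pause_marker(seconds: float) -> str:
    """Format a pause marker such as <#1.50#>, rounded to two decimal places."""
    value = Decimal(str(seconds)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if not MIN_PAUSE <= value <= MAX_PAUSE:
        raise ValueError("pause duration must be within [0.01, 99.99] seconds")
    return f"<#{value}#>"


def join_with_pause(segments: list[str], seconds: float) -> str:
    """Join readable segments with one pause marker between each pair.

    Empty segments are dropped, so markers never end up consecutive
    and never appear at the start or end of the text.
    """
    readable = [s.strip() for s in segments if s.strip()]
    if not readable:
        raise ValueError("at least one non-empty text segment is required")
    return pause_marker(seconds).join(readable)
```

For example, `join_with_pause(["Hello.", "World."], 1.5)` yields `Hello.<#1.50#>World.`.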
stream boolean
Whether to enable streaming output
Default: false
stream_options object
Streaming configuration
exclude_aggregated_audio boolean
Whether to exclude the concatenated audio hex data from the last chunk
Default: false (last chunk contains complete concatenated audio hex data)
output_format string
Audio output format
Options: url (return audio file URL), hex (return hexadecimal audio data)
Default: url
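When `output_format` is `hex`, the audio arrives as a hexadecimal string that must be decoded before it can be played or saved. A minimal sketch using only the Python standard library follows. The helper names are hypothetical.

```python
from pathlib import Path


def hex_to_audio_bytes(hex_data: str) -> bytes:
    """Decode hexadecimal audio data into raw bytes."""
    return bytes.fromhex(hex_data.strip())


def save_hex_audio(hex_data: str, path: str) -> int:
    """Decode the hex data, write it to `path`, and return the byte count."""
    return Path(path).write_bytes(hex_to_audio_bytes(hex_data))
```

The file extension passed to `save_hex_audio` should match the `format` requested in `audio_setting`.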
voice_setting object
Voice settings including voice ID, speed, volume, pitch
voice_id string
Voice ID for synthesis. If using mixed voices, set the timbre_weights parameter and leave this empty
Supports system voices, cloned voices, and text-generated voices. See System Voice List
speed number
Speech rate. Higher value means faster speech
Range: [0.5, 2]
Default: 1.0
vol number
Audio volume. Higher value means louder volume
Range: [0, 10]
Default: 1.0
pitch integer
Audio pitch
Range: [-12, 12]
Default: 0 (original voice output)
emotion string
Emotion control for synthesized speech. The model automatically matches appropriate emotions to the input text, so manual specification is generally not needed
Options: happy, sad, angry, fearful, disgusted, surprised, calm, fluent (corresponding to 8 emotions: happy, sad, angry, fearful, disgusted, surprised, neutral, vivid)
text_normalization boolean
Whether to enable Chinese and English text normalization. When enabled, improves performance in number reading scenarios but slightly increases latency
Default: false
latex_read boolean
Whether to read LaTeX formulas
Default: false
Note:
Only Chinese is supported. When enabled, the language_boost parameter is set to Chinese
Formulas in the request must start and end with $$
Each \ in a formula must be escaped as \\. Example: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$
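The backslash escaping rule above is what a standard JSON encoder already does when serializing a string, so in Python it is usually enough to write the formula with single backslashes and let `json.dumps` produce the escaped form. The sketch below illustrates this. `build_latex_payload` is a hypothetical name, and nesting `latex_read` under `voice_setting` follows the order of this listing and is an assumption.

```python
import json


def build_latex_payload(text: str) -> str:
    """Serialize text containing $$...$$ formulas into a JSON request body."""
    delimiters = text.count("$$")
    if delimiters == 0 or delimiters % 2 != 0:
        raise ValueError("formulas must start and end with $$")
    body = {
        "text": text,
        # Assumption: latex_read belongs to voice_setting, as the listing suggests.
        "voice_setting": {"latex_read": True},
    }
    # json.dumps turns each backslash into \\ in the serialized output.
    return json.dumps(body, ensure_ascii=False)
```

Given the raw string `r"$$x = \frac{1}{2}$$"`, the serialized body contains `$$x = \\frac{1}{2}$$`. Escaping by hand is only needed when the JSON is written out literally.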
audio_setting object
Audio configuration
sample_rate integer
Sample rate for generated audio
Options: 8000, 16000, 22050, 24000, 32000, 44100
Default: 32000
bitrate string
Bitrate for generated audio. Only effective for mp3 format
Options: 32000, 64000, 128000, 256000
Default: 128000
format string
Format for generated audio. wav is only supported in non-streaming output
Options: mp3, pcm, flac, wav
channel integer
Number of audio channels
Options: 1 (mono), 2 (stereo)
Default: 1
force_cbr boolean
Constant bitrate (CBR) control. When set to true, audio will be encoded at constant bitrate
Default: false
Note: This parameter only takes effect when audio is set to streaming output and format is mp3
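The constraints on `audio_setting` interact with one another: `wav` is unavailable when streaming, `bitrate` matters only for `mp3`, and `force_cbr` matters only for streamed `mp3`. The sketch below is one way to check them before sending a request. `build_audio_setting` is a hypothetical helper. It passes `bitrate` as an integer even though the listing types it as a string, so adjust this if the API rejects it.

```python
SAMPLE_RATES = (8000, 16000, 22050, 24000, 32000, 44100)
BITRATES = (32000, 64000, 128000, 256000)
FORMATS = ("mp3", "pcm", "flac", "wav")


def build_audio_setting(audio_format: str, *, stream: bool = False,
                        sample_rate: int = 32000, bitrate: int = 128000,
                        channel: int = 1, force_cbr: bool = False) -> dict:
    """Validate the documented options and return an audio_setting object."""
    if audio_format not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    if audio_format == "wav" and stream:
        raise ValueError("wav is only supported in non-streaming output")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SAMPLE_RATES}")
    if channel not in (1, 2):
        raise ValueError("channel must be 1 (mono) or 2 (stereo)")

    setting = {"format": audio_format, "sample_rate": sample_rate, "channel": channel}
    if audio_format == "mp3":
        # bitrate is only effective for mp3, so it is omitted for other formats.
        if bitrate not in BITRATES:
            raise ValueError(f"bitrate must be one of {BITRATES}")
        setting["bitrate"] = bitrate
        # force_cbr only takes effect for streamed mp3 output.
        if stream and force_cbr:
            setting["force_cbr"] = True
    return setting
```

Omitting fields that would have no effect keeps the request small and makes the effective configuration explicit.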
pronunciation_dict object
Pronunciation dictionary for custom word pronunciation
tone array
Define pronunciation rules for specific characters or symbols that need special annotation
For Chinese text, tones are represented by numbers: 1 for first tone, 2 for second tone, 3 for third tone, 4 for fourth tone, 5 for neutral tone
Examples:
["燕少飞/(yan4)(shao3)(fei1)", "omg/oh my god"]
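Entries in `tone` follow the pattern `source/replacement`, as the examples above show. The sketch below builds such entries programmatically. `pinyin` and `tone_rule` are hypothetical helper names, and the pattern is inferred from the two examples rather than from a formal grammar.

```python
def pinyin(syllables: list[tuple[str, int]]) -> str:
    """Render (syllable, tone) pairs as annotations such as (yan4)(shao3)."""
    parts = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


def tone_rule(source: str, pronunciation: str) -> str:
    """Combine the original text and its pronunciation into one rule string."""
    if not source or not pronunciation:
        raise ValueError("source and pronunciation must be non-empty")
    return f"{source}/{pronunciation}"


pronunciation_dict = {
    "tone": [
        tone_rule("燕少飞", pinyin([("yan", 4), ("shao", 3), ("fei", 1)])),
        tone_rule("omg", "oh my god"),
    ]
}
```

The resulting `pronunciation_dict["tone"]` equals the example list shown above.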
timbre_weights array
Mixed voice configuration. Maximum 4 voices supported
Each item contains:
voice_id string
Voice ID for synthesis. Must be filled together with weight parameter
Supports system voices, cloned voices, and text-generated voices
weight integer
Weight for each voice. Must be filled together with voice_id
Range: [1, 100]. Higher weight means higher similarity to that voice
Example:
"timbre_weights": [ { "voice_id": "female-chengshu", "weight": 30 }, { "voice_id": "female-tianmei", "weight": 70 } ]
language_boost string
Enhances recognition of the specified minority language or dialect. Set to auto for automatic detection
Default: null
Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto
voice_modify object
Voice effect processor settings. Supported audio formats: non-streaming: mp3, wav, flac; streaming: mp3
pitch integer
Pitch adjustment (low/bright)
Range: [-100, 100]. Values closer to -100 sound deeper; values closer to 100 sound brighter
intensity integer
Intensity adjustment (power/softness)
Range: [-100, 100]. Values closer to -100 sound more powerful; values closer to 100 sound softer
timbre integer
Timbre adjustment (magnetic/crisp)
Range: [-100, 100]. Values closer to -100 sound more mellow; values closer to 100 sound crisper
sound_effects string
Sound effects setting. Only one can be selected per request
Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic)
subtitle_enable boolean
Whether to enable subtitle service. Only effective in non-streaming output scenarios
Default: false
aigc_watermark boolean
Whether to add audio rhythm identifier at the end of synthesized audio. Only effective for non-streaming synthesis
Default: false
Polling
Since audio generation takes time, you need to poll the task status after creation
The initial response returns the task ID and initial status. The actual generation results must be obtained through polling the task status endpoint
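The polling flow described above can be sketched as a small loop. Because this section does not give the task status endpoint, the example takes a caller-supplied `fetch_status` function that performs the HTTP request and returns the parsed response. `poll_until_done`, the interval, and the timeout are all assumptions. Only the `completed` and `failed` status values come from the Response Format section, and any other status is treated as still in progress.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], dict], *, interval: float = 2.0,
                    timeout: float = 120.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status repeatedly until the task completes or fails."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":
            error = task.get("error") or {}
            raise RuntimeError(
                f"task failed: {error.get('code')} {error.get('error_message')}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        # Any other status is treated as still in progress.
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to test without network access.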
Response Format
error object
Error information. Only present when status is failed.
code integer
Error code.
error_message string
Detailed error message.
output array
Generation results. Only present when status is completed.
content array
List of generated audio content
type string
Resource type, fixed as audio
url string
Audio file URL (when output_format is url)
data string
Audio hexadecimal data (when output_format is hex)
format string
Data format (used in streaming output)
index integer
Data chunk index (used in streaming output)
size integer
Data chunk size (used in streaming output)
usage object
Usage statistics. Only present when status is completed
metadata object
Metadata information
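Once a task is completed, the audio can be pulled out of the response. The sketch below assumes that `output` is a list of objects, each holding a `content` list of items with `type` and either `url` or `data`. That nesting is inferred from the field listing above and may differ from the actual payload. `extract_audio` is a hypothetical helper.

```python
def extract_audio(task: dict) -> list[dict]:
    """Collect audio URLs or decoded bytes from a completed task response."""
    results = []
    for item in task.get("output") or []:
        for content in item.get("content") or []:
            if content.get("type") != "audio":
                continue
            if content.get("url"):
                # Returned when output_format is url.
                results.append({"url": content["url"]})
            elif content.get("data"):
                # Returned when output_format is hex; decode to raw bytes.
                results.append({"bytes": bytes.fromhex(content["data"])})
    return results
```

Each returned entry carries either a `url` key or a `bytes` key, so callers can download or write the audio accordingly.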
Error Codes
| Error Code | Description |
|---|---|
| 010008095 | Internal generation error |
| 010008096 | Result parsing error |
| 010008097 | HTTP error response |
| 010008099 | Synchronous generation error |