Minimax Speech 2.8 Turbo

minimax_speech_28_turbo

Seamless speed meets natural flow

Authentication

authorization string required

All APIs require authentication via Bearer Token.

Get API Key:

Visit the API Key Management page to get your API key

Usage:

Add to request header:

Authorization: Bearer YOUR_API_KEY
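
A minimal sketch of building the request headers in Python. The environment variable name SPEECH_API_KEY and the Content-Type header are assumptions for illustration; only the Authorization header is specified above.

```python
import os


def build_headers(api_key: str) -> dict:
    """Return HTTP headers that carry the Bearer token."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }


# SPEECH_API_KEY is a hypothetical variable name; use whatever your deployment defines.
headers = build_headers(os.environ.get("SPEECH_API_KEY", "YOUR_API_KEY"))
```

Reading the key from the environment keeps it out of source control.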

Parameters

model string required

Model ID to use for the request

Value: minimax_speech_28_turbo


text string required

Text to convert to speech. Must be shorter than 10,000 characters

For text longer than 3,000 characters, streaming output is recommended

Paragraph switching: Use line breaks
Pause control: Insert <#x#> markers in the text, where x is the pause duration in seconds. Range: [0.01, 99.99], with at most 2 decimal places. A pause marker must sit between two readable text segments, and two markers cannot be used consecutively

Sound effect tags: Sound effect tags can be inserted in the text. Supported tags and their meanings: (laughs): laughter; (chuckle): light laugh; (coughs): cough; (clear-throat): throat clearing; (groans): groan; (breath): normal breathing; (pant): panting; (inhale): inhale; (exhale): exhale; (gasps): gasp; (sniffs): sniff; (sighs): sigh; (snorts): snort; (burps): burp; (lip-smacking): lip smacking; (humming): humming; (hissing): hissing; (emm): um; (sneezes): sneeze
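
The pause rules above can be checked on the client before sending a request. The sketch below is illustrative only: the helper names pause and check_pauses are hypothetical, and treating markers separated only by whitespace as "consecutive" is an assumption, not documented behavior.

```python
import re

MARKER = r"<#[^#]*#>"


def pause(seconds: float) -> str:
    """Format a pause marker such as <#1.50#>."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be within [0.01, 99.99] seconds")
    if round(seconds, 2) != seconds:
        raise ValueError("pause allows at most 2 decimal places")
    return f"<#{seconds:.2f}#>"


def check_pauses(text: str) -> list:
    """Return a list of rule violations found in the text (empty if none)."""
    problems = []
    # Assumption: markers separated only by whitespace count as consecutive.
    if re.search(MARKER + r"\s*" + MARKER, text):
        problems.append("consecutive pause markers")
    stripped = text.strip()
    if re.match(MARKER, stripped) or re.search(MARKER + r"$", stripped):
        problems.append("pause marker at start or end of text")
    return problems


text = "Welcome back." + pause(0.5) + "(sighs) Let's begin."
```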


stream boolean

Whether to enable streaming output

Default: false


stream_options object

Streaming configuration

exclude_aggregated_audio boolean

Whether to exclude the concatenated audio hex data from the last chunk

Default: false (the last chunk contains the complete concatenated audio hex data)


output_format string

Audio output format

Options: url (return audio file URL), hex (return hexadecimal audio data)

Default: url
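
As an illustration of how these top-level parameters fit together, the sketch below builds a request body and switches to streaming for long text, following the 3,000-character recommendation. It assumes the parameters are sent as a flat JSON object; build_payload is a hypothetical helper, and setting exclude_aggregated_audio to true is just one possible choice.

```python
MODEL_ID = "minimax_speech_28_turbo"


def build_payload(text: str, output_format: str = "url") -> dict:
    """Build a request body, enabling streaming for text over 3,000 characters."""
    if len(text) >= 10_000:
        raise ValueError("text must be shorter than 10,000 characters")
    if output_format not in ("url", "hex"):
        raise ValueError("output_format must be 'url' or 'hex'")
    payload = {"model": MODEL_ID, "text": text, "output_format": output_format}
    if len(text) > 3_000:
        payload["stream"] = True
        # Choice made for this sketch: skip the aggregated audio in the last chunk.
        payload["stream_options"] = {"exclude_aggregated_audio": True}
    return payload
```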


voice_setting object

Voice settings including voice ID, speed, volume, pitch

voice_id string

Voice ID for synthesis. If using mixed voices, set timbre_weights parameter and leave this empty

Supports system voices, cloned voices, and text-generated voices. See System Voice List

speed number

Speech rate. Higher value means faster speech

Range: 0.5 - 2

Default: 1.0

vol number

Audio volume. Higher value means louder volume

Range: 0 - 10

Default: 1.0

pitch integer

Audio pitch

Range: -12 - 12

Default: 0 (original voice output)

emotion string

Emotion control for synthesized speech. The model automatically matches an appropriate emotion to the input text, so manual specification is generally not needed

Options: happy, sad, angry, fearful, disgusted, surprised, calm (neutral), fluent (vivid)
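
A hedged sketch of validating the voice settings listed so far. make_voice_setting is a hypothetical helper, and treating the documented ranges as inclusive at both ends is an assumption.

```python
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted", "surprised", "calm", "fluent"}


def make_voice_setting(voice_id, speed=1.0, vol=1.0, pitch=0, emotion=None) -> dict:
    """Build a voice_setting object after checking the documented ranges."""
    # Assumption: range bounds are inclusive.
    if not 0.5 <= speed <= 2:
        raise ValueError("speed must be within 0.5 - 2")
    if not 0 <= vol <= 10:
        raise ValueError("vol must be within 0 - 10")
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer within -12 - 12")
    if emotion is not None and emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion}")
    setting = {"voice_id": voice_id, "speed": speed, "vol": vol, "pitch": pitch}
    if emotion is not None:
        # Omitted by default so the model can pick the emotion itself.
        setting["emotion"] = emotion
    return setting
```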

text_normalization boolean

Whether to enable Chinese and English text normalization. When enabled, improves performance in number reading scenarios but slightly increases latency

Default: false

latex_read boolean

Whether to read LaTeX formulas

Default: false

Note:
Only Chinese is supported. When enabled, the language_boost parameter is set to Chinese
Formulas in the request must be wrapped in $$ at the start and end
\ in formulas must be escaped to \\. Example: $$x = \\frac{-b \\pm \\sqrt{b^2 - 4ac}}{2a}$$
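
When the request body is produced with a JSON serializer, the backslash escaping happens automatically. The sketch below uses Python's json module; placing latex_read inside voice_setting follows the parameter order on this page and should be treated as an assumption, and wrap_formula is a hypothetical helper.

```python
import json


def wrap_formula(formula: str) -> str:
    """Wrap a LaTeX formula in the $$ delimiters required by latex_read."""
    return f"$${formula}$$"


# Raw string: each backslash is a single character in memory.
formula = r"x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}"
text = wrap_formula(formula)

# json.dumps escapes each backslash to \\ in the serialized body.
body = json.dumps({"text": text, "voice_setting": {"latex_read": True}})
```

Escaping by hand is only needed when the JSON body is written out as a literal string.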


audio_setting object

Audio configuration

sample_rate integer

Sample rate for generated audio

Options: 8000, 16000, 22050, 24000, 32000, 44100

Default: 32000

bitrate string

Bitrate for generated audio. Only effective for mp3 format

Options: 32000, 64000, 128000, 256000

Default: 128000

format string

Format for generated audio. wav is supported only in non-streaming output

Options: mp3, pcm, flac, wav

channel integer

Number of audio channels

Options: 1 (mono), 2 (stereo)

Default: 1

force_cbr boolean

Constant bitrate (CBR) control. When set to true, audio will be encoded at constant bitrate

Default: false

Note: This parameter only takes effect when streaming output is enabled and format is mp3
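
The constraints on audio_setting interact with the stream flag, so a client-side check can catch combinations that would be rejected or ignored. This is a sketch only; check_audio_setting is a hypothetical helper, not part of the API.

```python
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100}
BITRATES = {32000, 64000, 128000, 256000}
FORMATS = {"mp3", "pcm", "flac", "wav"}


def check_audio_setting(setting: dict, stream: bool = False) -> list:
    """Return a list of problems found in an audio_setting object."""
    problems = []
    fmt = setting.get("format")
    if fmt is not None and fmt not in FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if fmt == "wav" and stream:
        problems.append("wav is only supported in non-streaming output")
    if setting.get("sample_rate", 32000) not in SAMPLE_RATES:
        problems.append("unsupported sample_rate")
    # bitrate is typed as a string above but listed as numbers; accept both.
    if "bitrate" in setting and int(setting["bitrate"]) not in BITRATES:
        problems.append("unsupported bitrate")
    if setting.get("channel", 1) not in (1, 2):
        problems.append("channel must be 1 or 2")
    if setting.get("force_cbr") and not (stream and fmt == "mp3"):
        problems.append("force_cbr only takes effect for streaming mp3 output")
    return problems
```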


pronunciation_dict object

Pronunciation dictionary for custom word pronunciation

tone array

Define pronunciation rules for specific characters or symbols that need special annotation

For Chinese text, tones are represented by numbers: 1 for first tone, 2 for second tone, 3 for third tone, 4 for fourth tone, 5 for neutral tone

Examples: ["燕少飞/(yan4)(shao3)(fei1)", "omg/oh my god"]
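
Each tone entry follows a "word/reading" pattern. The sketch below reconstructs the two examples above; pinyin and tone_entry are hypothetical helpers written for illustration.

```python
def pinyin(*syllables) -> str:
    """Join (syllable, tone) pairs into a reading such as (yan4)(shao3)."""
    parts = []
    for syllable, tone in syllables:
        if tone not in (1, 2, 3, 4, 5):
            raise ValueError("tone must be 1-4, or 5 for the neutral tone")
        parts.append(f"({syllable}{tone})")
    return "".join(parts)


def tone_entry(word: str, reading: str) -> str:
    """Build one 'word/reading' rule for pronunciation_dict.tone."""
    if "/" in word:
        raise ValueError("word must not contain '/'")
    return f"{word}/{reading}"


pronunciation_dict = {
    "tone": [
        tone_entry("燕少飞", pinyin(("yan", 4), ("shao", 3), ("fei", 1))),
        tone_entry("omg", "oh my god"),
    ]
}
```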


timbre_weights array

Mixed voice configuration. Maximum 4 voices supported

Each item contains:

voice_id string

Voice ID for synthesis. Must be filled together with weight parameter

Supports system voices, cloned voices, and text-generated voices

weight integer

Weight for each voice. Must be filled together with voice_id

Range: 1 - 100

Higher weight means higher similarity to that voice

Example:

"timbre_weights": [
  {
    "voice_id": "female-chengshu",
    "weight": 30
  },
  {
    "voice_id": "female-tianmei",
    "weight": 70
  }
]
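
A small sketch that checks a timbre_weights list against the limits stated above (at most 4 voices, integer weights from 1 to 100). check_timbre_weights is a hypothetical helper; whether the weights must sum to a particular total is not stated here, so the sketch does not check it.

```python
def check_timbre_weights(items: list) -> None:
    """Raise ValueError if the list violates the documented limits."""
    if not 1 <= len(items) <= 4:
        raise ValueError("timbre_weights supports between 1 and 4 voices")
    for item in items:
        if not item.get("voice_id"):
            raise ValueError("each item needs a voice_id")
        weight = item.get("weight")
        if not isinstance(weight, int) or not 1 <= weight <= 100:
            raise ValueError("weight must be an integer within 1 - 100")


timbre_weights = [
    {"voice_id": "female-chengshu", "weight": 30},
    {"voice_id": "female-tianmei", "weight": 70},
]
check_timbre_weights(timbre_weights)  # passes silently
```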

language_boost string

Language or dialect whose recognition should be enhanced, intended for minority languages and dialects. Set to auto for automatic detection

Default: null

Options: Chinese, Chinese,Yue, English, Arabic, Russian, Spanish, French, Portuguese, German, Turkish, Dutch, Ukrainian, Vietnamese, Indonesian, Japanese, Italian, Korean, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi, Bulgarian, Danish, Hebrew, Malay, Persian, Slovak, Swedish, Croatian, Filipino, Hungarian, Norwegian, Slovenian, Catalan, Nynorsk, Tamil, Afrikaans, auto


voice_modify object

Voice effect processor settings. Supported audio formats: non-streaming: mp3, wav, flac; streaming: mp3

pitch integer

Pitch adjustment (deep/bright)

Range: -100 - 100

Values closer to -100 sound deeper, closer to 100 sound brighter

intensity integer

Intensity adjustment (power/softness)

Range: -100 - 100

Values closer to -100 sound more powerful, closer to 100 sound softer

timbre integer

Timbre adjustment (mellow/crisp)

Range: -100 - 100

Values closer to -100 sound more mellow, closer to 100 sound crisper

sound_effects string

Sound effects setting. Only one can be selected per request

Options: spacious_echo (spacious echo), auditorium_echo (auditorium broadcast), lofi_telephone (telephone distortion), robotic (electronic)
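
A hedged sketch of assembling a voice_modify object. make_voice_modify is a hypothetical helper; it includes only the fields that are explicitly set, because no defaults are documented for them.

```python
SOUND_EFFECTS = {"spacious_echo", "auditorium_echo", "lofi_telephone", "robotic"}


def make_voice_modify(pitch=None, intensity=None, timbre=None, sound_effects=None) -> dict:
    """Build a voice_modify object, checking each value that is provided."""
    result = {}
    for name, value in (("pitch", pitch), ("intensity", intensity), ("timbre", timbre)):
        if value is None:
            continue
        if not isinstance(value, int) or not -100 <= value <= 100:
            raise ValueError(f"{name} must be an integer within -100 - 100")
        result[name] = value
    if sound_effects is not None:
        # Only one effect can be selected per request, so this is a single string.
        if sound_effects not in SOUND_EFFECTS:
            raise ValueError(f"unsupported sound effect: {sound_effects}")
        result["sound_effects"] = sound_effects
    return result
```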


subtitle_enable boolean

Whether to enable subtitle service. Only effective in non-streaming output scenarios

Default: false


aigc_watermark boolean

Whether to append an audio rhythm identifier (the AIGC watermark) to the end of the synthesized audio. Only effective for non-streaming synthesis

Default: false


Polling

Audio generation takes time, so you need to poll the task status after creating the task

The initial response returns the task ID and initial status. The actual generation results must be obtained by polling the task status endpoint
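
A minimal polling loop might look like the sketch below. The status endpoint is not specified here, so the HTTP call is left to a caller-supplied fetch_status function (hypothetical). The terminal statuses completed and failed come from the Response Format section; any other status is treated as still in progress, which is an assumption.

```python
import time
from typing import Callable


def poll_until_done(fetch_status: Callable[[], dict],
                    interval: float = 2.0,
                    timeout: float = 120.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status repeatedly until the task completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "completed":
            return task
        if status == "failed":
            error = task.get("error", {})
            raise RuntimeError(f"{error.get('code')}: {error.get('error_message')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)
```

Passing sleep as a parameter keeps the loop easy to test without real delays.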

Response Format

error object

Error information. Only present when status is failed.

code integer

Error code.

error_message string

Detailed error message.


output array

Generation results. Only present when status is completed.

content array

List of generated audio content

type string

Resource type, fixed as audio

url string

Audio file URL (when output_format is url)

data string

Audio hexadecimal data (when output_format is hex)

format string

Data format (used in streaming output)

index integer

Data chunk index (used in streaming output)

size integer

Data chunk size (used in streaming output)


usage object

Usage statistics. Only present when status is completed


metadata object

Metadata information
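
To illustrate reading a completed response, the sketch below collects audio from the output array. It assumes each output item holds a content list whose entries carry type plus either url or data, as the field order above suggests; extract_audio is a hypothetical helper.

```python
def extract_audio(task: dict) -> list:
    """Return audio results: bytes for hex data, str for URLs."""
    results = []
    for item in task.get("output", []):
        for content in item.get("content", []):
            if content.get("type") != "audio":
                continue
            if "data" in content:
                # output_format is hex: decode the hexadecimal string to raw bytes.
                results.append(bytes.fromhex(content["data"]))
            elif "url" in content:
                # output_format is url: keep the link for a later download.
                results.append(content["url"])
    return results
```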


Error Codes

Error Code    Description
010008095     Internal generation error
010008096     Result parsing error
010008097     HTTP error response
010008099     Synchronous generation error
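
The table can be turned into a lookup for friendlier error handling. Because error.code is typed as an integer while the codes are listed with a leading zero, the sketch pads the value to 9 digits before the lookup; that normalization is an assumption, and describe_error is a hypothetical helper.

```python
ERROR_CODES = {
    "010008095": "Internal generation error",
    "010008096": "Result parsing error",
    "010008097": "HTTP error response",
    "010008099": "Synchronous generation error",
}


def describe_error(error: dict) -> str:
    """Map an error object to a description, falling back to its own message."""
    # Assumption: an integer code drops the leading zero, so pad it back.
    code = str(error.get("code", "")).zfill(9)
    return ERROR_CODES.get(code, error.get("error_message") or "Unknown error")
```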