Transcribe audio
Transcribe audio from a file. The audio data is sent as raw bytes, along with the filename, which is used for MIME type detection.
Response
The response contains the transcribed text along with optional metadata such as language, duration, word-level timestamps, and segments.
Word-level timestamps
Request word-level timestamps for precise timing information by including "word" in TimestampGranularities and setting ResponseFormat to "verbose_json".
Segment-level timestamps
Request segment-level timestamps for paragraph- or sentence-level timing by including "segment" in TimestampGranularities and setting ResponseFormat to "verbose_json".
Request Configuration
| Parameter | Type | Description |
|---|---|---|
| Audio | []byte | Raw audio data to transcribe. |
| AudioFilename | string | Filename for the audio file (e.g., "audio.mp3"). Used for MIME type detection. |
| Model | string | The model to use for transcription (e.g., "OpenAI/whisper-1"). |
| Language | *string | Optional. Language of the input audio in ISO-639-1 format (e.g., "en", "es"). |
| Prompt | *string | Optional. Text to guide the model’s style or continue a previous audio segment. |
| ResponseFormat | *string | Optional. Output format: "json", "text", "srt", "verbose_json", or "vtt". |
| Temperature | *float64 | Optional. Sampling temperature between 0 and 1. Lower values are more deterministic. |
| TimestampGranularities | []string | Optional. Timestamp granularities: "word", "segment". Requires verbose_json format. |
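The constraints in the table above can be checked before a request is sent. The following is a minimal sketch, not the SDK's own code: `TranscriptionRequest`, `ptr`, and `validate` are hypothetical names that only mirror the documented fields, and the checks for empty `Audio` and `AudioFilename` are assumptions rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// TranscriptionRequest is a local stand-in mirroring the parameter table;
// the SDK's real type name and package may differ.
type TranscriptionRequest struct {
	Audio                  []byte
	AudioFilename          string
	Model                  string
	Language               *string
	Prompt                 *string
	ResponseFormat         *string
	Temperature            *float64
	TimestampGranularities []string
}

// ptr returns a pointer to v, which is convenient for the optional fields.
func ptr[T any](v T) *T { return &v }

// validate checks the constraints stated in the table before sending.
func validate(r TranscriptionRequest) error {
	if len(r.Audio) == 0 {
		return errors.New("audio is empty") // assumed requirement
	}
	if r.AudioFilename == "" {
		return errors.New("audio filename is required for MIME type detection") // assumed requirement
	}
	if r.Temperature != nil && (*r.Temperature < 0 || *r.Temperature > 1) {
		return fmt.Errorf("temperature %v is outside [0, 1]", *r.Temperature)
	}
	if len(r.TimestampGranularities) > 0 &&
		(r.ResponseFormat == nil || *r.ResponseFormat != "verbose_json") {
		return errors.New("timestamp granularities require response format verbose_json")
	}
	return nil
}
```

Optional fields are pointers, so a helper such as `ptr("verbose_json")` avoids declaring a temporary variable for each one; leaving a field `nil` means it is omitted.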
Response Structure
| Field | Type | Description |
|---|---|---|
| Text | string | The transcribed text. |
| Language | *string | Detected language of the audio. |
| Duration | *float64 | Duration of the audio in seconds. |
| Words | []Word | Word-level timestamps (when requested). |
| Segments | []Segment | Segment-level timestamps (when requested). |
| Usage | *Usage | Token usage statistics. |
Word
| Field | Type | Description |
|---|---|---|
| Word | string | The transcribed word. |
| Start | float64 | Start time in seconds. |
| End | float64 | End time in seconds. |
Segment
| Field | Type | Description |
|---|---|---|
| ID | int | Segment index. |
| Seek | int | Seek offset of the segment. |
| Start | float64 | Start time in seconds. |
| End | float64 | End time in seconds. |
| Text | string | Transcribed text of the segment. |
| Temperature | float64 | Temperature used for this segment. |
| AvgLogprob | float64 | Average log probability of the segment. |
| CompressionRatio | float64 | Compression ratio of the segment. |
| NoSpeechProb | float64 | Probability that the segment contains no speech. |
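As an illustration of how the segment fields might be used, the sketch below renders segments as timestamped lines and skips segments that are probably silence. The `Segment` type is a local stand-in for the table above, and `formatTimestamp`, `renderSegments`, and the `maxNoSpeech` cutoff are hypothetical; the SDK does not prescribe a threshold.

```go
package main

import (
	"fmt"
	"strings"
)

// Segment is a local stand-in mirroring the Segment table.
type Segment struct {
	ID               int
	Seek             int
	Start            float64
	End              float64
	Text             string
	Temperature      float64
	AvgLogprob       float64
	CompressionRatio float64
	NoSpeechProb     float64
}

// formatTimestamp renders a non-negative offset in seconds as MM:SS.mmm.
func formatTimestamp(seconds float64) string {
	ms := int64(seconds*1000 + 0.5) // round to the nearest millisecond
	return fmt.Sprintf("%02d:%02d.%03d", ms/60000, (ms%60000)/1000, ms%1000)
}

// renderSegments writes one line per segment, skipping segments whose
// NoSpeechProb exceeds maxNoSpeech (a caller-chosen cutoff).
func renderSegments(segments []Segment, maxNoSpeech float64) string {
	var b strings.Builder
	for _, s := range segments {
		if s.NoSpeechProb > maxNoSpeech {
			continue
		}
		fmt.Fprintf(&b, "[%s - %s] %s\n",
			formatTimestamp(s.Start), formatTimestamp(s.End), strings.TrimSpace(s.Text))
	}
	return b.String()
}
```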
Usage
| Field | Type | Description |
|---|---|---|
| PromptTokens | int | Number of input tokens processed. |
| CompletionTokens | int | Number of output tokens generated. |
| TotalTokens | int | Total tokens used. |
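Because Language, Duration, and Usage are pointers and Words is only populated when requested, code that reads a response should guard each optional field. A minimal sketch, assuming local stand-in types that mirror the tables above (Segments is omitted for brevity, and `describe` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins mirroring the response tables; the SDK's own types may differ.
type Word struct {
	Word  string
	Start float64
	End   float64
}

type Usage struct {
	PromptTokens     int
	CompletionTokens int
	TotalTokens      int
}

type TranscriptionResponse struct {
	Text     string
	Language *string
	Duration *float64
	Words    []Word
	Usage    *Usage
}

// describe summarizes a response, printing optional fields only when present.
func describe(resp TranscriptionResponse) string {
	var b strings.Builder
	fmt.Fprintf(&b, "text: %s\n", resp.Text)
	if resp.Language != nil {
		fmt.Fprintf(&b, "language: %s\n", *resp.Language)
	}
	if resp.Duration != nil {
		fmt.Fprintf(&b, "duration: %.2fs\n", *resp.Duration)
	}
	for _, w := range resp.Words { // empty unless word timestamps were requested
		fmt.Fprintf(&b, "  %.2f-%.2f %s\n", w.Start, w.End, w.Word)
	}
	if resp.Usage != nil {
		fmt.Fprintf(&b, "tokens: %d\n", resp.Usage.TotalTokens)
	}
	return b.String()
}
```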
Example: Complete Transcription
Supported Audio Formats
The SDK automatically detects the MIME type from the AudioFilename. Supported formats include:
| Format | Extension | MIME Type |
|---|---|---|
| MP3 | .mp3 | audio/mpeg |
| WAV | .wav | audio/wav |
| FLAC | .flac | audio/flac |
| OGG | .ogg | audio/ogg |
| M4A | .m4a | audio/mp4 |
| AAC | .aac | audio/aac |
| WebM | .webm | audio/webm |
| PCM | .pcm | audio/L16 |
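The mapping in the table amounts to a lookup on the file extension. The sketch below shows one way such a lookup could work; it is not the SDK's implementation, `mimeTypeFor` is a hypothetical name, and treating extensions case-insensitively is an assumption.

```go
package main

import (
	"path/filepath"
	"strings"
)

// audioMIMETypes reproduces the extension-to-MIME-type table above.
var audioMIMETypes = map[string]string{
	".mp3":  "audio/mpeg",
	".wav":  "audio/wav",
	".flac": "audio/flac",
	".ogg":  "audio/ogg",
	".m4a":  "audio/mp4",
	".aac":  "audio/aac",
	".webm": "audio/webm",
	".pcm":  "audio/L16",
}

// mimeTypeFor returns the MIME type for filename's extension and reports
// whether the extension is one of the supported formats.
func mimeTypeFor(filename string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(filename)) // assumed: case-insensitive match
	mimeType, ok := audioMIMETypes[ext]
	return mimeType, ok
}
```

For example, `mimeTypeFor("audio.mp3")` yields `"audio/mpeg"`, while a filename with an unlisted extension reports `false`, which a caller could surface as an error before uploading.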
Supported Providers
| Provider | Transcription |
|---|---|
| OpenAI | ✅ |
| Gemini | ✅ |
| ElevenLabs | ✅ |
| Anthropic | ❌ |