The HasteKit SDK supports speech-to-text transcription with providers such as OpenAI, Gemini, and ElevenLabs.

Transcribe audio

Transcribe audio from a file. The audio data is sent as raw bytes along with the filename for MIME type detection.
import (
    "context"
    "fmt"
    "log"
    "os"
    "github.com/hastekit/hastekit-sdk-go/pkg/gateway/llm/transcription"
)

// Read audio file
audio, err := os.ReadFile("recording.mp3")
if err != nil {
    log.Fatal(err)
}

resp, err := client.NewTranscription(context.Background(), &transcription.Request{
    Model:         "OpenAI/whisper-1",
    Audio:         audio,
    AudioFilename: "recording.mp3",
})
if err != nil {
    log.Fatal(err)
}

fmt.Println("Transcription:", resp.Text)

Response

The response contains the transcribed text along with optional metadata like language, duration, word-level timestamps, and segments.
// Access transcribed text
fmt.Println("Text:", resp.Text)

// Access detected language
if resp.Language != nil {
    fmt.Println("Language:", *resp.Language)
}

// Access audio duration
if resp.Duration != nil {
    fmt.Printf("Duration: %.2f seconds\n", *resp.Duration)
}

// Access usage statistics
if resp.Usage != nil {
    fmt.Printf("Tokens used: %d\n", resp.Usage.TotalTokens)
}

Word-level timestamps

Request word-level timestamps for precise timing information.
resp, err := client.NewTranscription(context.Background(), &transcription.Request{
    Model:                  "OpenAI/whisper-1",
    Audio:                  audio,
    AudioFilename:          "recording.mp3",
    ResponseFormat:         utils.Ptr("verbose_json"),
    TimestampGranularities: []string{"word"},
})
if err != nil {
    log.Fatal(err)
}

for _, word := range resp.Words {
    fmt.Printf("[%.2fs - %.2fs] %s\n", word.Start, word.End, word.Word)
}
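The optional fields above (such as ResponseFormat) are pointers, which is why the snippet wraps values in utils.Ptr. Presumably utils.Ptr is a small generic helper that returns a pointer to its argument; a sketch of what it likely does (the SDK's actual implementation may differ):

```go
package main

import "fmt"

// Ptr returns a pointer to v. This is a sketch of what the SDK's
// utils.Ptr presumably does, so literals can be passed to optional
// pointer fields without a temporary variable.
func Ptr[T any](v T) *T {
	return &v
}

func main() {
	format := Ptr("verbose_json")
	fmt.Println(*format) // verbose_json
}
```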

Segment-level timestamps

Request segment-level timestamps for paragraph or sentence-level timing.
resp, err := client.NewTranscription(context.Background(), &transcription.Request{
    Model:                  "OpenAI/whisper-1",
    Audio:                  audio,
    AudioFilename:          "recording.mp3",
    ResponseFormat:         utils.Ptr("verbose_json"),
    TimestampGranularities: []string{"segment"},
})
if err != nil {
    log.Fatal(err)
}

for _, seg := range resp.Segments {
    fmt.Printf("[%.2fs - %.2fs] %s\n", seg.Start, seg.End, seg.Text)
}

Request Configuration

| Parameter | Type | Description |
|---|---|---|
| Audio | []byte | Raw audio data to transcribe. |
| AudioFilename | string | Filename for the audio file (e.g., "audio.mp3"). Used for MIME type detection. |
| Model | string | The model to use for transcription (e.g., "OpenAI/whisper-1"). |
| Language | *string | Optional. Language of the input audio in ISO-639-1 format (e.g., "en", "es"). |
| Prompt | *string | Optional. Text to guide the model's style or continue a previous audio segment. |
| ResponseFormat | *string | Optional. Output format: "json", "text", "srt", "verbose_json", or "vtt". |
| Temperature | *float64 | Optional. Sampling temperature between 0 and 1. Lower values are more deterministic. |
| TimestampGranularities | []string | Optional. Timestamp granularities: "word", "segment". Requires verbose_json format. |
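Two of the constraints in the table can be checked before sending a request: ResponseFormat must be one of the documented values, and timestamp granularities require verbose_json. A hypothetical pre-flight check (not part of the SDK):

```go
package main

import "fmt"

// validFormats lists the documented ResponseFormat values.
var validFormats = map[string]bool{
	"json": true, "text": true, "srt": true,
	"verbose_json": true, "vtt": true,
}

// checkRequest enforces two documented constraints: ResponseFormat
// must be a known value, and TimestampGranularities requires the
// verbose_json response format.
func checkRequest(format *string, granularities []string) error {
	if format != nil && !validFormats[*format] {
		return fmt.Errorf("unknown response format %q", *format)
	}
	if len(granularities) > 0 && (format == nil || *format != "verbose_json") {
		return fmt.Errorf("timestamp granularities require verbose_json")
	}
	return nil
}

func main() {
	f := "text"
	if err := checkRequest(&f, []string{"word"}); err != nil {
		fmt.Println("invalid:", err)
	}
}
```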

Response Structure

| Field | Type | Description |
|---|---|---|
| Text | string | The transcribed text. |
| Language | *string | Detected language of the audio. |
| Duration | *float64 | Duration of the audio in seconds. |
| Words | []Word | Word-level timestamps (when requested). |
| Segments | []Segment | Segment-level timestamps (when requested). |
| Usage | *Usage | Token usage statistics. |

Word

| Field | Type | Description |
|---|---|---|
| Word | string | The transcribed word. |
| Start | float64 | Start time in seconds. |
| End | float64 | End time in seconds. |
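These fields are enough to build subtitle-style output. A sketch that formats word timings as MM:SS.mmm clocks (the Word struct here just mirrors the documented fields, not the SDK's own type):

```go
package main

import "fmt"

// Word mirrors the documented word-timestamp fields.
type Word struct {
	Word  string
	Start float64
	End   float64
}

// clock formats a time in seconds as MM:SS.mmm.
func clock(sec float64) string {
	m := int(sec) / 60
	s := sec - float64(m*60)
	return fmt.Sprintf("%02d:%06.3f", m, s)
}

func main() {
	words := []Word{
		{Word: "hello", Start: 0.0, End: 0.42},
		{Word: "world", Start: 0.42, End: 0.9},
	}
	for _, w := range words {
		fmt.Printf("[%s - %s] %s\n", clock(w.Start), clock(w.End), w.Word)
	}
}
```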

Segment

| Field | Type | Description |
|---|---|---|
| ID | int | Segment index. |
| Seek | int | Seek offset of the segment. |
| Start | float64 | Start time in seconds. |
| End | float64 | End time in seconds. |
| Text | string | Transcribed text of the segment. |
| Temperature | float64 | Temperature used for this segment. |
| AvgLogprob | float64 | Average log probability of the segment. |
| CompressionRatio | float64 | Compression ratio of the segment. |
| NoSpeechProb | float64 | Probability that the segment contains no speech. |
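NoSpeechProb can be used to drop segments that are likely silence or background noise. A sketch with a hypothetical threshold of 0.5 (the Segment struct here mirrors only the documented fields used, not the SDK's own type):

```go
package main

import "fmt"

// Segment mirrors the documented segment fields used below.
type Segment struct {
	Text         string
	Start        float64
	End          float64
	NoSpeechProb float64
}

// speechOnly keeps segments whose no-speech probability is below
// the given threshold.
func speechOnly(segs []Segment, threshold float64) []Segment {
	var out []Segment
	for _, s := range segs {
		if s.NoSpeechProb < threshold {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	segs := []Segment{
		{Text: "Welcome to the meeting.", Start: 0, End: 2.1, NoSpeechProb: 0.01},
		{Text: "[background noise]", Start: 2.1, End: 4.0, NoSpeechProb: 0.92},
	}
	for _, s := range speechOnly(segs, 0.5) {
		fmt.Printf("[%.2fs - %.2fs] %s\n", s.Start, s.End, s.Text)
	}
}
```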

Usage

| Field | Type | Description |
|---|---|---|
| PromptTokens | int | Number of input tokens processed. |
| CompletionTokens | int | Number of output tokens generated. |
| TotalTokens | int | Total tokens used. |

Example: Complete Transcription

package main

import (
    "context"
    "log"
    "os"

    "github.com/hastekit/hastekit-sdk-go/pkg/utils"
    "github.com/hastekit/hastekit-sdk-go/pkg/gateway"
    "github.com/hastekit/hastekit-sdk-go/pkg/gateway/llm"
    "github.com/hastekit/hastekit-sdk-go/pkg/gateway/llm/transcription"
    hastekit "github.com/hastekit/hastekit-sdk-go"
)

func main() {
    // Initialize SDK client
    client, err := hastekit.New(&hastekit.ClientOptions{
        ProviderConfigs: []gateway.ProviderConfig{
            {
                ProviderName:  llm.ProviderNameOpenAI,
                BaseURL:       "",
                CustomHeaders: nil,
                ApiKeys: []*gateway.APIKeyConfig{
                    {
                        Name:   "Key 1",
                        APIKey: os.Getenv("OPENAI_API_KEY"),
                    },
                },
            },
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // Read audio file
    audio, err := os.ReadFile("meeting.mp3")
    if err != nil {
        log.Fatal(err)
    }

    // Transcribe with word-level timestamps
    resp, err := client.NewTranscription(context.Background(), &transcription.Request{
        Model:                  "OpenAI/whisper-1",
        Audio:                  audio,
        AudioFilename:          "meeting.mp3",
        Language:               utils.Ptr("en"),
        ResponseFormat:         utils.Ptr("verbose_json"),
        TimestampGranularities: []string{"word", "segment"},
    })
    if err != nil {
        log.Fatal(err)
    }

    log.Printf("Transcription: %s\n", resp.Text)
    if resp.Duration != nil {
        log.Printf("Duration: %.2f seconds\n", *resp.Duration)
    }
    if resp.Language != nil {
        log.Printf("Language: %s\n", *resp.Language)
    }
}

Supported Audio Formats

The SDK automatically detects the MIME type from the AudioFilename. Supported formats include:
| Format | Extension | MIME Type |
|---|---|---|
| MP3 | .mp3 | audio/mpeg |
| WAV | .wav | audio/wav |
| FLAC | .flac | audio/flac |
| OGG | .ogg | audio/ogg |
| M4A | .m4a | audio/mp4 |
| AAC | .aac | audio/aac |
| WebM | .webm | audio/webm |
| PCM | .pcm | audio/L16 |
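Filename-based detection like this amounts to a case-insensitive extension lookup. A sketch with a hypothetical helper (the SDK's internal logic may differ):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt maps the supported extensions to their MIME types,
// per the table above.
var mimeByExt = map[string]string{
	".mp3": "audio/mpeg", ".wav": "audio/wav", ".flac": "audio/flac",
	".ogg": "audio/ogg", ".m4a": "audio/mp4", ".aac": "audio/aac",
	".webm": "audio/webm", ".pcm": "audio/L16",
}

// detectMIME returns the MIME type for a filename, or "" if the
// extension is not recognized. The lookup is case-insensitive.
func detectMIME(filename string) string {
	return mimeByExt[strings.ToLower(filepath.Ext(filename))]
}

func main() {
	fmt.Println(detectMIME("recording.MP3")) // audio/mpeg
}
```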

Supported Providers

| Provider | Transcription |
|---|---|
| OpenAI | ✅ |
| Gemini | ✅ |
| ElevenLabs | ✅ |
| Anthropic | ❌ |