Automatic Speech Recognition

Convert audio to text using a fine-tuned Whisper model with GPU acceleration. Supports single-file transcription over HTTP and real-time streaming with VAD-based end-of-turn detection over WebSocket.

POST  /transcribe

Transcribe Audio File

Upload an audio file and receive the full transcript. Internally resamples to 16 kHz mono before inference. Long files are auto-split using Silero-VAD.

Request — multipart/form-data
file
binary
required
Audio file. Accepted: wav, mp3, flac, ogg, m4a, webm.
language
string
optional
ISO 639-1 code to force language. Omit for auto-detect.
Default: null
Response
filename
string
Original filename.
text
string
Full transcript.
segments
array
Per-segment objects with start, end, text, tokens, avg_logprob, no_speech_prob.
duration
float
Audio duration in seconds.
processing_time
float
Inference time in seconds.
Request
const form = new FormData();
form.append('file', blob, 'audio.wav');
const res = await fetch(
  'http://localhost:8124/transcribe',
  { method: 'POST', body: form }
);
const data = await res.json();
200 OK
{
  "filename": "audio.wav",
  "text": "Xin chào, tôi cần hỗ trợ.",
  "segments": [{
    "id": 0, "start": 0.0, "end": 3.5,
    "text": "Xin chào, tôi cần hỗ trợ.",
    "avg_logprob": -0.15,
    "no_speech_prob": 0.001
  }],
  "duration": 3.84,
  "processing_time": 0.62
}
POST  /transcribe-with-property

Transcribe with Diarization

Transcribe with word-level timestamps and speaker diarization. Uses pyannote.audio to identify and label multiple speakers in the recording.

Request — multipart/form-data
file
binary
required
Audio file (wav, mp3, flac, ogg, m4a).
language
string
optional
Language code. Auto-detect if omitted.
Response
dialogues
array
Array of speaker turns, each with speaker label, start/end, text, and word-level timestamps.
processing_time
float
Total processing time in seconds.
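The request is built the same way as /transcribe; a minimal sketch (audioBlob is assumed):
Request
const form = new FormData();
form.append('file', audioBlob, 'meeting.wav');
// Optional: force language
// form.append('language', 'vi');
const res = await fetch(
  'http://localhost:8124/transcribe-with-property',
  { method: 'POST', body: form }
);
const { dialogues, processing_time } = await res.json();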
200 OK
{
  "dialogues": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0, "end": 4.2,
      "text": "Xin chào bạn.",
      "words": [
        {"word": "Xin",   "start": 0.05, "end": 0.25},
        {"word": "chào",  "start": 0.28, "end": 0.55},
        {"word": "bạn.",  "start": 0.60, "end": 0.90}
      ]
    },
    {
      "speaker": "SPEAKER_01",
      "start": 4.5, "end": 7.1,
      "text": "Vâng, tôi nghe rồi.",
      "words": [...]
    }
  ],
  "processing_time": 12.5
}
WS  /ws/transcribe/vad

Real-time Streaming Transcription

Stream PCM audio and receive partial transcripts as the user speaks, followed by a final transcript (is_final: true) when silence is detected. Uses Silero-VAD for end-of-turn detection.

Recommended: 100 ms chunks with silence_ms=300 for minimal latency. Chunk duration must be shorter than the silence threshold.

Query Parameters
ssap
bool
optional
Enable Semantic Social Audio Profiler. Emits voice properties (gender, age, emotion) alongside transcripts.
Default: false

Client → Server
audio_event
object
PCM audio chunk. Must be int16, 16 kHz, mono, base64-encoded in audio_event.audio_base_64.
config
object
{"type":"config","ssap":true} — Enable SSAP mid-stream.
end
object
{"type":"end"} — Flush buffer, emit final, close.

Server → Client
type
"transcript" | "ssap" | "error"
Event type.
text
string
Transcribed text (present on transcript events).
is_final
boolean
true = end-of-turn (silence > 300 ms). Forward to LLM. false = partial preview.
properties
object
On type:"ssap" — voice profile: {gender, age, emotion} each as Record<string, float>.

Server-side defaults
silence_ms
int
End-of-turn silence threshold.
Default: 300
volume_threshold
float
RMS amplitude gate (~-40 dBFS). Filters breath and background noise.
Default: 0.01
partial_interval
float
Minimum voiced audio (sec) before emitting a partial.
Default: 1.0
Connect & Stream
const ws = new WebSocket(
  'ws://localhost:8124/ws/transcribe/vad'
);

// Float32 mic samples → Int16 PCM → base64
function toBase64(f32) {
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    // Clamp to [-1, 1] to avoid int16 overflow on hot signals
    const s = Math.max(-1, Math.min(1, f32[i]));
    i16[i] = Math.round(s * 32767);
  }
  const u8 = new Uint8Array(i16.buffer);
  let str = '';
  u8.forEach(b => str += String.fromCharCode(b));
  return btoa(str);
}

ws.send(JSON.stringify({
  audio_event: { audio_base_64: toBase64(chunk) }
}));
ws.send(JSON.stringify({ type: 'end' }));
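Producing the chunk values above is left to the client; a minimal capture sketch using getUserMedia and a ScriptProcessorNode, reusing ws and toBase64 from above (2048 samples at 16 kHz ≈ 128 ms, close to the recommended 100 ms; an AudioWorklet is the modern alternative):
Capture mic chunks (sketch)
const ctx = new AudioContext({ sampleRate: 16000 });
const mic = await navigator.mediaDevices
  .getUserMedia({ audio: true });
const src = ctx.createMediaStreamSource(mic);
// 2048 samples @ 16 kHz ≈ 128 ms per chunk
const proc = ctx.createScriptProcessor(2048, 1, 1);
proc.onaudioprocess = (e) => {
  const chunk = e.inputBuffer.getChannelData(0);
  ws.send(JSON.stringify({
    audio_event: { audio_base_64: toBase64(chunk) }
  }));
};
src.connect(proc);
proc.connect(ctx.destination);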
Handle events
ws.onmessage = ({ data }) => {
  const msg = JSON.parse(data);

  if (msg.type === 'ssap') {
    updateProfile(msg.properties); return;
  }
  if (msg.type === 'error') {
    console.error(msg.message); return;
  }
  if (msg.is_final) sendToLLM(msg.text);
  else              showPartial(msg.text);
};
Partial (is_final: false)
{ "type": "transcript",
  "text": "Xin chào bạn",
  "is_final": false }
End-of-turn (is_final: true)
{ "type": "transcript",
  "text": "Xin chào bạn ơi.",
  "is_final": true }
SSAP event
{ "type": "ssap",
  "properties": {
    "gender":  {"female": 0.92, "male": 0.08},
    "age":     {"adult": 0.75, "young": 0.2},
    "emotion": {"neutral": 0.6, "happy": 0.3}
  }
}
WS  /ws/voice-chat

Interactive Voice Chat

Full ASR → LLM → TTS pipeline over a single WebSocket. Send mic audio, receive agent responses as both text events and binary PCM audio. Supports barge-in, proactive follow-ups, and SSAP voice profiling.

First message must be a config frame. The server replies with {"type":"ready"} after accepting configuration.

Client → Server
config
object
1st message
Session configuration. All fields optional except type.
See schema below.
audio_event
object
PCM audio: audio_event.audio_base_64 — int16, 16 kHz, mono, base64.
text_input
object
{"type":"text_input","text":"..."} — bypass ASR, inject text directly.
end
object
{"type":"end"} — graceful disconnect.

Server → Client
ready
object
{"type":"ready"} — connection established, ready to receive audio.
transcript
object
ASR result: {"type":"transcript","text":"...","is_final":bool}.
thinking
object
{"type":"thinking"} — LLM is generating.
agent_text
object
{"type":"agent_text","text":"..."} — streamed LLM token.
agent_end
object
{"type":"agent_end"} — LLM response complete.
binary audio
binary
Raw PCM frames: int16, 24 kHz, mono. Interleaved with agent_text events.
ssap
object
{"type":"ssap","properties":{gender,age,emotion}} — voice profile (if ssap enabled).
latency
object
{"type":"latency","step":"llm|tts_first","value":"142ms"} — timing metrics.
error
object
{"type":"error","source":"llm|tts|asr","message":"..."}
disconnect
object
{"type":"disconnect","reason":"timeout"} — server-initiated close after inactivity.

Config frame fields
speaker_id
string
optional
TTS voice UUID. Uses server default if omitted.
collection_id
string
optional
Speaker collection UUID. Enables multilang speaker routing.
language
string
optional
Default language code for ASR and TTS (vi, en, fr, ja, ko, zh, ru).
Default: "vi"
tts_ws_url
string
optional
WebSocket URL of TTS service. Uses TTS_WS_URL env if omitted.
asr_ws_url
string
optional
External ASR WebSocket URL. Uses built-in VAD pipeline if omitted.
llm_base_url
string
optional
OpenAI-compatible LLM base URL.
llm_model
string
optional
Model name (e.g. gpt-4o, gpt-4o-mini).
ssap
boolean
optional
Enable voice profiling (gender/age/emotion). Emits ssap events.
Default: false
Connect & Configure
const ws = new WebSocket(
  'ws://localhost:8124/ws/voice-chat'
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'config',
    speaker_id: 'uuid-...',
    collection_id: 'uuid-...',
    language: 'vi',
    llm_model: 'gpt-4o-mini',
    ssap: true
  }));
};
Handle messages
ws.binaryType = 'arraybuffer'; // deliver binary frames as ArrayBuffer, not Blob

ws.onmessage = ({ data }) => {
  // Binary = TTS PCM audio
  if (data instanceof ArrayBuffer) {
    playPcm(data, 24000); return;
  }
  const msg = JSON.parse(data);
  switch (msg.type) {
    case 'ready':      onReady(); break;
    case 'transcript': onTranscript(msg); break;
    case 'thinking':   showSpinner(); break;
    case 'agent_text': appendText(msg.text); break;
    case 'agent_end':  onResponseEnd(); break;
    case 'ssap':       updateProfile(msg); break;
    case 'latency':    logLatency(msg); break;
    case 'error':      onError(msg); break;
  }
};
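playPcm above is assumed; a minimal sketch that converts int16 PCM to a Web Audio buffer and plays it (naive immediate playback; a real client should queue and schedule frames):
playPcm (sketch)
const audioCtx = new AudioContext();
function playPcm(buf, sampleRate) {
  const i16 = new Int16Array(buf);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++)
    f32[i] = i16[i] / 32768; // int16 → [-1, 1)
  const ab = audioCtx.createBuffer(1, f32.length, sampleRate);
  ab.copyToChannel(f32, 0);
  const node = audioCtx.createBufferSource();
  node.buffer = ab;
  node.connect(audioCtx.destination);
  node.start();
}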
Language detection prefixes
// Server detects lang from ASR output prefix:
// [vi] → Vietnamese TTS speaker
// [en] → English TTS speaker
// [fr] → French TTS speaker
// [ja] → Japanese TTS speaker
// [ko] → Korean TTS speaker
// [zh] → Chinese TTS speaker
// [ru] → Russian TTS speaker
// Uses collection_id to pick correct speaker

Speaker Verification & Anti-Spoof

Register voice prints and verify speaker identity using embedding cosine similarity with AS-Norm scoring. Detect synthetic or replayed audio with an ONNX-based anti-spoofing model.

POST  /voice/register

Register Speaker

Upload a reference audio file to create a speaker voice print. Returns a speaker_id UUID for subsequent verification calls.

Request — multipart/form-data
file
binary
required
Reference audio (wav, mp3, flac, ogg). Minimum 3 seconds recommended.
name
string
required
Display name for the speaker.
description
string
required
Short description of the speaker.
Response
status
string
"success"
speaker_id
string
UUID for this speaker voice print.
file_path
string
Stored path of the reference WAV.
Request
const form = new FormData();
form.append('file', audioBlob, 'ref.wav');
form.append('name', 'Nguyen Van A');
form.append('description', 'Customer support');
const res = await fetch(
  'http://localhost:8124/voice/register',
  { method: 'POST', body: form }
);
200 OK
{
  "status": "success",
  "speaker_id": "a1b2c3d4-...",
  "file_path": "speakers/a1b2c3d4-....wav"
}
POST  /voice/verify/{speaker_id}

Verify Against Registered Speaker

Compare one or more audio files against a registered speaker's voice print. Returns a score and ACCEPT/REJECT verdict for each file.

Path Parameters
speaker_id
string
required
UUID from /voice/register.
Request — multipart/form-data
files
binary[]
required
One or more audio files to verify.
Response
results
array
Per-file: file, score (float), verdict ("ACCEPT"|"REJECT"), path.
Threshold: score > 3.5 → ACCEPT (AS-Norm cosine similarity, higher = more similar).
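A minimal request sketch (speakerId and the test blobs are assumed):
Request
const form = new FormData();
form.append('files', t1Blob, 'test1.wav');
form.append('files', t2Blob, 'test2.wav');
const res = await fetch(
  `http://localhost:8124/voice/verify/${speakerId}`,
  { method: 'POST', body: form }
);
const { results } = await res.json();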
200 OK
{
  "speaker_id": "a1b2c3d4-...",
  "reference": "speakers/a1b2.wav",
  "results": [
    {
      "file": "test1.wav",
      "score": 4.21,
      "verdict": "ACCEPT",
      "path": "tmp/test1.wav"
    },
    {
      "file": "test2.wav",
      "score": 1.03,
      "verdict": "REJECT",
      "path": "tmp/test2.wav"
    }
  ]
}
POST  /voice/verify_with_reference

Verify Against Custom Reference

Ad-hoc verification without prior registration. Upload a reference file and test files in one request.

Request — multipart/form-data
reference
binary
required
Reference audio file.
files
binary[]
required
Test audio files to verify.
Response
reference_file
string
Reference filename.
results
array
Same structure as /voice/verify.
Request
const form = new FormData();
form.append('reference', refBlob, 'ref.wav');
form.append('files',     t1Blob,  'test1.wav');
form.append('files',     t2Blob,  'test2.wav');
const res = await fetch(
  '.../voice/verify_with_reference',
  { method: 'POST', body: form }
);
200 OK
{
  "reference_file": "ref.wav",
  "results": [
    {"file":"test1.wav","score":3.9,"verdict":"ACCEPT"},
    {"file":"test2.wav","score":0.8,"verdict":"REJECT"}
  ]
}
POST  /voice/anti-spoof

Anti-Spoofing Detection

Detect AI-generated or replayed audio to distinguish real human speech from synthetic voices. Uses an ONNX-based model with three strictness modes.

Request — multipart/form-data
files
binary[]
required
Audio files to analyze.
mode
string
optional
Detection strictness:
  • bank_top_security — strictest, low false accept rate
  • bank_flex — balanced (recommended)
  • consumer — lenient, fewer false rejects
Default: bank_flex
Response
results
array
Per-file: file, result ("real"|"fake"), is_real, score, threshold, confidence, qa_passed.
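A minimal request sketch (sampleBlob is assumed; mode is optional):
Request
const form = new FormData();
form.append('files', sampleBlob, 'sample.wav');
form.append('mode', 'bank_flex'); // optional
const res = await fetch(
  'http://localhost:8124/voice/anti-spoof',
  { method: 'POST', body: form }
);
const { results } = await res.json();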
200 OK
{
  "results": [
    {
      "file": "sample.wav",
      "result": "real",
      "is_real": true,
      "score": 0.89,
      "threshold": 0.5,
      "confidence": 0.94,
      "qa_passed": true
    },
    {
      "file": "fake.wav",
      "result": "fake",
      "is_real": false,
      "score": 0.12,
      "threshold": 0.5,
      "confidence": 0.97,
      "qa_passed": true
    }
  ]
}

WebRTC Voice Agent

Real-time bidirectional voice conversation using WebRTC. The client negotiates a peer connection via HTTP signalling endpoints, then sends mic audio and receives agent speech over the peer connection. Text events are delivered through an RTCDataChannel.

POST  /offer

Signal SDP Offer

Exchange SDP with the server to negotiate the peer connection. Send the browser's SDP offer along with voice configuration. The server returns an SDP answer and a pc_id needed for subsequent ICE candidate exchange.

ICE Servers: Use the provided STUN/TURN servers for NAT traversal. The server uses bundlePolicy: "max-bundle" and rtcpMuxPolicy: "require".

RTCPeerConnection Config
STUN
string
stun:webrtc.svisor.vn:3478
TURN / TURNS
string
turn:webrtc.svisor.vn:3478 and turns:webrtc.svisor.vn:3478
username: coturnuser  |  credential: coturnpass@spX2025

Request Body — application/json
sdp
string
required
SDP string from pc.localDescription.sdp.
type
string
required
SDP type — always "offer".
speaker_id
string
optional
TTS speaker UUID for the agent's voice.
user_system_prompt
string
optional
System prompt for the LLM agent.
reference_audio
string
optional
Filename of reference audio for the selected speaker (from /api/speakers).
reference_text
string
optional
Reference transcript matching reference_audio.
Response
pc_id
string
Peer connection ID. Required for all subsequent /ice calls.
sdp
string
Server SDP answer. Pass to pc.setRemoteDescription().
type
string
Always "answer".
Full connection flow
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:webrtc.svisor.vn:3478' },
    { urls: 'turns:webrtc.svisor.vn:3478',
      username: 'coturnuser',
      credential: 'coturnpass@spX2025' },
    { urls: 'turn:webrtc.svisor.vn:3478',
      username: 'coturnuser',
      credential: 'coturnpass@spX2025' }
  ],
  bundlePolicy: 'max-bundle',
  rtcpMuxPolicy: 'require'
});

// Add audio transceiver (send + receive)
pc.addTransceiver('audio', { direction: 'sendrecv' });

// Create data channel for text events
const dc = pc.createDataChannel('chat');

// Add local mic track
const stream = await navigator.mediaDevices
  .getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

// Create & send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const BASE = '{WEBRTC_API_URL}';
const res = await fetch(BASE + '/offer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    sdp: offer.sdp, type: offer.type,
    speaker_id: 'uuid-...',
    user_system_prompt: 'You are a helpful assistant.'
  })
});
const answer = await res.json();
const pcId = answer.pc_id; // keep for /ice signalling
await pc.setRemoteDescription(
  { type: answer.type, sdp: answer.sdp }
);
200 OK
{
  "pc_id": "abc123-...",
  "sdp":  "v=0\r\no=- ...",
  "type": "answer"
}
POST  /ice

Signal ICE Candidate

Forward browser ICE candidates to the server as they are generated. Send candidate: null when ICE gathering is complete to signal end-of-candidates. Candidates generated before pc_id is received should be queued and sent after the offer response.

Request Body — application/json
pc_id
string
required
Peer connection ID from /offer response.
candidate
object | null
required
ICE candidate object or null to signal end of gathering.
Fields: candidate, sdpMid, sdpMLineIndex
Send ICE candidates
const BASE = '{WEBRTC_API_URL}';

pc.onicecandidate = async ({ candidate }) => {
  // Queue candidates until pcId is known (see note above)
  await fetch(BASE + '/ice', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pc_id: pcId,
      candidate: candidate
        ? {
            candidate:     candidate.candidate,
            sdpMid:        candidate.sdpMid,
            sdpMLineIndex: candidate.sdpMLineIndex
          }
        : null   // null = end of gathering
    })
  });
};
DC  chat (RTCDataChannel)

DataChannel Protocol

Once the peer connection is established, the "chat" DataChannel carries text events in both directions: JSON events from the server, plain text input from the client. Bidirectional audio flows over the WebRTC audio track.


Server → Client (receive)
transcription
object
{"type":"transcription","text":"..."} — final transcript of what the user said.
agent_text
object
{"type":"agent_text","text":"..."} — streamed LLM token for the agent's response.

Client → Server (send)
text input
string
Plain text string sent via dc.send(text). Bypasses ASR, injects text directly into the LLM pipeline.

Audio (WebRTC track)
mic → server
audio track
Browser mic audio via sendrecv transceiver. Server runs ASR on the incoming stream.
server → speaker
audio track
TTS-synthesized agent voice received via pc.ontrack. Play through an <audio> element.
Handle DataChannel events
dc.onopen = () => console.log('Connected');

dc.onmessage = ({ data }) => {
  const msg = JSON.parse(data);
  if (msg.type === 'transcription') {
    showUserText(msg.text);
  } else if (msg.type === 'agent_text') {
    appendAgentText(msg.text);
  }
};

// Send text input (bypasses ASR)
dc.send('What is the weather today?');
Play remote audio
const audio = document.createElement('audio');
audio.autoplay = true;

pc.ontrack = ({ streams }) => {
  audio.srcObject = streams[0];
  audio.play();
};
DataChannel events
// User speech → ASR result
{ "type": "transcription",
  "text": "Xin chào bạn." }

// Agent LLM response (streamed)
{ "type": "agent_text",
  "text": "Chào bạn! Tôi có thể" }
{ "type": "agent_text",
  "text": " giúp gì cho bạn?" }

Text-to-Speech & Speaker Collections

Synthesize speech via WebSocket with speaker voice selection. Manage speaker libraries and collections for multilang voice routing.

WS  /ws/tts

TTS Synthesis

Synthesize text to speech. Send a JSON payload and receive binary PCM audio frames, followed by a {"type":"end"} JSON message when synthesis is complete.

Client → Server
target_text
string
required
Text to synthesize.
speaker_id
string
optional
Speaker UUID. Uses service default if omitted.
collection_id
string
optional
Collection UUID. Routes to the collection's speaker for the given language.
language
string
optional
Language code (vi, en, fr, ja, ko, zh, ru). Used with collection_id to pick the right speaker.
Default: "vi"

Server → Client
binary frames
binary
Raw PCM audio: int16, 24 kHz, mono. Multiple frames streamed as synthesis progresses.
end event
object
{"type":"end"} — all audio frames have been sent.
Usage
const ws = new WebSocket(
  'ws://localhost:8005/ws/tts'
);
ws.binaryType = 'arraybuffer'; // deliver PCM frames as ArrayBuffer, not Blob

const chunks = [];
ws.onmessage = ({ data }) => {
  if (data instanceof ArrayBuffer) {
    chunks.push(data);
  } else {
    const msg = JSON.parse(data);
    if (msg.type === 'end') playAll(chunks);
  }
};

ws.send(JSON.stringify({
  target_text:   "Xin chào, tôi là Alard.",
  speaker_id:    "uuid-speaker",
  collection_id: "uuid-collection",
  language:      "vi"
}));
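playAll is assumed; one minimal implementation concatenates the frames and reuses the playPcm sketch shown earlier:
playAll (sketch)
function playAll(frames) {
  const total = frames.reduce((n, b) => n + b.byteLength, 0);
  const joined = new Uint8Array(total);
  let off = 0;
  for (const b of frames) {
    joined.set(new Uint8Array(b), off);
    off += b.byteLength;
  }
  playPcm(joined.buffer, 24000); // TTS output is 24 kHz
}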
GET  /api/speakers

List Speakers

Retrieve all available TTS speaker voice profiles for a user.

Query Parameters
user_id
string
optional
Filter speakers by user. Use "default" for shared speakers.
Default: "default"
Response
speakers
array
Array of speaker objects with id, name, reference_audio, and language metadata.
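A minimal request sketch:
Request
const res = await fetch(
  'http://localhost:8005/api/speakers?user_id=default'
);
const { speakers } = await res.json();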
200 OK
{
  "speakers": [
    {
      "id": "de315d3d-bb11-...",
      "name": "Giọng Google Nữ",
      "reference_audio": "vi.wav",
      "language": "vi",
      "user_id": "default"
    },
    {
      "id": "ca8df695-175d-...",
      "name": "F5TTS English 2",
      "reference_audio": "en.wav",
      "language": "en",
      "user_id": "default"
    }
  ]
}
GET  /api/collections

List Collections

Retrieve all speaker collections for a user. A collection maps language codes to specific speakers, enabling automatic multilang TTS routing.

Query Parameters
user_id
string
optional
Filter by user ID.
Default: "default"
Response
collections
array
Array of collection objects.
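A minimal request sketch:
Request
const res = await fetch(
  'http://localhost:8005/api/collections?user_id=default'
);
const { collections } = await res.json();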
200 OK
{
  "collections": [
    {
      "id": "637b6576-a49b-...",
      "name": "Bộ Sưu Tập 1",
      "description": "BST1",
      "user_id": "long",
      "langs": {
        "vi": {"id": "de315d3d-...", "name": "Giọng Google"},
        "en": {"id": "ca8df695-...", "name": "F5TTS English 2"},
        "fr": {"id": "fd8d4043-...", "name": "F5TTS French"},
        "ru": {"id": "f58f853b-...", "name": "F5TTS Russian"}
      }
    }
  ]
}
POST  /api/collections

Create Collection

Create a new speaker collection mapping language codes to speaker UUIDs. Only languages with a speaker assigned need to be included in langs.

Request Body — application/json
name
string
required
Collection display name.
user_id
string
required
Owner user ID. Use "default" for shared.
description
string
optional
Short description.
langs
object
optional
Map of language code → speaker UUID. Keys: vi, en, fr, ja, ko, zh, ru.
Response
id
string
UUID of the created collection.
Request
await fetch('http://localhost:8005/api/collections', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    name: "My Collection",
    user_id: "default",
    description: "Multilang bot voice",
    langs: {
      "vi": "de315d3d-bb11-...",
      "en": "ca8df695-175d-...",
      "fr": "fd8d4043-665a-..."
    }
  })
});
200 OK
{
  "id": "637b6576-a49b-40f2-9172-...",
  "name": "My Collection",
  "user_id": "default"
}
PUT  /api/collections/{id}

Update Collection

Update a collection's name, description, or language-to-speaker mappings.

Path Parameters
id
string
required
Collection UUID.
Request Body — application/json
name
string
optional
New display name.
description
string
optional
New description.
langs
object
optional
New full language map (replaces existing). Keys: vi, en, fr, ja, ko, zh, ru.
Request
await fetch(
  'http://localhost:8005/api/collections/637b...',
  {
    method: 'PUT',
    headers: {'Content-Type':'application/json'},
    body: JSON.stringify({
      name: "Updated Name",
      langs: { "vi": "uuid-vi", "en": "uuid-en" }
    })
  }
);
DELETE  /api/collections/{id}

Delete Collection

Permanently delete a speaker collection. This cannot be undone.

Path Parameters
id
string
required
Collection UUID to delete.
Response
status
string
"deleted" on success.
Request
await fetch(
  'http://localhost:8005/api/collections/637b...',
  { method: 'DELETE' }
);
200 OK
{ "status": "deleted" }

Voice Chat (External ASR)

Standalone voice chat server that bridges to an external ASR WebSocket instead of running its own VAD pipeline. Identical protocol to the main server at :8124.

WS  /ws/voice-chat

Voice Chat — External ASR Bridge

Same protocol as :8124/ws/voice-chat. Designed for use when you want to route ASR through a separate endpoint (e.g. the VAD streaming endpoint at :8124/ws/transcribe/vad). Set asr_ws_url in the config frame to connect the bridge.

Architecture: Client audio → this server → ASR WS → transcript → LLM → TTS WS → binary PCM → client.
When asr_ws_url is set, audio is forwarded to the external ASR. When omitted, built-in VAD is used.
Config frame (subset)
asr_ws_url
string
optional
External ASR WebSocket URL. E.g. ws://localhost:8124/ws/transcribe/vad
tts_ws_url
string
optional
TTS WebSocket URL. E.g. ws://localhost:8005/ws/tts

All other messages and response events are identical to :8124/ws/voice-chat.

Config with external ASR
const ws = new WebSocket(
  'ws://localhost:8125/ws/voice-chat'
);
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'config',
    asr_ws_url:
      'ws://localhost:8124/ws/transcribe/vad',
    tts_ws_url:
      'ws://localhost:8005/ws/tts',
    speaker_id:    'uuid-speaker',
    collection_id: 'uuid-collection',
    language:      'vi',
    llm_model:     'gpt-4o-mini'
  }));
};