Automatic Speech Recognition

Convert audio to text using a fine-tuned Whisper model with GPU acceleration. Supports single-file transcription over HTTP and real-time streaming with VAD-based end-of-turn detection over WebSocket.

POST  /transcribe

Transcribe Audio File

Upload an audio file and receive the full transcript. Internally resamples to 16 kHz mono before inference. Long files are auto-split using Silero-VAD.

Request — multipart/form-data
file
binary
required
Audio file. Accepted: wav, mp3, flac, ogg, m4a, webm.
language
string
optional
ISO 639-1 code to force language. Omit for auto-detect.
Default: null
Response
filename
string
Original filename.
text
string
Full transcript.
segments
array
Per-segment objects with start, end, text, tokens, avg_logprob, no_speech_prob.
duration
float
Audio duration in seconds.
processing_time
float
Inference time in seconds.
Request
const form = new FormData();
form.append('file', blob, 'audio.wav');
const res = await fetch(
  'http://localhost:8124/transcribe',
  { method: 'POST', body: form }
);
const data = await res.json();
200 OK
{
  "filename": "audio.wav",
  "text": "Xin chào, tôi cần hỗ trợ.",
  "segments": [{
    "id": 0, "start": 0.0, "end": 3.5,
    "text": "Xin chào, tôi cần hỗ trợ.",
    "avg_logprob": -0.15,
    "no_speech_prob": 0.001
  }],
  "duration": 3.84,
  "processing_time": 0.62
}
POST  /transcribe-with-property

Transcribe with Diarization

Transcribe with word-level timestamps and speaker diarization. Uses pyannote.audio to identify and label multiple speakers in the recording.

Request — multipart/form-data
file
binary
required
Audio file (wav, mp3, flac, ogg, m4a).
language
string
optional
Language code. Auto-detect if omitted.
Response
dialogues
array
Array of speaker turns, each with speaker label, start/end, text, and word-level timestamps.
processing_time
float
Total processing time in seconds.
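The request is built the same way as /transcribe; a minimal sketch (audioBlob is assumed):
Request
const form = new FormData();
form.append('file', audioBlob, 'meeting.wav');
// Optional: force language
// form.append('language', 'vi');
const res = await fetch(
  'http://localhost:8124/transcribe-with-property',
  { method: 'POST', body: form }
);
const { dialogues, processing_time } = await res.json();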
200 OK
{
  "dialogues": [
    {
      "speaker": "SPEAKER_00",
      "start": 0.0, "end": 4.2,
      "text": "Xin chào bạn.",
      "words": [
        {"word": "Xin",   "start": 0.05, "end": 0.25},
        {"word": "chào",  "start": 0.28, "end": 0.55},
        {"word": "bạn.",  "start": 0.60, "end": 0.90}
      ]
    },
    {
      "speaker": "SPEAKER_01",
      "start": 4.5, "end": 7.1,
      "text": "Vâng, tôi nghe rồi.",
      "words": [...]
    }
  ],
  "processing_time": 12.5
}
WS  /ws/transcribe/vad

Real-time Streaming Transcription

Stream PCM audio and receive partial transcripts as the user speaks, followed by a final transcript (is_final: true) when silence is detected. Uses Silero-VAD for end-of-turn detection.

Recommended: 100 ms chunks with silence_ms=300 for minimal latency. Chunk duration must be shorter than the silence threshold.

Query Parameters
ssap
bool
optional
Enable Semantic Social Audio Profiler. Emits voice properties (gender, age, emotion) alongside transcripts.
Default: false

Client → Server
audio_event
object
PCM audio chunk. Must be int16, 16 kHz, mono, base64-encoded in audio_event.audio_base_64.
config
object
{"type":"config","ssap":true} — Enable SSAP mid-stream.
end
object
{"type":"end"} — Flush buffer, emit final, close.

Server → Client
type
"transcript" | "ssap" | "error"
Event type.
text
string
Transcribed text (present on transcript events).
is_final
boolean
true = end-of-turn (silence > 300 ms). Forward to LLM. false = partial preview.
properties
object
On type:"ssap" — voice profile: {gender, age, emotion} each as Record<string, float>.

Server-side defaults
silence_ms
int
End-of-turn silence threshold.
Default: 300
volume_threshold
float
RMS amplitude gate (~-40 dBFS). Filters breath and background noise.
Default: 0.01
partial_interval
float
Minimum voiced audio (sec) before emitting a partial.
Default: 1.0
Connect & Stream
const ws = new WebSocket(
  'ws://localhost:8124/ws/transcribe/vad'
);

// Float32 mic samples → Int16 PCM → base64
function toBase64(f32) {
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    // Clamp to [-1, 1] to avoid int16 overflow on hot signals
    const s = Math.max(-1, Math.min(1, f32[i]));
    i16[i] = Math.round(s * 32767);
  }
  const u8 = new Uint8Array(i16.buffer);
  let str = '';
  u8.forEach(b => str += String.fromCharCode(b));
  return btoa(str);
}

ws.send(JSON.stringify({
  audio_event: { audio_base_64: toBase64(chunk) }
}));
ws.send(JSON.stringify({ type: 'end' }));
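Producing the chunk values above is left to the client; a minimal capture sketch using getUserMedia and a ScriptProcessorNode, reusing ws and toBase64 from above (2048 samples at 16 kHz ≈ 128 ms, close to the recommended 100 ms; an AudioWorklet is the modern alternative):
Capture mic chunks (sketch)
const ctx = new AudioContext({ sampleRate: 16000 });
const mic = await navigator.mediaDevices
  .getUserMedia({ audio: true });
const src = ctx.createMediaStreamSource(mic);
// 2048 samples @ 16 kHz ≈ 128 ms per chunk
const proc = ctx.createScriptProcessor(2048, 1, 1);
proc.onaudioprocess = (e) => {
  const chunk = e.inputBuffer.getChannelData(0);
  ws.send(JSON.stringify({
    audio_event: { audio_base_64: toBase64(chunk) }
  }));
};
src.connect(proc);
proc.connect(ctx.destination);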
Handle events
ws.onmessage = ({ data }) => {
  const msg = JSON.parse(data);

  if (msg.type === 'ssap') {
    updateProfile(msg.properties); return;
  }
  if (msg.type === 'error') {
    console.error(msg.message); return;
  }
  if (msg.is_final) sendToLLM(msg.text);
  else              showPartial(msg.text);
};
Partial (is_final: false)
{ "type": "transcript",
  "text": "Xin chào bạn",
  "is_final": false }
End-of-turn (is_final: true)
{ "type": "transcript",
  "text": "Xin chào bạn ơi.",
  "is_final": true }
SSAP event
{ "type": "ssap",
  "properties": {
    "gender":  {"female": 0.92, "male": 0.08},
    "age":     {"adult": 0.75, "young": 0.2},
    "emotion": {"neutral": 0.6, "happy": 0.3}
  }
}
WS  /ws/voice-chat

Interactive Voice Chat

Full ASR → LLM → TTS pipeline over a single WebSocket. Send mic audio, receive agent responses as both text events and binary PCM audio. Supports barge-in, proactive follow-ups, and SSAP voice profiling.

First message must be a config frame. The server replies with {"type":"ready"} after accepting configuration.

Client → Server
config
object
1st message
Session configuration. All fields optional except type.
See schema below.
audio_event
object
PCM audio: audio_event.audio_base_64 — int16, 16 kHz, mono, base64.
text_input
object
{"type":"text_input","text":"..."} — bypass ASR, inject text directly.
end
object
{"type":"end"} — graceful disconnect.

Server → Client
ready
object
{"type":"ready"} — connection established, ready to receive audio.
transcript
object
ASR result: {"type":"transcript","text":"...","is_final":bool}.
thinking
object
{"type":"thinking"} — LLM is generating.
agent_text
object
{"type":"agent_text","text":"..."} — streamed LLM token.
agent_end
object
{"type":"agent_end"} — LLM response complete.
binary audio
binary
Raw PCM frames: int16, 24 kHz, mono. Interleaved with agent_text events.
ssap
object
{"type":"ssap","properties":{gender,age,emotion}} — voice profile (if ssap enabled).
latency
object
{"type":"latency","step":"llm|tts_first","value":"142ms"} — timing metrics.
error
object
{"type":"error","source":"llm|tts|asr","message":"..."}
disconnect
object
{"type":"disconnect","reason":"timeout"} — server-initiated close after inactivity.

Config frame fields
speaker_id
string
optional
TTS voice UUID. Uses server default if omitted.
collection_id
string
optional
Speaker collection UUID. Enables multilang speaker routing.
language
string
optional
Default language code for ASR and TTS (vi, en, fr, ja, ko, zh, ru).
Default: "vi"
tts_ws_url
string
optional
WebSocket URL of TTS service. Uses TTS_WS_URL env if omitted.
asr_ws_url
string
optional
External ASR WebSocket URL. Uses built-in VAD pipeline if omitted.
llm_base_url
string
optional
OpenAI-compatible LLM base URL.
llm_model
string
optional
Model name (e.g. gpt-4o, gpt-4o-mini).
ssap
boolean
optional
Enable voice profiling (gender/age/emotion). Emits ssap events.
Default: false
Connect & Configure
const ws = new WebSocket(
  'ws://localhost:8124/ws/voice-chat'
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'config',
    speaker_id: 'uuid-...',
    collection_id: 'uuid-...',
    language: 'vi',
    llm_model: 'gpt-4o-mini',
    ssap: true
  }));
};
Handle messages
ws.binaryType = 'arraybuffer'; // deliver binary frames as ArrayBuffer, not Blob

ws.onmessage = ({ data }) => {
  // Binary = TTS PCM audio
  if (data instanceof ArrayBuffer) {
    playPcm(data, 24000); return;
  }
  const msg = JSON.parse(data);
  switch (msg.type) {
    case 'ready':      onReady(); break;
    case 'transcript': onTranscript(msg); break;
    case 'thinking':   showSpinner(); break;
    case 'agent_text': appendText(msg.text); break;
    case 'agent_end':  onResponseEnd(); break;
    case 'ssap':       updateProfile(msg); break;
    case 'latency':    logLatency(msg); break;
    case 'error':      onError(msg); break;
  }
};
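playPcm above is assumed; a minimal sketch that converts int16 PCM to a Web Audio buffer and plays it (naive immediate playback; a real client should queue and schedule frames):
playPcm (sketch)
const audioCtx = new AudioContext();
function playPcm(buf, sampleRate) {
  const i16 = new Int16Array(buf);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++)
    f32[i] = i16[i] / 32768; // int16 → [-1, 1)
  const ab = audioCtx.createBuffer(1, f32.length, sampleRate);
  ab.copyToChannel(f32, 0);
  const node = audioCtx.createBufferSource();
  node.buffer = ab;
  node.connect(audioCtx.destination);
  node.start();
}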
Language detection prefixes
// Server detects lang from ASR output prefix:
// [vi] → Vietnamese TTS speaker
// [en] → English TTS speaker
// [fr] → French TTS speaker
// [ja] → Japanese TTS speaker
// [ko] → Korean TTS speaker
// [zh] → Chinese TTS speaker
// [ru] → Russian TTS speaker
// Uses collection_id to pick correct speaker

Speaker Verification & Anti-Spoof

Register voice prints and verify speaker identity using embedding cosine similarity with AS-Norm scoring. Detect synthetic or replayed audio with an ONNX-based anti-spoofing model.

POST  /voice/register

Register Speaker

Upload a reference audio file to create a speaker voice print. Returns a speaker_id UUID for subsequent verification calls.

Request — multipart/form-data
file
binary
required
Reference audio (wav, mp3, flac, ogg). Minimum 3 seconds recommended.
name
string
required
Display name for the speaker.
description
string
required
Short description of the speaker.
Response
status
string
"success"
speaker_id
string
UUID for this speaker voice print.
file_path
string
Stored path of the reference WAV.
Request
const form = new FormData();
form.append('file', audioBlob, 'ref.wav');
form.append('name', 'Nguyen Van A');
form.append('description', 'Customer support');
const res = await fetch(
  'http://localhost:8124/voice/register',
  { method: 'POST', body: form }
);
200 OK
{
  "status": "success",
  "speaker_id": "a1b2c3d4-...",
  "file_path": "speakers/a1b2c3d4-....wav"
}
POST  /voice/verify/{speaker_id}

Verify Against Registered Speaker

Compare one or more audio files against a registered speaker's voice print. Returns a score and ACCEPT/REJECT verdict for each file.

Path Parameters
speaker_id
string
required
UUID from /voice/register.
Request — multipart/form-data
files
binary[]
required
One or more audio files to verify.
Response
results
array
Per-file: file, score (float), verdict ("ACCEPT"|"REJECT"), path.
Threshold: score > 3.5 → ACCEPT (AS-Norm cosine similarity, higher = more similar).
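A minimal request sketch (speakerId and the test blobs are assumed):
Request
const form = new FormData();
form.append('files', t1Blob, 'test1.wav');
form.append('files', t2Blob, 'test2.wav');
const res = await fetch(
  `http://localhost:8124/voice/verify/${speakerId}`,
  { method: 'POST', body: form }
);
const { results } = await res.json();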
200 OK
{
  "speaker_id": "a1b2c3d4-...",
  "reference": "speakers/a1b2.wav",
  "results": [
    {
      "file": "test1.wav",
      "score": 4.21,
      "verdict": "ACCEPT",
      "path": "tmp/test1.wav"
    },
    {
      "file": "test2.wav",
      "score": 1.03,
      "verdict": "REJECT",
      "path": "tmp/test2.wav"
    }
  ]
}
POST  /voice/verify_with_reference

Verify Against Custom Reference

Ad-hoc verification without prior registration. Upload a reference file and test files in one request.

Request — multipart/form-data
reference
binary
required
Reference audio file.
files
binary[]
required
Test audio files to verify.
Response
reference_file
string
Reference filename.
results
array
Same structure as /voice/verify.
Request
const form = new FormData();
form.append('reference', refBlob, 'ref.wav');
form.append('files',     t1Blob,  'test1.wav');
form.append('files',     t2Blob,  'test2.wav');
const res = await fetch(
  '.../voice/verify_with_reference',
  { method: 'POST', body: form }
);
200 OK
{
  "reference_file": "ref.wav",
  "results": [
    {"file":"test1.wav","score":3.9,"verdict":"ACCEPT"},
    {"file":"test2.wav","score":0.8,"verdict":"REJECT"}
  ]
}
POST  /voice/anti-spoof

Anti-Spoofing Detection

Detect AI-generated or replayed audio to distinguish real human speech from synthetic voices. Uses an ONNX-based model with three strictness modes.

Request — multipart/form-data
files
binary[]
required
Audio files to analyze.
mode
string
optional
Detection strictness:
  • bank_top_security — strictest, low false accept rate
  • bank_flex — balanced (recommended)
  • consumer — lenient, fewer false rejects
Default: bank_flex
Response
results
array
Per-file: file, result ("real"|"fake"), is_real, score, threshold, confidence, qa_passed.
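A minimal request sketch (sampleBlob is assumed; mode is optional):
Request
const form = new FormData();
form.append('files', sampleBlob, 'sample.wav');
form.append('mode', 'bank_flex'); // optional
const res = await fetch(
  'http://localhost:8124/voice/anti-spoof',
  { method: 'POST', body: form }
);
const { results } = await res.json();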
200 OK
{
  "results": [
    {
      "file": "sample.wav",
      "result": "real",
      "is_real": true,
      "score": 0.89,
      "threshold": 0.5,
      "confidence": 0.94,
      "qa_passed": true
    },
    {
      "file": "fake.wav",
      "result": "fake",
      "is_real": false,
      "score": 0.12,
      "threshold": 0.5,
      "confidence": 0.97,
      "qa_passed": true
    }
  ]
}

WebRTC Voice Agent

Real-time bidirectional voice conversation using WebRTC. The client negotiates a peer connection via HTTP signalling endpoints, then sends mic audio and receives agent speech over the peer connection. Text events are delivered through an RTCDataChannel.

POST  /offer

Signal SDP Offer

Exchange SDP with the server to negotiate the peer connection. Send the browser's SDP offer along with voice configuration. The server returns an SDP answer and a pc_id needed for subsequent ICE candidate exchange.

ICE Servers: Use the provided STUN/TURN servers for NAT traversal. The server uses bundlePolicy: "max-bundle" and rtcpMuxPolicy: "require".

RTCPeerConnection Config
STUN
string
stun:webrtc.svisor.vn:3478
TURN / TURNS
string
turn:webrtc.svisor.vn:3478 and turns:webrtc.svisor.vn:3478
username: coturnuser  |  credential: coturnpass@spX2025

Request Body — application/json
sdp
string
required
SDP string from pc.localDescription.sdp.
type
string
required
SDP type — always "offer".
speaker_id
string
optional
TTS speaker UUID for the agent's voice.
user_system_prompt
string
optional
System prompt for the LLM agent.
reference_audio
string
optional
Filename of reference audio for the selected speaker (from /api/speakers).
reference_text
string
optional
Reference transcript matching reference_audio.
Response
pc_id
string
Peer connection ID. Required for all subsequent /ice calls.
sdp
string
Server SDP answer. Pass to pc.setRemoteDescription().
type
string
Always "answer".
Full connection flow
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: 'stun:webrtc.svisor.vn:3478' },
    { urls: 'turns:webrtc.svisor.vn:3478',
      username: 'coturnuser',
      credential: 'coturnpass@spX2025' },
    { urls: 'turn:webrtc.svisor.vn:3478',
      username: 'coturnuser',
      credential: 'coturnpass@spX2025' }
  ],
  bundlePolicy: 'max-bundle',
  rtcpMuxPolicy: 'require'
});

// Add audio transceiver (send + receive)
pc.addTransceiver('audio', { direction: 'sendrecv' });

// Create data channel for text events
const dc = pc.createDataChannel('chat');

// Add local mic track
const stream = await navigator.mediaDevices
  .getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));

// Create & send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const BASE = '{WEBRTC_API_URL}';
const res = await fetch(BASE + '/offer', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    sdp: offer.sdp, type: offer.type,
    speaker_id: 'uuid-...',
    user_system_prompt: 'You are a helpful assistant.'
  })
});
const answer = await res.json();
const pcId = answer.pc_id; // keep for /ice signalling
await pc.setRemoteDescription(
  { type: answer.type, sdp: answer.sdp }
);
200 OK
{
  "pc_id": "abc123-...",
  "sdp":  "v=0\r\no=- ...",
  "type": "answer"
}
POST  /ice

Signal ICE Candidate

Forward browser ICE candidates to the server as they are generated. Send candidate: null when ICE gathering is complete to signal end-of-candidates. Candidates generated before pc_id is received should be queued and sent after the offer response.

Request Body — application/json
pc_id
string
required
Peer connection ID from /offer response.
candidate
object | null
required
ICE candidate object or null to signal end of gathering.
Fields: candidate, sdpMid, sdpMLineIndex
Send ICE candidates
const BASE = '{WEBRTC_API_URL}';

pc.onicecandidate = async ({ candidate }) => {
  // Queue candidates until pcId is known (see note above)
  await fetch(BASE + '/ice', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      pc_id: pcId,
      candidate: candidate
        ? {
            candidate:     candidate.candidate,
            sdpMid:        candidate.sdpMid,
            sdpMLineIndex: candidate.sdpMLineIndex
          }
        : null   // null = end of gathering
    })
  });
};
DC  chat (RTCDataChannel)

DataChannel Protocol

Once the peer connection is established, the "chat" DataChannel carries text events in both directions: JSON events from the server, plain text input from the client. Bidirectional audio flows over the WebRTC audio track.


Server → Client (receive)
transcription
object
{"type":"transcription","text":"..."} — final transcript of what the user said.
agent_text
object
{"type":"agent_text","text":"..."} — streamed LLM token for the agent's response.

Client → Server (send)
text input
string
Plain text string sent via dc.send(text). Bypasses ASR, injects text directly into the LLM pipeline.

Audio (WebRTC track)
mic → server
audio track
Browser mic audio via sendrecv transceiver. Server runs ASR on the incoming stream.
server → speaker
audio track
TTS-synthesized agent voice received via pc.ontrack. Play through an <audio> element.
Handle DataChannel events
dc.onopen = () => console.log('Connected');

dc.onmessage = ({ data }) => {
  const msg = JSON.parse(data);
  if (msg.type === 'transcription') {
    showUserText(msg.text);
  } else if (msg.type === 'agent_text') {
    appendAgentText(msg.text);
  }
};

// Send text input (bypasses ASR)
dc.send('What is the weather today?');
Play remote audio
const audio = document.createElement('audio');
audio.autoplay = true;

pc.ontrack = ({ streams }) => {
  audio.srcObject = streams[0];
  audio.play();
};
DataChannel events
// User speech → ASR result
{ "type": "transcription",
  "text": "Xin chào bạn." }

// Agent LLM response (streamed)
{ "type": "agent_text",
  "text": "Chào bạn! Tôi có thể" }
{ "type": "agent_text",
  "text": " giúp gì cho bạn?" }

Text-to-Speech & Speaker Collections

Synthesize speech via WebSocket with speaker voice selection. Manage speaker libraries and collections for multilang voice routing.

WS  /ws/tts

TTS Synthesis

Synthesize text to speech. Send a JSON payload and receive binary PCM audio frames, followed by a {"type":"end"} JSON message when synthesis is complete.

Client → Server
target_text
string
required
Text to synthesize.
speaker_id
string
optional
Speaker UUID. Uses service default if omitted.
collection_id
string
optional
Collection UUID. Routes to the collection's speaker for the given language.
language
string
optional
Language code (vi, en, fr, ja, ko, zh, ru). Used with collection_id to pick the right speaker.
Default: "vi"

Server → Client
binary frames
binary
Raw PCM audio: int16, 24 kHz, mono. Multiple frames streamed as synthesis progresses.
end event
object
{"type":"end"} — all audio frames have been sent.
Usage
const ws = new WebSocket(
  'ws://localhost:8005/ws/tts'
);
ws.binaryType = 'arraybuffer'; // deliver PCM frames as ArrayBuffer, not Blob

const chunks = [];
ws.onmessage = ({ data }) => {
  if (data instanceof ArrayBuffer) {
    chunks.push(data);
  } else {
    const msg = JSON.parse(data);
    if (msg.type === 'end') playAll(chunks);
  }
};

ws.send(JSON.stringify({
  target_text:   "Xin chào, tôi là Alard.",
  speaker_id:    "uuid-speaker",
  collection_id: "uuid-collection",
  language:      "vi"
}));
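playAll is assumed; one minimal implementation concatenates the frames and reuses the playPcm sketch shown earlier:
playAll (sketch)
function playAll(frames) {
  const total = frames.reduce((n, b) => n + b.byteLength, 0);
  const joined = new Uint8Array(total);
  let off = 0;
  for (const b of frames) {
    joined.set(new Uint8Array(b), off);
    off += b.byteLength;
  }
  playPcm(joined.buffer, 24000); // TTS output is 24 kHz
}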
GET  /api/speakers

List Speakers

Retrieve all available TTS speaker voice profiles for a user.

Query Parameters
user_id
string
optional
Filter speakers by user. Use "default" for shared speakers.
Default: "default"
Response
speakers
array
Array of speaker objects with id, name, reference_audio, and language metadata.
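A minimal request sketch:
Request
const res = await fetch(
  'http://localhost:8005/api/speakers?user_id=default'
);
const { speakers } = await res.json();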
200 OK
{
  "speakers": [
    {
      "id": "de315d3d-bb11-...",
      "name": "Giọng Google Nữ",
      "reference_audio": "vi.wav",
      "language": "vi",
      "user_id": "default"
    },
    {
      "id": "ca8df695-175d-...",
      "name": "F5TTS English 2",
      "reference_audio": "en.wav",
      "language": "en",
      "user_id": "default"
    }
  ]
}
GET  /api/collections

List Collections

Retrieve all speaker collections for a user. A collection maps language codes to specific speakers, enabling automatic multilang TTS routing.

Query Parameters
user_id
string
optional
Filter by user ID.
Default: "default"
Response
collections
array
Array of collection objects.
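A minimal request sketch:
Request
const res = await fetch(
  'http://localhost:8005/api/collections?user_id=default'
);
const { collections } = await res.json();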
200 OK
{
  "collections": [
    {
      "id": "637b6576-a49b-...",
      "name": "Bộ Sưu Tập 1",
      "description": "BST1",
      "user_id": "long",
      "langs": {
        "vi": {"id": "de315d3d-...", "name": "Giọng Google"},
        "en": {"id": "ca8df695-...", "name": "F5TTS English 2"},
        "fr": {"id": "fd8d4043-...", "name": "F5TTS French"},
        "ru": {"id": "f58f853b-...", "name": "F5TTS Russian"}
      }
    }
  ]
}
POST  /api/collections

Create Collection

Create a new speaker collection mapping language codes to speaker UUIDs. Only languages with a speaker assigned need to be included in langs.

Request Body — application/json
name
string
required
Collection display name.
user_id
string
required
Owner user ID. Use "default" for shared.
description
string
optional
Short description.
langs
object
optional
Map of language code → speaker UUID. Keys: vi, en, fr, ja, ko, zh, ru.
Response
id
string
UUID of the created collection.
Request
await fetch('http://localhost:8005/api/collections', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    name: "My Collection",
    user_id: "default",
    description: "Multilang bot voice",
    langs: {
      "vi": "de315d3d-bb11-...",
      "en": "ca8df695-175d-...",
      "fr": "fd8d4043-665a-..."
    }
  })
});
200 OK
{
  "id": "637b6576-a49b-40f2-9172-...",
  "name": "My Collection",
  "user_id": "default"
}
PUT  /api/collections/{id}

Update Collection

Update a collection's name, description, or language-to-speaker mappings.

Path Parameters
id
string
required
Collection UUID.
Request Body — application/json
name
string
optional
New display name.
description
string
optional
New description.
langs
object
optional
New full language map (replaces existing). Keys: vi, en, fr, ja, ko, zh, ru.
Request
await fetch(
  'http://localhost:8005/api/collections/637b...',
  {
    method: 'PUT',
    headers: {'Content-Type':'application/json'},
    body: JSON.stringify({
      name: "Updated Name",
      langs: { "vi": "uuid-vi", "en": "uuid-en" }
    })
  }
);
DELETE  /api/collections/{id}

Delete Collection

Permanently delete a speaker collection. This cannot be undone.

Path Parameters
id
string
required
Collection UUID to delete.
Response
status
string
"deleted" on success.
Request
await fetch(
  'http://localhost:8005/api/collections/637b...',
  { method: 'DELETE' }
);
200 OK
{ "status": "deleted" }

Voice Chat (External ASR)

Standalone voice chat server that bridges to an external ASR WebSocket instead of running its own VAD pipeline. Identical protocol to the main server at :8124.

WS  /ws/voice-chat

Voice Chat — External ASR Bridge

Same protocol as :8124/ws/voice-chat. Designed for use when you want to route ASR through a separate endpoint (e.g. the VAD streaming endpoint at :8124/ws/transcribe/vad). Set asr_ws_url in the config frame to connect the bridge.

Architecture: Client audio → this server → ASR WS → transcript → LLM → TTS WS → binary PCM → client.
When asr_ws_url is set, audio is forwarded to the external ASR. When omitted, built-in VAD is used.
Config frame (subset)
asr_ws_url
string
optional
External ASR WebSocket URL. E.g. ws://localhost:8124/ws/transcribe/vad
tts_ws_url
string
optional
TTS WebSocket URL. E.g. ws://localhost:8005/ws/tts

All other messages and response events are identical to :8124/ws/voice-chat.

Config with external ASR
const ws = new WebSocket(
  'ws://localhost:8125/ws/voice-chat'
);
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'config',
    asr_ws_url:
      'ws://localhost:8124/ws/transcribe/vad',
    tts_ws_url:
      'ws://localhost:8005/ws/tts',
    speaker_id:    'uuid-speaker',
    collection_id: 'uuid-collection',
    language:      'vi',
    llm_model:     'gpt-4o-mini'
  }));
};