Automatic Speech Recognition
Convert audio to text using a fine-tuned Whisper model with GPU acceleration. Single-file transcription via HTTP and real-time streaming with VAD end-of-turn detection via WebSocket.
Transcribe Audio File
Upload an audio file and receive the full transcript. Internally resamples to 16 kHz mono before inference. Long files are auto-split using Silero-VAD.
const form = new FormData();
form.append('file', blob, 'audio.wav');
const res = await fetch(
'http://localhost:8124/transcribe',
{ method: 'POST', body: form }
);
const data = await res.json();

Response:

{
"filename": "audio.wav",
"text": "Xin chào, tôi cần hỗ trợ.",
"segments": [{
"id": 0, "start": 0.0, "end": 3.5,
"text": "Xin chào, tôi cần hỗ trợ.",
"avg_logprob": -0.15,
"no_speech_prob": 0.001
}],
"duration": 3.84,
"processing_time": 0.62
}

Transcribe with Diarization
Transcribe with word-level timestamps and speaker diarization. Uses pyannote.audio to identify and label multiple speakers in the recording.
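No request example is shown for this endpoint, so here is a minimal sketch. The route name '/transcribe/diarize' is an assumption and should be confirmed against the actual deployment:

```javascript
// Hypothetical route -- confirm '/transcribe/diarize' against your server.
async function transcribeWithDiarization(blob, filename = 'audio.wav') {
  const form = new FormData();
  form.append('file', blob, filename);
  const res = await fetch('http://localhost:8124/transcribe/diarize', {
    method: 'POST',
    body: form,
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // { dialogues: [...], processing_time }
}
```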
{
"dialogues": [
{
"speaker": "SPEAKER_00",
"start": 0.0, "end": 4.2,
"text": "Xin chào bạn.",
"words": [
{"word": "Xin", "start": 0.05, "end": 0.25},
{"word": "chào", "start": 0.28, "end": 0.55},
{"word": "bạn.", "start": 0.60, "end": 0.90}
]
},
{
"speaker": "SPEAKER_01",
"start": 4.5, "end": 7.1,
"text": "Vâng, tôi nghe rồi.",
"words": [...]
}
],
"processing_time": 12.5
}

Real-time Streaming Transcription
Stream PCM audio and receive partial transcripts as the user speaks, followed by a final is_final event when silence is detected. Uses Silero-VAD for end-of-turn detection.
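Browsers typically capture microphone audio at 44.1 or 48 kHz, while this endpoint expects 16 kHz mono; a simple averaging downsampler (serviceable for speech, though a proper low-pass filter would alias less) can bridge the gap before the Int16 conversion shown below:

```javascript
// Decimate Float32 samples down to 16 kHz by averaging the input
// samples covered by each output sample. Assumes inputRate >= 16000.
function downsampleTo16k(f32, inputRate) {
  const ratio = inputRate / 16000;
  const outLen = Math.floor(f32.length / ratio);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(Math.floor((i + 1) * ratio), f32.length);
    let sum = 0;
    for (let j = start; j < end; j++) sum += f32[j];
    out[i] = sum / (end - start);
  }
  return out;
}
```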
Client → server:
- Audio chunks: base64-encoded Int16 PCM in audio_event.audio_base_64.
- {"type":"config","ssap":true} — Enable SSAP mid-stream (ssap defaults to false).
- {"type":"end"} — Flush buffer, emit final, close.

Server → client:
- Transcript events carry is_final: true = end-of-turn (silence > 300 ms), forward to LLM; false = partial preview.
- type:"ssap" — voice profile: {gender, age, emotion}, each as Record<string, float>.

const ws = new WebSocket(
'ws://localhost:8124/ws/transcribe/vad'
);
// Float32 mic → Int16 → base64
function toBase64(f32) {
const i16 = new Int16Array(f32.length);
for (let i = 0; i < f32.length; i++)
i16[i] = Math.round(f32[i] * 32767);
const u8 = new Uint8Array(i16.buffer);
let s = '';
u8.forEach(b => s += String.fromCharCode(b));
return btoa(s);
}
ws.send(JSON.stringify({
audio_event: { audio_base_64: toBase64(chunk) }
}));
ws.send(JSON.stringify({ type: 'end' }));

ws.onmessage = ({ data }) => {
const msg = JSON.parse(data);
if (msg.type === 'ssap') {
updateProfile(msg.properties); return;
}
if (msg.type === 'error') {
console.error(msg.message); return;
}
if (msg.is_final) sendToLLM(msg.text);
else showPartial(msg.text);
};

Example events:

{ "type": "transcript",
"text": "Xin chào bạn",
"is_final": false }

{ "type": "transcript",
"text": "Xin chào bạn ơi.",
"is_final": true }

{ "type": "ssap",
"properties": {
"gender": {"female": 0.92, "male": 0.08},
"age": {"adult": 0.75, "young": 0.2},
"emotion": {"neutral": 0.6, "happy": 0.3}
}
}

Interactive Voice Chat
Full ASR → LLM → TTS pipeline over a single WebSocket. Send mic audio, receive agent responses as both text events and binary PCM audio. Supports barge-in, proactive follow-ups, and SSAP voice profiling.
Protocol: the client sends a config frame first; the server replies {"type":"ready"} after accepting configuration. Messages are routed by type.

Client → server:
- audio_event.audio_base_64 — int16, 16 kHz, mono, base64.
- {"type":"text_input","text":"..."} — bypass ASR, inject text directly.
- {"type":"end"} — graceful disconnect.

Server → client:
- {"type":"ready"} — connection established, ready to receive audio.
- {"type":"transcript","text":"...","is_final":bool}
- {"type":"thinking"} — LLM is generating.
- {"type":"agent_text","text":"..."} — streamed LLM token.
- {"type":"agent_end"} — LLM response complete.
- {"type":"ssap","properties":{gender,age,emotion}} — voice profile (if ssap enabled).
- {"type":"latency","step":"llm|tts_first","value":"142ms"} — timing metrics.
- {"type":"error","source":"llm|tts|asr","message":"..."}
- {"type":"disconnect","reason":"timeout"} — server-initiated close after inactivity.

Config fields:
- language — defaults to "vi".
- TTS endpoint — falls back to the TTS_WS_URL env if omitted.
- llm_model — LLM to use (e.g. gpt-4o, gpt-4o-mini).
- ssap — defaults to false.

const ws = new WebSocket(
'ws://localhost:8124/ws/voice-chat'
);
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'config',
speaker_id: 'uuid-...',
collection_id: 'uuid-...',
language: 'vi',
llm_model: 'gpt-4o-mini',
ssap: true
}));
};

ws.binaryType = 'arraybuffer'; // browsers default to 'blob'
ws.onmessage = ({ data }) => {
// Binary = TTS PCM audio
if (data instanceof ArrayBuffer) {
playPcm(data, 24000); return;
}
const msg = JSON.parse(data);
switch (msg.type) {
case 'ready': onReady(); break;
case 'transcript': onTranscript(msg); break;
case 'thinking': showSpinner(); break;
case 'agent_text': appendText(msg.text); break;
case 'agent_end': onResponseEnd(); break;
case 'ssap': updateProfile(msg); break;
case 'latency': logLatency(msg); break;
case 'error': onError(msg); break;
}
};

// Server detects lang from ASR output prefix:
// [vi] → Vietnamese TTS speaker
// [en] → English TTS speaker
// [fr] → French TTS speaker
// [ja] → Japanese TTS speaker
// [ko] → Korean TTS speaker
// [zh] → Chinese TTS speaker
// [ru] → Russian TTS speaker
// Uses collection_id to pick correct speaker
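The playPcm helper called in the handler above is not defined anywhere in these examples; a minimal sketch using the Web Audio API, assuming the binary frames are 16-bit mono PCM:

```javascript
// Convert a raw 16-bit PCM buffer to normalized Float32 samples.
function pcm16ToFloat32(arrayBuffer) {
  const i16 = new Int16Array(arrayBuffer);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 32768;
  return f32;
}

// Play one PCM frame through the Web Audio API (browser only).
let audioCtx;
function playPcm(arrayBuffer, sampleRate) {
  audioCtx = audioCtx || new AudioContext();
  const f32 = pcm16ToFloat32(arrayBuffer);
  const buf = audioCtx.createBuffer(1, f32.length, sampleRate);
  buf.copyToChannel(f32, 0);
  const src = audioCtx.createBufferSource();
  src.buffer = buf;
  src.connect(audioCtx.destination);
  src.start();
}
```

For gapless playback of consecutive frames, schedule each src.start() at a running timestamp instead of starting immediately.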
Speaker Verification & Anti-Spoof
Register voice prints and verify speaker identity using embedding cosine similarity with AS-Norm scoring. Detect synthetic/replayed audio with ONNX-based anti-spoofing model.
Register Speaker
Upload a reference audio file to create a speaker voice print. Returns a speaker_id UUID for subsequent verification calls.
const form = new FormData();
form.append('file', audioBlob, 'ref.wav');
form.append('name', 'Nguyen Van A');
form.append('description', 'Customer support');
const res = await fetch(
'http://localhost:8124/voice/register',
{ method: 'POST', body: form }
);

Response:

{
"status": "success",
"speaker_id": "a1b2c3d4-...",
"file_path": "speakers/a1b2c3d4-....wav"
}

Verify Against Registered Speaker
Compare one or more audio files against a registered speaker's voice print. Returns a score and ACCEPT/REJECT verdict for each file.
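A request sketch for this endpoint; the route '/voice/verify' and the form field names 'speaker_id' and 'files' are assumptions inferred from the registration and custom-reference examples:

```javascript
// Field names are assumptions -- confirm against the server's API.
async function verifySpeaker(speakerId, files) {
  const form = new FormData();
  form.append('speaker_id', speakerId);
  for (const [name, blob] of Object.entries(files))
    form.append('files', blob, name);
  const res = await fetch('http://localhost:8124/voice/verify', {
    method: 'POST',
    body: form,
  });
  return res.json(); // { speaker_id, reference, results: [...] }
}
```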
Requires a speaker_id obtained from /voice/register. Each result contains file, score (float), verdict ("ACCEPT"|"REJECT"), and path.

{
"speaker_id": "a1b2c3d4-...",
"reference": "speakers/a1b2.wav",
"results": [
{
"file": "test1.wav",
"score": 4.21,
"verdict": "ACCEPT",
"path": "tmp/test1.wav"
},
{
"file": "test2.wav",
"score": 1.03,
"verdict": "REJECT",
"path": "tmp/test2.wav"
}
]
}

Verify Against Custom Reference
Ad-hoc verification without prior registration. Upload a reference file and test files in one request.
Scoring and verdicts work the same as /voice/verify.

const form = new FormData();
form.append('reference', refBlob, 'ref.wav');
form.append('files', t1Blob, 'test1.wav');
form.append('files', t2Blob, 'test2.wav');
const res = await fetch(
'.../voice/verify_with_reference',
{ method: 'POST', body: form }
);

Response:

{
"reference_file": "ref.wav",
"results": [
{"file":"test1.wav","score":3.9,"verdict":"ACCEPT"},
{"file":"test2.wav","score":0.8,"verdict":"REJECT"}
]
}

Anti-Spoofing Detection
Detect AI-generated or replayed audio to distinguish real human speech from synthetic voices. Uses an ONNX-based model with three strictness modes.
Modes:
- bank_top_security — strictest, low false accept rate
- bank_flex — balanced (recommended)
- consumer — lenient, fewer false rejects
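No request example is shown for this check; a sketch under the assumption that the route is '/voice/anti-spoof' and that the mode is passed as a 'mode' form field (both names are hypothetical):

```javascript
// Route and field names are hypothetical -- verify before use.
async function detectSpoof(files, mode = 'bank_flex') {
  const form = new FormData();
  for (const [name, blob] of Object.entries(files))
    form.append('files', blob, name);
  form.append('mode', mode);
  const res = await fetch('http://localhost:8124/voice/anti-spoof', {
    method: 'POST',
    body: form,
  });
  return res.json(); // { results: [{ file, result, is_real, ... }] }
}
```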
The default mode is bank_flex. Each result contains file, result ("real"|"fake"), is_real, score, threshold, confidence, and qa_passed.

{
"results": [
{
"file": "sample.wav",
"result": "real",
"is_real": true,
"score": 0.89,
"threshold": 0.5,
"confidence": 0.94,
"qa_passed": true
},
{
"file": "fake.wav",
"result": "fake",
"is_real": false,
"score": 0.12,
"threshold": 0.5,
"confidence": 0.97,
"qa_passed": true
}
]
}

WebRTC Voice Agent
Real-time bidirectional voice conversation using WebRTC. The client negotiates a peer connection via HTTP signalling endpoints, then sends mic audio and receives agent speech over the peer connection. Text events are delivered through an RTCDataChannel.
Signal SDP Offer
Exchange SDP with the server to negotiate the peer connection. Send the browser's SDP offer along with voice configuration. The server returns an SDP answer and a pc_id needed for subsequent ICE candidate exchange.
Connection notes:
- Use bundlePolicy: "max-bundle" and rtcpMuxPolicy: "require".
- STUN: stun:webrtc.svisor.vn:3478
- TURN: turn:webrtc.svisor.vn:3478 and turns:webrtc.svisor.vn:3478 (username: coturnuser | credential: coturnpass@spX2025)

Offer body:
- sdp — pc.localDescription.sdp.
- type — "offer".
- speaker_id — a speaker from the speaker list (/api/speakers); its reference_audio selects the agent voice.

Response:
- pc_id — keep it for subsequent /ice calls.
- sdp, type — pass the returned description to pc.setRemoteDescription(); type is "answer".

const pc = new RTCPeerConnection({
iceServers: [
{ urls: 'stun:webrtc.svisor.vn:3478' },
{ urls: 'turns:webrtc.svisor.vn:3478',
username: 'coturnuser',
credential: 'coturnpass@spX2025' },
{ urls: 'turn:webrtc.svisor.vn:3478',
username: 'coturnuser',
credential: 'coturnpass@spX2025' }
],
bundlePolicy: 'max-bundle',
rtcpMuxPolicy: 'require'
});
// Add audio transceiver (send + receive)
pc.addTransceiver('audio', { direction: 'sendrecv' });
// Create data channel for text events
const dc = pc.createDataChannel('chat');
// Add local mic track
const stream = await navigator.mediaDevices
.getUserMedia({ audio: true });
stream.getTracks().forEach(t => pc.addTrack(t, stream));
// Create & send offer
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);
const BASE = '{WEBRTC_API_URL}';
const res = await fetch(BASE + '/offer', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
sdp: offer.sdp, type: offer.type,
speaker_id: 'uuid-...',
user_system_prompt: 'You are a helpful assistant.'
})
});
const answer = await res.json();
const pcId = answer.pc_id;
await pc.setRemoteDescription(answer);

Response:

{
"pc_id": "abc123-...",
"sdp": "v=0\r\no=- ...",
"type": "answer"
}

Signal ICE Candidate
Forward browser ICE candidates to the server as they are generated. Send candidate: null when ICE gathering is complete to signal end-of-candidates. Candidates generated before pc_id is received should be queued and sent after the offer response.
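The queueing requirement above can be sketched as a small buffer that holds candidates until pc_id is known, then flushes them in order (the 'post' callback stands in for the fetch to the /ice endpoint):

```javascript
// Buffer ICE candidates until pc_id arrives, then flush in order.
function makeIceQueue(post /* (pcId, candidate) => void */) {
  let pcId = null;
  const pending = [];
  return {
    push(candidate) {
      if (pcId === null) pending.push(candidate);
      else post(pcId, candidate);
    },
    setPcId(id) {
      pcId = id;
      while (pending.length) post(pcId, pending.shift());
    },
  };
}
```

Wire push into pc.onicecandidate and call setPcId once the /offer response arrives; the null end-of-gathering marker flows through the same path.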
Request body:
- pc_id — from the /offer response.
- candidate — the ICE candidate, or null to signal end of gathering. Candidate fields: candidate, sdpMid, sdpMLineIndex.

const BASE = '{WEBRTC_API_URL}';
pc.onicecandidate = async ({ candidate }) => {
await fetch(BASE + '/ice', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
pc_id: pcId,
candidate: candidate
? {
candidate: candidate.candidate,
sdpMid: candidate.sdpMid,
sdpMLineIndex: candidate.sdpMLineIndex
}
: null // null = end of gathering
})
});
};

DataChannel Protocol
Once the peer connection is established, the "chat" DataChannel carries JSON text events in both directions. Bidirectional audio flows over the WebRTC audio track.
Server → client (over the DataChannel):
- {"type":"transcription","text":"..."} — final transcript of what the user said.
- {"type":"agent_text","text":"..."} — streamed LLM token for the agent's response.

Client → server:
- Plain text via dc.send(text). Bypasses ASR, injects text directly into the LLM pipeline.

Audio:
- Mic audio goes out on the sendrecv transceiver. Server runs ASR on the incoming stream.
- Agent speech arrives via pc.ontrack. Play through an <audio> element.

dc.onopen = () => console.log('Connected');
dc.onmessage = ({ data }) => {
const msg = JSON.parse(data);
if (msg.type === 'transcription') {
showUserText(msg.text);
} else if (msg.type === 'agent_text') {
appendAgentText(msg.text);
}
};
// Send text input (bypasses ASR)
dc.send('What is the weather today?');

const audio = document.createElement('audio');
audio.autoplay = true;
pc.ontrack = ({ streams }) => {
audio.srcObject = streams[0];
audio.play();
};

// User speech → ASR result
{ "type": "transcription",
"text": "Xin chào bạn." }
// Agent LLM response (streamed)
{ "type": "agent_text",
"text": "Chào bạn! Tôi có thể" }
{ "type": "agent_text",
"text": " giúp gì cho bạn?" }

Text-to-Speech & Speaker Collections
Synthesize speech via WebSocket with speaker voice selection. Manage speaker libraries and collections for multilingual voice routing.
TTS Synthesis
Synthesize text to speech. Send a JSON payload and receive binary PCM audio frames, followed by a {"type":"end"} JSON message when synthesis is complete.
Request fields:
- collection_id — used to pick the right speaker for the language.
- language — defaults to "vi".

Completion event: {"type":"end"} — all audio frames have been sent.

const ws = new WebSocket(
'ws://localhost:8005/ws/tts'
);
ws.binaryType = 'arraybuffer'; // browsers default to 'blob'
const chunks = [];
ws.onmessage = ({ data }) => {
if (data instanceof ArrayBuffer) {
chunks.push(data);
} else {
const msg = JSON.parse(data);
if (msg.type === 'end') playAll(chunks);
}
};
ws.send(JSON.stringify({
target_text: "Xin chào, tôi là Alard.",
speaker_id: "uuid-speaker",
collection_id: "uuid-collection",
language: "vi"
}));

List Speakers
Retrieve all available TTS speaker voice profiles for a user.
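A request sketch for this listing; the query parameter name 'user_id' is an assumption:

```javascript
// Query parameter name is an assumption -- confirm against the server.
async function listSpeakers(userId = 'default') {
  const res = await fetch(
    `http://localhost:8005/api/speakers?user_id=${encodeURIComponent(userId)}`
  );
  return res.json(); // { speakers: [...] }
}
```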
The user_id parameter defaults to "default"; use default for shared speakers. Each entry includes id, name, reference_audio, and language metadata.

{
"speakers": [
{
"id": "de315d3d-bb11-...",
"name": "Giọng Google Nữ",
"reference_audio": "vi.wav",
"language": "vi",
"user_id": "default"
},
{
"id": "ca8df695-175d-...",
"name": "F5TTS English 2",
"reference_audio": "en.wav",
"language": "en",
"user_id": "default"
}
]
}

List Collections
Retrieve all speaker collections for a user. A collection maps language codes to specific speakers, enabling automatic multilingual TTS routing.
The user_id parameter defaults to "default".

{
"collections": [
{
"id": "637b6576-a49b-...",
"name": "Bộ Sưu Tập 1",
"description": "BST1",
"user_id": "long",
"langs": {
"vi": {"id": "de315d3d-...", "name": "Giọng Google"},
"en": {"id": "ca8df695-...", "name": "F5TTS English 2"},
"fr": {"id": "fd8d4043-...", "name": "F5TTS French"},
"ru": {"id": "f58f853b-...", "name": "F5TTS Russian"}
}
}
]
}

Create Collection
Create a new speaker collection mapping language codes to speaker UUIDs. Only languages with a speaker assigned need to be included in langs.
Use user_id "default" for shared collections.

await fetch('http://localhost:8005/api/collections', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({
name: "My Collection",
user_id: "default",
description: "Multilang bot voice",
langs: {
"vi": "de315d3d-bb11-...",
"en": "ca8df695-175d-...",
"fr": "fd8d4043-665a-..."
}
})
});

Response:

{
"id": "637b6576-a49b-40f2-9172-...",
"name": "My Collection",
"user_id": "default"
}

Update Collection
Update a collection's name, description, or language-to-speaker mappings.
await fetch(
'http://localhost:8005/api/collections/637b...',
{
method: 'PUT',
headers: {'Content-Type':'application/json'},
body: JSON.stringify({
name: "Updated Name",
langs: { "vi": "uuid-vi", "en": "uuid-en" }
})
}
);

Delete Collection
Permanently delete a speaker collection. This cannot be undone.
await fetch(
'http://localhost:8005/api/collections/637b...',
{ method: 'DELETE' }
);

Response: { "status": "deleted" }

Voice Chat (External ASR)
Standalone voice chat server that bridges to an external ASR WebSocket instead of running its own VAD pipeline. Identical protocol to the main server at :8124.
Voice Chat — External ASR Bridge
Same protocol as :8124/ws/voice-chat. Designed for use when you want to route ASR through a separate endpoint (e.g. the VAD streaming endpoint at :8124/ws/transcribe/vad). Set asr_ws_url in the config frame to connect the bridge.
When asr_ws_url is set, audio is forwarded to the external ASR; when omitted, built-in VAD is used.

Defaults:
- asr_ws_url — ws://localhost:8124/ws/transcribe/vad
- tts_ws_url — ws://localhost:8005/ws/tts

All other messages and response events are identical to :8124/ws/voice-chat.
const ws = new WebSocket(
'ws://localhost:8125/ws/voice-chat'
);
ws.onopen = () => {
ws.send(JSON.stringify({
type: 'config',
asr_ws_url:
'ws://localhost:8124/ws/transcribe/vad',
tts_ws_url:
'ws://localhost:8005/ws/tts',
speaker_id: 'uuid-speaker',
collection_id: 'uuid-collection',
language: 'vi',
llm_model: 'gpt-4o-mini'
}));
};