Sovereign Utility

Voice AI.

MeltyBase provides a low-latency orchestration layer for bidirectional voice interaction. Your agents can hear, think, and speak natively on your infrastructure.

Important: Model Independence

MeltyBase does not offer its own proprietary LLM. We function as a sovereign orchestration hub that bridges your voice data with your choice of intelligence models (OpenAI, Anthropic, or self-hosted local models) via a Bring-Your-Own-Key (BYOK) architecture.

01. The Voice Interaction Hub

Instead of managing complex audio pipelines yourself, you use the MeltyBase VoiceHandlers to stream audio data. The Hub bridges those audio streams with the OpenClaw agent engine.

WebSocket Streaming

Bidirectional, full-duplex streaming over WebSockets for real-time interaction, with sub-200ms latency.

Unified Interface

Standardized Synthesize and Transcribe interfaces allow you to swap providers (ElevenLabs, Whisper, Piper) with a single config change.
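
As an illustration of the single-config swap described above, here is a minimal sketch in Python. The class names, the registry dictionaries, and the config keys ("stt", "tts") are hypothetical and are not the MeltyBase API; the stub providers only stand in for real adapters.

```python
from typing import Callable, Dict, Protocol, Tuple


class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class Synthesizer(Protocol):
    def synthesize(self, text: str) -> bytes: ...


# Hypothetical stand-ins; real adapters would call the actual provider.
class StubWhisper:
    def transcribe(self, audio: bytes) -> str:
        return f"[whisper] {len(audio)} bytes"


class StubElevenLabs:
    def synthesize(self, text: str) -> bytes:
        return b"elevenlabs:" + text.encode("utf-8")


class StubPiper:
    def synthesize(self, text: str) -> bytes:
        return b"piper:" + text.encode("utf-8")


# Registries map a config value to a provider factory (assumed layout).
STT_PROVIDERS: Dict[str, Callable[[], Transcriber]] = {"whisper": StubWhisper}
TTS_PROVIDERS: Dict[str, Callable[[], Synthesizer]] = {
    "elevenlabs": StubElevenLabs,
    "piper": StubPiper,
}


def build_voice_stack(config: Dict[str, str]) -> Tuple[Transcriber, Synthesizer]:
    """Pick the STT and TTS providers named in the config."""
    return STT_PROVIDERS[config["stt"]](), TTS_PROVIDERS[config["tts"]]()


# Swapping the TTS provider is a one-value change in the config.
stt, tts = build_voice_stack({"stt": "whisper", "tts": "piper"})
```

Because calling code depends only on the two interfaces, changing "piper" to "elevenlabs" in the config leaves the rest of the application untouched.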

02. Speech-to-Text (STT)

Transcribe incoming audio streams into actionable intelligence. MeltyBase supports multi-modal perception where agents "hear" and react to vocal cues.

// Streaming Transcription Endpoint
GET /v1/hub/voice/stream
Upgrade: websocket
X-Project-ID: {uuid}

  • Real-time Transcription: Stream audio chunks directly to a background worker for immediate text conversion.
  • Batch Upload: Upload standard audio files (WAV, MP3) for asynchronous auditing.
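
A client that streams to this endpoint typically cuts captured audio into small frames before sending them over the socket. The sketch below shows that framing step, plus the custom header from the endpoint listing. The audio format (16 kHz, 16-bit mono PCM) and the 20 ms frame size are assumptions, not documented MeltyBase requirements, and the function names are hypothetical.

```python
from typing import Dict, Iterator

STREAM_PATH = "/v1/hub/voice/stream"  # path taken from the endpoint listing


def extra_headers(project_id: str) -> Dict[str, str]:
    """Custom header to pass to a WebSocket client library.

    The library itself adds the standard upgrade headers.
    """
    return {"X-Project-ID": project_id}


def pcm_frames(
    audio: bytes,
    sample_rate: int = 16000,  # assumed sample rate
    frame_ms: int = 20,        # assumed frame duration
    sample_width: int = 2,     # bytes per sample (16-bit PCM, assumed)
) -> Iterator[bytes]:
    """Yield fixed-size frames; the final frame may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]


# 50 ms of silence becomes two full 20 ms frames and one 10 ms remainder.
frames = list(pcm_frames(bytes(1600)))
```

Each frame would then be sent as one binary WebSocket message, so transcription can begin before the speaker has finished.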

03. Text-to-Speech (TTS)

Give your agents a native voice. MeltyBase orchestrates high-fidelity synthesis for natural, emotive responses.

Persona Matching

Attach VoiceID metadata to agent formulas to ensure consistent brand identity across vocal interactions.

Streamed Synthesis

Start playing audio as it is being generated, eliminating the "wait time" typically associated with AI speech.
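
The idea behind streamed synthesis can be sketched with a Python generator: audio for the first chunk of text is handed to the player before later chunks have been synthesized. Splitting on sentence boundaries is an assumption made for illustration; the function names are hypothetical, and a real provider may stream at a finer granularity.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def stream_synthesis(
    text: str, synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Yield audio one sentence at a time.

    Each sentence is synthesized only when the consumer asks for it,
    so playback of the first sentence can start immediately.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)


# Stand-in synthesizer that records which sentences were requested.
calls: List[str] = []


def fake_synthesize(sentence: str) -> bytes:
    calls.append(sentence)
    return sentence.encode("utf-8")


stream = stream_synthesis("Hello there. How are you?", fake_synthesize)
first_chunk = next(stream)  # only the first sentence has been synthesized
```

The perceived delay is then bounded by the time to synthesize the first chunk rather than the whole response.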

04. Privacy & PII Masking

In a voice-first world, privacy is the primary constraint. MeltyBase enforces strict data boundaries for vocal data.

  • Transcription Masking: Transcripts are automatically scrubbed of Personally Identifiable Information (PII) before being passed to the intelligence model.
  • In-Memory Processing: Voice data is processed in-memory and never persisted to long-term storage unless explicitly vaulted for audit compliance.
  • Local Execution: Orchestrate STT/TTS on your own hardware to ensure voice prints never leave your perimeter.
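
To make the transcription-masking step concrete, here is a minimal regex-based sketch in Python. The source does not say how MeltyBase detects PII, so the patterns, labels, and function name below are assumptions for illustration only; production masking generally needs far broader coverage than three regular expressions.

```python
import re

# Illustrative patterns only (assumed, not MeltyBase's rule set).
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def mask_transcript(text: str) -> str:
    """Replace each matched span with a bracketed label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


# The masked string, not the raw transcript, is what the model would see.
safe = mask_transcript("Email me at jane.doe@example.com or call 555-123-4567.")
```

Running the scrub before the model call means the raw identifiers never appear in the prompt sent to an external provider.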