Original Reddit post

Open-source Swift package running 11 speech models on Apple Silicon via MLX (GPU) and CoreML (Neural Engine). Fully local inference, no cloud dependency. Models implemented:

ASR

  • Qwen3-ASR 0.6B/1.7B (4-bit), Parakeet TDT (CoreML INT4) - RTF ~0.06 on M2 Max

TTS

  • Qwen3-TTS 0.6B (4-bit), CosyVoice3 0.5B (4-bit) - Streaming, ~120ms first chunk

Speech-to-speech

  • PersonaPlex 7B (4-bit) - Full-duplex, RTF ~0.87

VAD

  • Silero v5, Pyannote segmentation-3.0 - Streaming + overlap detection

Diarization

  • Pyannote + WeSpeaker + spectral clustering - Auto speaker count via GMM-BIC

Enhancement

  • DeepFilterNet3 (CoreML) - Real-time 48kHz noise suppression

Alignment

  • Qwen3-ForcedAligner - Non-autoregressive, RTF ~0.018

Key design choice: MLX for large models on GPU, CoreML for small models on Neural Engine. This lets you run VAD on the ANE while ASR runs on the GPU without contention — something WhisperKit struggles with (their Core ML audio encoder blocks the ANE for 300-600ms per call).

All models conform to shared protocols, so you can swap implementations or compose pipelines. Currently working on a MeetingTranscriber pipeline (diarize → per-segment ASR) and streaming real-time diarization.

Roadmap: https://github.com/soniqo/speech-swift/discussions/81

Repo: https://github.com/soniqo/speech-swift
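To make the "shared protocols, compose pipelines" idea concrete, here is a minimal sketch of a diarize-then-transcribe pipeline. All names here (`Diarizer`, `Transcriber`, `Segment`, the `MeetingTranscriber` shape) are assumptions for illustration, not the package's actual API — check the repo for the real protocol definitions.

```swift
import Foundation

// Hypothetical protocol shapes; the real speech-swift API may differ.
struct Segment {
    let speaker: String
    let start: TimeInterval  // seconds
    let end: TimeInterval    // seconds
}

protocol Diarizer {
    // Assigns speaker-labeled time segments to raw audio samples.
    func diarize(_ samples: [Float]) throws -> [Segment]
}

protocol Transcriber {
    // Transcribes a chunk of raw audio samples to text.
    func transcribe(_ samples: [Float]) throws -> String
}

// Compose the two stages: diarize first, then run ASR on each speaker segment.
// Any conforming implementation (e.g. a Pyannote-backed diarizer, a Qwen3-ASR
// transcriber) can be plugged in without changing this pipeline code.
struct MeetingTranscriber {
    let diarizer: any Diarizer
    let asr: any Transcriber
    let sampleRate: Double

    func transcribe(_ samples: [Float]) throws -> [(speaker: String, text: String)] {
        let segments = try diarizer.diarize(samples)
        var result: [(speaker: String, text: String)] = []
        for seg in segments {
            // Convert segment times to sample indices, clamped to the buffer.
            let lo = max(0, Int(seg.start * sampleRate))
            let hi = min(samples.count, Int(seg.end * sampleRate))
            guard lo < hi else { continue }
            let text = try asr.transcribe(Array(samples[lo..<hi]))
            result.append((seg.speaker, text))
        }
        return result
    }
}
```

Because both stages are behind protocols, swapping the ASR backend (say, Parakeet for Qwen3-ASR) is a one-line change at construction time; the pipeline itself never references a concrete model.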

Originally posted by u/ivan_digital on r/ArtificialInteligence