I’ve been experimenting with a different inference architecture for GGUF models. DoE is a single-C-file runtime that wraps any GGUF model in a dynamic "parliament" of LoRA experts that vote and adapt during inference.

Compile:

```
cc doe.c -O3 -lm -lpthread -o doe
```

Run:

```
./doe --model model.gguf --serve 8080
```

Features:

- works with existing GGUF models (Llama, Qwen, Mistral, SmolLM)
- weights are mmap’ed read-only
- LoRA experts operate on top of the base model
- experts vote per token to determine the final residual update
- experts can spawn or disappear during inference based on usage
- simple gradient-free weight adaptation during generation

Other details:

- ~3184 LOC in a single C file
- no runtime dependencies
- auto-detects the tokenizer and chat template
- built-in HTTP chat server
- optional CUDA / BLAS acceleration

repo: https://github.com/ariannamethod/doe

arch: https://github.com/ariannamethod/doe/blob/main/docs/doe///_architecture.md
Originally posted by u/ataeff on r/ArtificialInteligence
