Running inference at around 2,000 requests per second. We added a gateway for provider abstraction, and it's adding 30-40ms of latency per request. We're using this for real-time ML serving, where every millisecond compounds: 40ms gateway + 200ms model inference and users start noticing lag.

Tried the usual optimizations - async, connection pooling, multiple workers. They helped but didn't solve it. The bottleneck seems to be Python's concurrency model at this scale.

Looked at alternatives: a custom Nginx setup (too much manual config) and Portkey (seems enterprise-focused and pricey). We ended up trying Bifrost (Go-based and open source). Gateway overhead dropped to under 100 microseconds. Still early, but performance is solid.

Has anyone scaled a Python-based gateway past 2k RPS without hitting this wall? Or did you end up switching runtimes? What are high-throughput shops using for LLM routing?
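For anyone wanting to reproduce the comparison, here's a rough stdlib-only sketch (function names are placeholders, not from our actual gateway) that isolates how much of the per-request cost is pure asyncio scheduling, as opposed to upstream I/O:

```python
import asyncio
import time

async def handle(request_id: int) -> int:
    # Stand-in for the gateway's per-request work (routing, header rewrite).
    # A real handler would also await an upstream HTTP call here.
    return request_id

async def measure(n: int) -> float:
    """Mean event-loop dispatch overhead per request, in microseconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handle(i) for i in range(n)))
    return (time.perf_counter() - start) / n * 1e6

if __name__ == "__main__":
    per_req_us = asyncio.run(measure(2000))
    print(f"{per_req_us:.1f} us/request of pure scheduling overhead")
```

On our boxes the scheduling overhead alone was tiny compared to the 30-40ms we saw end to end, which is what pushed us to suspect the GIL plus per-request serialization/copying rather than the event loop itself.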
Originally posted by u/llamacoded on r/ArtificialInteligence
