If your team is looking to migrate enterprise AI workloads away from expensive, black-box APIs toward secure, self-hosted open-source infrastructure, optimizing the inference stack is the first real hurdle you’ll hit.

Cedric Clyburn and Andrew Ng just put together a hands-on short course on the DeepLearning.AI platform. It breaks down vLLM and provides copyable code examples throughout. Instead of treating the inference server like an abstract system, it directly targets the memory and hardware realities that dictate production scaling:

- KV cache bottleneck: Visualizing exactly why autoregressive decoding scales poorly on VRAM bandwidth and how virtual block allocation abstracts that away to save your compute budget.
- Post-training compression: Hands-on labs using LLM Compressor to implement FP8 dynamic quantization without wrecking your model’s accuracy.
- Production benchmarking: Profiling your models to map out latency vs. RPS (requests per second) curves so you can actually predict infrastructure costs.

If you are trying to scale local models within private enterprise boundaries and need a clean, open-source recipe for optimization pipelines, it’s short, practical, and I highly recommend it: https://www.deeplearning.ai/courses/fast-and-efficient-llm-inference-with-vllm

Disclosure: I work at Red Hat on the vLLM community side and am the original creator of LLM Compressor, so I’m clearly not a neutral party here. But the engineering focus is real, the content is great, and Cedric knows his stuff.
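To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch in plain Python. It is not vLLM's actual allocator, just simplified accounting: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values), the block size of 16 tokens, the 2048-token maximum, and all function names are assumptions picked for illustration. It compares reserving the full maximum length per request against reserving only whole blocks as tokens arrive.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_reserved_tokens(seq_lens, max_len):
    """Token slots reserved if every request pre-allocates max_len slots."""
    return len(seq_lens) * max_len


def paged_reserved_tokens(seq_lens, block_size=16):
    """Token slots reserved if each request only holds whole blocks.

    Waste is limited to the unused tail of each request's last block.
    """
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Hypothetical model shape and request mix (assumptions, not measurements).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
seq_lens = [100, 900, 37]  # current lengths of three in-flight requests

contiguous = contiguous_reserved_tokens(seq_lens, max_len=2048) * per_token
paged = paged_reserved_tokens(seq_lens, block_size=16) * per_token

print(f"KV bytes per token:      {per_token}")
print(f"Contiguous reservation:  {contiguous / 2**20:.0f} MiB")
print(f"Block-based reservation: {paged / 2**20:.0f} MiB")
```

The gap between the two totals is memory that could otherwise hold more concurrent requests. That is the intuition behind block-based allocation, though a real server also has to track block tables, sharing, and eviction.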
Originally posted by u/markurtz on r/ArtificialInteligence
