I have an RTX Pro 6000 Blackwell Max-Q that comfortably fits gpt-oss-120b with plenty of VRAM to spare. This model has also proven to be a local powerhouse for vibecoding with Claude Code, and I want to extend its capabilities further.

Currently it's running on Ollama with num_parallel set to 1, so one agent at a time. That's fine for a lot of reasons, since I run a lot of apps simultaneously that use that exact same model and configuration in Ollama, and keeping it to one instance prevents VRAM/RAM blowups and model reloads.

The thing with Agent Teams is that they're supposed to run in parallel, and running them one agent at a time is... less than ideal. It can get the job done, but holy slowdown, it's slow af.

I understand vLLM has a few tricks up its sleeve that let it serve parallel requests on the same GPU with little to no side effects, but I've never messed around with it before. I'm willing to try if that's the case, but I'll have to set it up in WSL2 first. How much can I get away with if I go the vLLM route for parallel agents on a 96GB GPU?
Originally posted by u/swagonflyyyy on r/ClaudeCode
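For context on what "the vLLM route" might look like: a minimal launch sketch for gpt-oss-120b under WSL2. The flag values below are starting-point assumptions for a 96GB card, not tested numbers for this exact setup, and the model ID (`openai/gpt-oss-120b`) is the Hugging Face name vLLM would pull by default.

```shell
# Minimal sketch, not a tested config (values are assumptions to tune):
#   --gpu-memory-utilization 0.90 -> leave ~10% VRAM headroom for other apps
#   --max-model-len 32768         -> cap context so KV cache fits alongside weights
#   --max-num-seqs 8              -> let vLLM batch up to 8 concurrent agent requests
pip install vllm
vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --max-num-seqs 8
```

This exposes an OpenAI-compatible endpoint on localhost:8000. Continuous batching is the trick the post alludes to: concurrent requests share the loaded weights and get interleaved at the token level, so parallel agents mostly cost extra KV-cache memory rather than a second copy of the model.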
