I built an MCP server called Paper Lantern that gives Claude Code access to 2M+ CS research papers. For each query it searches full-text papers and returns a synthesis: what methods exist, their tradeoffs, benchmarks, and how to implement them. I wanted to see whether it actually changes what Claude Code does on a real task, so I ran a controlled experiment with Karpathy's autoresearch framework (an agent tries 100 ML training ideas overnight and keeps what works).

Setup: two identical runs. Same Claude Code agent, same GPU, same ~7M-param model. The only difference: one had Paper Lantern connected as an MCP tool.

Without Paper Lantern: Claude explored the standard playbook (batch size tuning, weight decay, gradient clipping). 3.67% improvement over baseline.

With Paper Lantern: Claude queried the server before each new idea. 520 papers considered, 100 cited, 25 directly tried. 4.05% improvement over baseline.

The interesting part was the qualitative difference. Both runs tried halving the batch size. Without PL, Claude didn't know to adjust the learning rate, and the experiment failed. With PL, Claude asked "what research exists on batch scaling for short runs?", found the sqrt scaling rule from a 2022 paper, implemented it, and won on the first try. Same intuition, different knowledge, different outcome.

The real test: the best config from each run was trained for 2 hours. PL config: 0.4475 val_bpb. No-PL config: 0.4624 val_bpb. That's 3.2% better, with the gap still widening.

Not every paper idea worked (DyT and SeeDNorm were incompatible with the architecture), but the ones that did were unreachable without the research access.

Full writeup with all 15 paper citations: https://www.paperlantern.ai/blog/auto-research-case-study

Paper Lantern works with Claude Code, Cursor, Copilot, Claude.ai, ChatGPT, and any MCP client: https://code.paperlantern.ai/
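For the batch-size episode above: the sqrt scaling rule says the learning rate should scale with the square root of the batch size, so halving the batch means dividing the LR by √2. A minimal sketch of that adjustment (the base LR and batch sizes here are made-up illustration values, not the actual configs from the runs):

```python
import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Sqrt scaling rule: learning rate proportional to sqrt(batch size)."""
    return base_lr * math.sqrt(new_batch / base_batch)

# Hypothetical values: halving the batch from 64 to 32 divides the LR by sqrt(2)
new_lr = scale_lr(base_lr=6e-4, base_batch=64, new_batch=32)
print(f"{new_lr:.2e}")  # prints 4.24e-04
```

Without this adjustment, halving the batch while keeping the LR fixed effectively doubles the noise-to-step ratio, which matches the failure the no-PL run hit.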
Originally posted by u/kalpitdixit on r/ClaudeCode
