Hey everyone, I’ve been building LLM-based apps recently, and I kept running into the same problems:

- Prompt and model changes weren’t tracked properly
- No clean way to compare experiment results
- Evaluation logic ended up scattered across the codebase
- Hard to reproduce past results

So I built a small open-source project called Modelab for quickly A/B testing LLMs. The idea is simple:

- Version prompt / model experiments
- Run structured evaluations
- Track performance regressions
- Keep experiment logic clean and modular

I’m still shaping the direction, and I’d really value feedback from people building with LLMs:

- What’s missing from current eval workflows?
- What tools are you using instead?
- Would you prefer something event-based or decorator-based?

Repo: https://github.com/elliot736/modelab

Happy to hear thoughts, criticism, or ideas.
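To make the decorator-based vs event-based question concrete, here is a minimal sketch of what a decorator-based workflow could look like. All names here (`experiment`, `RESULTS`) are hypothetical illustrations, not Modelab's actual API:

```python
import functools

# In-memory store standing in for a real experiment-tracking backend.
RESULTS = []

def experiment(name, prompt_version):
    """Hypothetical decorator: records each call's inputs and output
    under a named, versioned experiment so runs can be compared later."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            output = fn(*args, **kwargs)
            RESULTS.append({
                "experiment": name,
                "prompt_version": prompt_version,
                "args": args,
                "output": output,
            })
            return output
        return wrapper
    return decorator

@experiment("summarize", prompt_version="v2")
def summarize(text):
    # Stand-in for an actual LLM call.
    return text[:20]

summarize("A long document about LLM evaluation workflows")
print(RESULTS[0]["experiment"])  # -> summarize
```

An event-based design would instead have callers emit log events explicitly (e.g. `track("summarize", output=...)`), which is more flexible but noisier at call sites; the decorator keeps experiment logic out of the function body, which seems to match the "clean and modular" goal above.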
Originally posted by u/marro7736 on r/ArtificialInteligence
