Original Reddit post

Hi all, I’m building a system that takes a circuit image (breadboard photo or schematic) and answers questions about it. I’m looking for practical, implementation-focused advice (not just paper links).

Goal

  • Input: image + question
  • Output: a generated explanation (not just labels)

Example:

  • Q: “What is this circuit?”
  • A: “LED flasher using transistor… (how it works, current flow, etc.)”

What I plan to use

  • VLM: BLIP-2 or LLaVA (for image + question understanding)
  • LLM: any good text model for explanation
  • Python + HuggingFace + PyTorch
  • Simple UI (Streamlit)
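
For the VLM piece, a minimal zero-shot query with BLIP-2 through HuggingFace transformers could look like the sketch below. The checkpoint (Salesforce/blip2-opt-2.7b) and the image path are placeholder assumptions, not recommendations:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Checkpoint is an assumption; any BLIP-2 variant on the Hub loads the same way.
model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("circuit.jpg")  # placeholder input image
prompt = "Question: What is this circuit? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

out = model.generate(**inputs, max_new_tokens=120)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```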

My current pipeline idea

Image → VLM (extract components + description) → LLM (generate explanation) → output
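
As a concrete sketch of the second stage of that chain: the VLM’s description gets handed to a text LLM with an explanation-oriented prompt. The instruct model and the prompt wording here are assumptions, not tested choices:

```python
from transformers import pipeline

# Stage 2: turn the VLM's raw description into an explanation.
# Model choice is an assumption; any local instruct/chat LLM works.
llm = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

def explain(vlm_description: str, question: str) -> str:
    # vlm_description would come from the BLIP-2/LLaVA call in stage 1.
    prompt = (
        "You are an electronics tutor. A vision model described this circuit as:\n"
        f"{vlm_description}\n\n"
        f"Question: {question}\n"
        "Explain what the circuit is and how current flows through it."
    )
    out = llm(prompt, max_new_tokens=300, return_full_text=False)
    return out[0]["generated_text"]
```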

What I need help with

Best architecture:

  • Direct VLM answer vs VLM → LLM chain — which works better in practice?

Circuit-specific understanding:

  • Any datasets or tricks for diagrams/breadboards?
  • Is something like CircuitVQA worth using?

Fine-tuning vs prompt-only:

  • Is LoRA/QLoRA worth it here, or can I stay zero-shot?

Detection + reasoning:

  • Should I add a detector (YOLO/Detectron) for components before the VLM? (See the detector sketch after this list.)

Evaluation:

  • How do you evaluate answers for VQA-style systems beyond BLEU/F1?

Minimal working stack:

  • If you had to build an MVP in 2–3 days, what exact stack would you pick?
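
On the detection question, the detector-first variant usually looks something like the sketch below (Ultralytics YOLO assumed). The stock yolov8n.pt weights only know COCO classes, so a real run would need a model fine-tuned on circuit-component classes first; the structured detections would then be injected into the VLM/LLM prompt:

```python
from ultralytics import YOLO

# Stock COCO weights are a placeholder: a usable detector would be
# fine-tuned on circuit-component classes (resistor, LED, transistor, ...).
detector = YOLO("yolov8n.pt")
results = detector("breadboard.jpg")  # placeholder image path

components = []
for box in results[0].boxes:
    name = results[0].names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    components.append(f"{name} at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f})")

# This list would be prepended to the VLM prompt as extra context.
print("Detected components:", "; ".join(components))
```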

Constraints

  • Prefer open models / local or free options
  • Focus on generative output (explanations), not just classification

If you’ve built something similar or have pointers (repos, configs, pitfalls), I’d really appreciate it. Thanks!

Originally posted by u/vishal55282 on r/ArtificialInteligence