Hi all, I’m building a system that takes a circuit image (breadboard/schematic) and answers questions about it. I’m looking for practical, implementation-focused advice (not just paper links).

Goal
- Input: image + question
- Output: generated explanation (not just labels)

Example:
- Q: “What is this circuit?”
- A: “LED flasher using transistor… (how it works, current flow, etc.)”
What I plan to use
- VLM: BLIP-2 or LLaVA (for image + question understanding)
- LLM: any good text model for explanation
- Python + HuggingFace + PyTorch
- Simple UI (Streamlit)
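
For the UI, roughly this skeleton (untested; `answer_question` is a placeholder for whatever pipeline ends up behind it):

```python
import streamlit as st
from PIL import Image

st.title("Circuit Q&A")

uploaded = st.file_uploader("Upload a circuit image", type=["png", "jpg", "jpeg"])
question = st.text_input("Question", value="What is this circuit?")

if uploaded and question:
    image = Image.open(uploaded)
    st.image(image, use_container_width=True)
    # answer_question(image, question) -> str is the part I'm asking about below
    st.write(answer_question(image, question))
```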
My current pipeline idea
Image → VLM (extract components + description) → LLM (generate explanation) → output
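
To make that concrete, here is the rough chain I'd start from. Untested sketch; the model IDs (llava-hf/llava-1.5-7b-hf, mistralai/Mistral-7B-Instruct-v0.2) are just my first guesses, not recommendations:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: VLM turns the image into a component list / rough description.
vlm_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(vlm_id)
vlm = LlavaForConditionalGeneration.from_pretrained(
    vlm_id, torch_dtype=torch.float16
).to(device)

image = Image.open("circuit.jpg")
vlm_prompt = (
    "USER: <image>\nList every component you can see "
    "and how they are connected. ASSISTANT:"
)
inputs = processor(images=image, text=vlm_prompt, return_tensors="pt").to(device)
out = vlm.generate(**inputs, max_new_tokens=256)
description = processor.batch_decode(out, skip_special_tokens=True)[0]

# Stage 2: plain text LLM writes the actual explanation from that description.
llm = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
question = "What is this circuit?"
answer = llm(
    f"Circuit description:\n{description}\n\n"
    f"Question: {question}\nExplain how the circuit works:",
    max_new_tokens=300,
)[0]["generated_text"]
print(answer)
```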
What I need help with

Best architecture:
- Direct VLM answer vs VLM → LLM chain — which works better in practice?

Circuit-specific understanding:
- Any datasets or tricks for diagrams/breadboards?
- Is something like CircuitVQA worth using?

Fine-tuning vs prompt-only:
- Is LoRA/QLoRA worth it here, or can I stay zero-shot?

Detection + reasoning:
- Should I add a detector (YOLO/Detectron) for components before the VLM? (Rough sketch of what I mean below this list.)

Evaluation:
- How do you evaluate answers for VQA-style systems beyond BLEU/F1? (The kind of baseline I have in mind is sketched below.)

Minimal working stack:
- If you had to build an MVP in 2–3 days, what exact stack would you pick?
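
On the detection question, this is what I mean by a detector stage in front of the VLM. Pure sketch: as far as I know there is no off-the-shelf circuit-component YOLO checkpoint, so `circuit_yolo.pt` stands for hypothetical weights you'd have to train yourself on an annotated component dataset:

```python
from ultralytics import YOLO

detector = YOLO("circuit_yolo.pt")  # hypothetical fine-tuned weights
results = detector("circuit.jpg")[0]

hints = []
for box in results.boxes:
    label = results.names[int(box.cls)]        # e.g. "resistor", "LED"
    x1, y1, x2, y2 = box.xyxy[0].tolist()      # pixel coordinates
    hints.append(f"{label} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")

# These hints would then be prepended to the VLM/LLM prompt:
prompt_prefix = "Detected components:\n" + "\n".join(hints)
```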
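And on evaluation, this is the kind of semantic-similarity baseline I mean by "beyond BLEU/F1" (using the bert-score package, assuming I'm reading its API right). I'm mostly asking what people layer on top of this, e.g. LLM-as-judge or a factual rubric:

```python
from bert_score import score

candidates = ["An LED flasher built around a transistor..."]       # model outputs
references = ["Astable transistor circuit that flashes an LED..."]  # gold answers

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```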
Constraints
- Prefer open models / local or free options
- Focus on generative output (explanations), not just classification
If you’ve built something similar or have pointers (repos, configs, pitfalls), I’d really appreciate it. Thanks!
Originally posted by u/vishal55282 on r/ArtificialInteligence
