Original Reddit post

Disclosure: I’m the solo developer of AllChat, an iOS app that queries multiple AI models at once and combines their responses. I wanted to break down the technical side of this because I haven’t really seen anyone talk about what it’s actually like to ship multi-model orchestration in a consumer app. There are some genuinely weird tradeoffs.

The Problem

Every AI chat app is basically a black box. You ask ChatGPT something, you get one answer, and you have no idea whether it’s confident or just making stuff up. I got tired of copy-pasting the same question into three different apps to cross-check answers. It felt like the software should be doing that, not me.

How It Works

When someone sends a message in Consensus Mode, the backend fans out the request to multiple models in parallel: the same prompt and the same conversation history go to each one. They all come back independently, so I’m not sitting around waiting on whichever model is being slow that day.

Once all the responses are in, a synthesis step takes everything and does something closer to meta-analysis than just picking a winner. It finds where the models agreed, where they flatly contradicted each other, and which claims had real sources behind them. Then it produces one final answer from all of that.

The transparency layer is honestly where I spent the most time. The synthesis emits structured metadata, and I parse that into a UI where you can tap into “Points of Agreement,” “Contradictions,” “Sources,” all that. You get one clean answer up front, but if you want to see why it said what it said, that’s all one tap away.

Tradeoffs

Cost is the thing that will kill you. Consensus means N model calls plus a synthesis call for every single message. Run 4 models and you’re looking at roughly 5x the cost of a normal single-model query. That one fact drove basically every other decision I made. I went with daily usage caps instead of lifetime ones because they create better habit loops without letting costs run away.
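The parallel fan-out described under “How It Works” can be sketched roughly like this. This is a minimal illustration, not the app’s actual code: the function names, the timeout value, and the stubbed provider call are all assumptions.

```python
import asyncio

# Illustrative timeout; the slowest surviving model bounds total latency.
PER_MODEL_TIMEOUT = 30.0

async def query_model(model: str, prompt: str) -> dict:
    """Stand-in for a real provider API call (network I/O elided)."""
    await asyncio.sleep(0)  # placeholder for the actual HTTP request
    return {"model": model, "text": f"{model} answer to: {prompt}"}

async def fan_out(models: list[str], prompt: str) -> list[dict]:
    """Send the same prompt to every model in parallel.

    Wall time is roughly the slowest single model, not the sum,
    because all requests are in flight at once.
    """
    tasks = [
        asyncio.wait_for(query_model(m, prompt), PER_MODEL_TIMEOUT)
        for m in models
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Drop models that errored or timed out instead of failing the whole turn.
    return [r for r in results if isinstance(r, dict)]
```

With `return_exceptions=True`, one slow or broken provider degrades the consensus rather than sinking the whole message, which matches the “not waiting on whichever model is slow that day” goal.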
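The ~5x figure follows directly from the call structure: N fan-out calls plus one synthesis call. A back-of-envelope version, with illustrative per-token prices (the post gives no real pricing):

```python
def consensus_cost(n_models: int, tokens_per_call: int,
                   price_per_1k: float, synth_price_per_1k: float) -> float:
    """Cost of one consensus message: N parallel calls plus one synthesis call."""
    fan_out = n_models * tokens_per_call * price_per_1k / 1000
    synthesis = tokens_per_call * synth_price_per_1k / 1000
    return fan_out + synthesis
```

With 4 models and a synthesis model at the same price, each message costs 5x a single-model query; using a cheaper model for the merge step (as the post describes) pulls that multiplier down.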
Had to be smart about which model does the synthesis too, so I’m not burning flagship-tier tokens just on the merge step. And I’m generally aggressive about keeping context sizes down so token counts don’t balloon.

Latency is actually less bad than you’d think, since everything goes out in parallel. You’re only as slow as your slowest model, not the sum of all of them. The synthesis step is sequential, though, so that adds time. I got around it on the UX side by streaming the final answer to the user while the transparency metadata is still loading behind it. You see your answer quickly; the receipts show up a second later.

Dealing with different context windows across providers is just annoying. Every model has different limits, handles system prompts differently, and does tool calls differently. I keep one canonical conversation format on my end and convert it per provider when the request goes out. If someone switches models mid-conversation, I have to condense everything down to a provider-neutral format, which is its own headache.

Something genuinely surprised me, though. I assumed early on that if all 4 models agreed on something, that answer was probably solid. Wrong. Models share training data, so they’ll all confidently agree on the same hallucination. It turns out the contradiction signal is way more useful than the agreement signal: when one model disagrees with the other three, that’s where you should pay attention.

What I Learned

The transparency thing changes how people interact with AI more than I expected. When you can see that three models agreed and one didn’t, you stop just accepting or rejecting the whole answer. You start looking at individual claims. That was the whole point, and it actually works.

The other big takeaway is that consensus quality is about model diversity, not model count. Throwing in a fourth model that’s basically the same architecture as the other three doesn’t really help.
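The canonical-format-plus-per-provider-conversion idea can be sketched as below. The message shape and the two adapter styles are assumptions for illustration; they mirror the common split between providers that accept a system role inline in the message list and those that take the system prompt as a separate parameter.

```python
def to_inline_system_style(messages: list[dict]) -> list[dict]:
    """Adapter for providers that accept a system role inside the message list."""
    return [dict(m) for m in messages]

def to_separate_system_style(messages: list[dict]) -> tuple[str, list[dict]]:
    """Adapter for providers that take the system prompt as its own parameter:
    split it out and send only user/assistant turns in the message list."""
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    turns = [dict(m) for m in messages if m["role"] != "system"]
    return system, turns

# Hypothetical canonical conversation kept on the backend.
CANONICAL = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "What is consensus?"},
]
```

Keeping one canonical history and converting at request time means a mid-conversation model switch only needs one condensation pass, not a per-pair translation.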
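The dissent signal described above (one model breaking from the other three) reduces to a majority-vote check once claims are comparable. This toy sketch assumes claims are already normalized to strings; a real version would need semantic comparison, which the post doesn’t detail.

```python
from collections import Counter

def find_dissenters(claims_by_model: dict[str, str]) -> list[str]:
    """Return models whose claim differs from the majority claim.

    A lone dissenter is the interesting case: shared training data
    means unanimous agreement can still be a shared hallucination.
    """
    counts = Counter(claims_by_model.values())
    majority_claim, _ = counts.most_common(1)[0]
    return [m for m, claim in claims_by_model.items() if claim != majority_claim]
```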
But adding one that was trained on different data or reasons differently actually improves things noticeably. The mix matters way more than the number.

Built the whole thing solo. iOS, backend, all of it. Happy to go deeper on any part of this.

[Screenshot: Individual model responses before synthesis. You can expand each one to see what it actually said.]

[Screenshot: Agreement analysis. The ring shows overall model consensus, then lists each specific point where all models aligned.]

Originally posted by u/Ryn8tr on r/ArtificialInteligence