I’ve now either built or audited four AI systems for legal/compliance work. Different firms, different jurisdictions, different stacks. The failure modes when these systems break in production are weirdly consistent, almost to the point where I can predict which one will hit before I see the system. Writing this up because I think it’s useful for anyone building in this space, and also because I keep getting asked the same questions and I’d rather link to one place than answer them piecemeal.

Failure mode one: the system treats all sources as equally credible. Already wrote this up separately so I won’t repeat it in detail. Short version: a legal corpus is a hierarchy, not a flat set of documents. If your retrieval doesn’t encode the hierarchy, your system will confidently surface a commentary article over a binding court ruling on close calls, and the senior lawyer will clock the failure on day one and never use the system again. The fix is metadata-based authority weighting at the chunking and re-ranking layers (first sketch below).

Failure mode two: the system has no opinion when sources disagree. This one is subtler and arguably more dangerous. Real legal questions often have two or more defensible answers depending on which court you’re in or which interpretation prevails. A naive RAG system either picks one answer more or less at random, based on which chunk happened to rank higher in retrieval, or it tries to synthesize them into a single answer that doesn’t actually exist in the law. Both failures destroy trust. The lawyer reads the answer, knows there are two positions, and either sees that the system picked the wrong one or sees a synthesized answer that no court has ever held. Either way the lawyer learns the system can’t be trusted with any question that has nuance, which is most of them.

What to build instead: a disagreement-detection step that runs after retrieval and before generation (second sketch below). If the top retrieved chunks contain materially different positions, the system should explicitly surface that fact. “Two positions exist on this question. The Federal Court of Justice held X. The Munich Higher Regional Court has gone the other way in the Y line of cases. Here is the analysis on each.” That output is genuinely useful to a lawyer because it matches how they actually think. A confident single answer that papers over the disagreement is worse than no answer at all.

Failure mode three: the system has no way to learn the firm’s interpretation. Every law firm and compliance team has internal positions that aren’t in any public source. “We always read this clause to mean X.” “Last year we got a regulator question on this and the answer that worked was Y.” “Partner Z disagrees with the consensus reading of this regulation and his read has been more accurate in our practice.” This knowledge lives in three people’s heads and partially in old emails, and it never makes it into a public corpus. A system that only retrieves from public sources is missing 30 to 60 percent of the actual reasoning the firm uses. So the system gives generic answers and the firm keeps doing the real work in their heads. Adoption stalls within a month because the senior lawyers correctly clock that the system is just a faster version of a public legal database, and they already have those.

What to build instead: an annotation layer where senior lawyers can flag a source with the firm’s interpretation, override generic answers with firm-specific guidance, and build up institutional reasoning over time (third sketch below).
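Here’s a minimal sketch of what the failure-mode-one fix can look like, in Python. The source-type tiers, the weights, and the blend parameter are all illustrative assumptions, not a real taxonomy for any jurisdiction; the point is only that authority enters the ranking as explicit metadata instead of being left to embedding similarity.

```python
# Sketch: authority-weighted re-ranking of retrieved chunks.
# Tier names and weights are placeholders, not a real legal taxonomy.
from dataclasses import dataclass, field

AUTHORITY_WEIGHTS = {
    "statute": 1.00,
    "supreme_court": 0.95,
    "appellate_court": 0.85,
    "lower_court": 0.70,
    "regulator_guidance": 0.60,
    "commentary": 0.40,
    "blog": 0.20,
}

@dataclass
class Chunk:
    text: str
    source_type: str            # one of the keys above, carried as metadata
    similarity: float           # retriever score, assumed to be in [0, 1]
    metadata: dict = field(default_factory=dict)

def rerank(chunks: list[Chunk], authority_blend: float = 0.4) -> list[Chunk]:
    """Blend retrieval similarity with source authority.

    authority_blend controls how much the hierarchy matters; at 0.0 this
    degenerates back to the flat ranking that fails on close calls.
    """
    def score(c: Chunk) -> float:
        authority = AUTHORITY_WEIGHTS.get(c.source_type, 0.3)
        return (1 - authority_blend) * c.similarity + authority_blend * authority
    return sorted(chunks, key=score, reverse=True)

if __name__ == "__main__":
    candidates = [
        Chunk("Commentary article arguing the clause is valid...", "commentary", similarity=0.91),
        Chunk("Binding appellate ruling holding the clause invalid...", "appellate_court", similarity=0.88),
    ]
    for c in rerank(candidates):
        print(c.source_type, "->", c.text[:45])
```

With these numbers the binding ruling outranks the commentary even though the commentary scored slightly higher on raw similarity, which is exactly the close-call case that burns trust.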
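Second sketch, for the disagreement-detection step. The classify_stance function here is a keyword stub purely so the example runs on its own; in a real system that classification is the hard part and would typically be a model call with a constrained output, so treat this as structure rather than implementation.

```python
# Sketch: detect disagreement among top chunks after retrieval,
# then route to a "present both positions" prompt instead of a single answer.
from collections import defaultdict

def classify_stance(chunk_text: str) -> str:
    # Placeholder heuristic. A real system would ask a model which position
    # the source takes on the question and normalize the answer to a label.
    if "not enforceable" in chunk_text:
        return "position_b"
    return "position_a"

def detect_disagreement(chunks: list[dict]) -> dict:
    """Group top chunks by stance; more than one surviving stance means the
    generator must present the positions separately, not synthesize them."""
    by_stance = defaultdict(list)
    for c in chunks:
        by_stance[classify_stance(c["text"])].append(c)
    mode = "present_both" if len(by_stance) > 1 else "single_answer"
    return {"mode": mode, "positions": dict(by_stance)}

def build_prompt(question: str, routing: dict) -> str:
    if routing["mode"] == "present_both":
        header = ("Two or more positions exist on this question. Present each "
                  "position separately, name the court or source behind it, "
                  "and do not merge them into a single holding.\n")
    else:
        header = "Answer the question from the sources below.\n"
    sources = "\n".join(
        c["text"] for group in routing["positions"].values() for c in group
    )
    return f"{header}\nQuestion: {question}\n\nSources:\n{sources}"

if __name__ == "__main__":
    chunks = [
        {"text": "The Federal Court of Justice held the clause enforceable...", "court": "BGH"},
        {"text": "The Munich Higher Regional Court held the clause not enforceable...", "court": "OLG München"},
    ]
    print(build_prompt("Is the clause enforceable?", detect_disagreement(chunks)))
```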
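Third sketch, for the annotation layer. The schema, field names, and the use of sqlite are all placeholder choices; the part that matters is that a firm interpretation attaches to a specific source and gets pulled into the context whenever that source is retrieved, so the generator always sees the firm’s reading next to the public text.

```python
# Sketch: firm-specific annotations stored alongside the public corpus
# and joined in at retrieval time. Schema and names are illustrative.
import sqlite3
from datetime import date

SCHEMA = """
CREATE TABLE IF NOT EXISTS annotations (
    id        INTEGER PRIMARY KEY,
    source_id TEXT,   -- the public source the annotation attaches to
    author    TEXT,   -- which senior lawyer added it
    added_on  TEXT,
    kind      TEXT,   -- e.g. 'interpretation', 'override', 'war_story'
    body      TEXT
);
"""

def add_annotation(db, source_id, author, kind, body):
    db.execute(
        "INSERT INTO annotations (source_id, author, added_on, kind, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (source_id, author, date.today().isoformat(), kind, body),
    )
    db.commit()

def annotations_for(db, source_ids):
    """Fetch firm annotations for every public chunk that was retrieved."""
    placeholders = ",".join("?" for _ in source_ids)
    return db.execute(
        f"SELECT source_id, author, kind, body FROM annotations "
        f"WHERE source_id IN ({placeholders})",
        list(source_ids),
    ).fetchall()

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    add_annotation(db, "reg_x_art_6", "partner_z", "interpretation",
                   "We read this narrowly; see the 2023 regulator exchange.")
    retrieved_ids = ["reg_x_art_6", "reg_x_art_28"]   # ids of the public chunks just retrieved
    for row in annotations_for(db, retrieved_ids):
        print(row)
```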
The annotation layer is the thing that separates a tool from a piece of the firm’s actual decision-making infrastructure. It’s also the thing that compounds in value: every interpretation a senior lawyer adds today is worth more next year because it’s available to every junior associate forever.

The pattern across all three: naive legal RAG fails because the legal domain isn’t a corpus, it’s a hierarchy of trust with disagreements and firm-specific overlays on top. Any system that treats the corpus as flat will pass the demo and fail in real use. Systems that explicitly model hierarchy, disagreement, and firm-specific interpretation tend to stick.

If you’re building one of these or evaluating someone else’s, the test I’d run is simple: hand it three queries that you know have nuanced answers in your firm’s practice and watch what it does. If it returns confident single answers without surfacing the nuance, the system isn’t ready. If it surfaces the disagreement and the firm’s prior position on it, you have something worth deploying.
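That test is easy to wire up as a tiny harness once the disagreement-detection step exists. answer() and the returned "mode" field are assumptions about your own pipeline’s interface (the second sketch above returns such a mode), and the queries are placeholders; the real value comes from using questions you already know are contested in your practice.

```python
# Sketch: acceptance test with known-nuanced queries. Replace the placeholder
# queries with questions that genuinely have two defensible answers in your firm.
NUANCED_QUERIES = [
    "Is a non-compete of this length enforceable in our jurisdiction?",
    "Does this processing activity require an impact assessment?",
    "Can we rely on legitimate interest for this data transfer?",
]

def acceptance_test(answer) -> bool:
    """answer(query) is assumed to return {"text": ..., "mode": ...}, where
    mode says whether competing positions were surfaced."""
    ready = True
    for q in NUANCED_QUERIES:
        result = answer(q)
        if result.get("mode") != "present_both":
            print(f"NOT READY: confident single answer on a nuanced question: {q}")
            ready = False
    return ready
```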
Originally posted by u/Fabulous-Pea-5366 on r/ArtificialInteligence
