Original Reddit post

As the conversation around AI doing knowledge work gets louder, we've been trying to ground it in something more concrete: can LLM agents actually execute regulated, multi-step industrial processes correctly, and not just produce the right answer?

Outcome accuracy and process fidelity are not the same thing. A model that approves a loan without running KYC first is wrong, even if approval was ultimately the correct decision. Most benchmarks only measure the former.

Introducing LOAB

GitHub: https://github.com/shubchat/loab

LOAB is an early attempt to measure both. Each run is scored independently across:

- Tool ordering
- Policy lookups
- Agent handoffs
- Forbidden action avoidance
- Final outcome

This lets us separate "got the answer right" from "followed the regulated process correctly".

Early results

3 origination tasks · 4 runs per model. Even at this small scale, the divergence between outcome accuracy and full-rubric pass rate suggests a major gap between benchmark intelligence and deployable, regulated reliability. There's significant opportunity in optimizing AI workflows so agents can function as compliant, policy-bound operators, not just answer generators.

This is a proof of concept:

- 3 tasks
- One workstream
- Australian lending standards

The intent is to expand across the full lending lifecycle, and eventually into other regulated industries. A paper is in progress. In the meantime, I'd genuinely appreciate feedback or thoughts from the community. Thank you :)

submitted by /u/Bytesfortruth
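To make the outcome-vs-process distinction concrete, here is a minimal sketch of per-dimension rubric scoring in the spirit described above. All class names, field names, and the rubric shape are my own assumptions for illustration; they are not taken from the LOAB repo.

```python
# Sketch: score each run independently on process dimensions plus the
# final outcome, so "right answer" and "right process" can diverge.
# These names are hypothetical, not LOAB's actual API.
from dataclasses import dataclass

@dataclass
class RunTrace:
    tool_calls: list          # ordered tool names the agent invoked
    policies_looked_up: set   # policy documents the agent consulted
    handoffs: list            # agent-to-agent handoffs performed, in order
    final_decision: str       # e.g. "approve" / "decline"

@dataclass
class Rubric:
    required_tool_order: list # tools that must appear in this relative order
    required_policies: set    # policies that must be looked up
    required_handoffs: list   # exact expected handoff sequence
    forbidden_actions: set    # tools the agent must never call
    correct_decision: str     # the ground-truth outcome

def score_run(trace: RunTrace, rubric: Rubric) -> dict:
    """Score one run independently on each rubric dimension."""
    def in_order(required, actual):
        # True if `required` appears as a subsequence of `actual`
        it = iter(actual)
        return all(r in it for r in required)

    return {
        "tool_ordering": in_order(rubric.required_tool_order, trace.tool_calls),
        "policy_lookups": rubric.required_policies <= trace.policies_looked_up,
        "agent_handoffs": trace.handoffs == rubric.required_handoffs,
        "forbidden_avoided": not (rubric.forbidden_actions & set(trace.tool_calls)),
        "final_outcome": trace.final_decision == rubric.correct_decision,
    }

def full_pass(scores: dict) -> bool:
    # Full-rubric pass requires every dimension, not just the outcome
    return all(scores.values())
```

Under this sketch, a run that approves a loan without running KYC scores `final_outcome=True` but `tool_ordering=False`, so it counts toward outcome accuracy yet fails the full rubric — exactly the gap the benchmark is trying to surface.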

Originally posted by u/Bytesfortruth on r/ArtificialInteligence