Most agent stacks are still optimized for capability demos, not operational accountability. In practice, that means we can often get useful outputs but struggle to answer critical production questions:

- What exactly did the system do?
- Why did it choose that path?
- Can we reproduce this result reliably?
- Which controls existed before execution (not just logs after the fact)?

My work on ORCA explores a different design point: treat agent behavior as a structured execution system, not only prompt-time composition.

Core idea (a minimal code sketch follows at the end of this post):

- Explicit step boundaries
- Typed input/output contracts
- Deterministic control flow where required
- Policy-gated execution for high-risk actions
- Full execution traceability for replay and audit

This is not anti-LLM. It is about separating two modes (also sketched below):

- Discovery mode: flexible, emergent, exploratory
- Production mode: promoted, validated, governed capabilities

I see this as a practical bridge between prompt-native experimentation and deployable systems in sensitive domains (security, infra, regulated workflows).

References:

- SSRN paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6600840
- Zenodo artifact: https://zenodo.org/records/19438943
- Repository: https://github.com/gfernandf/agent-skills

I would value feedback from people running real agent workloads:

- How are you handling pre-execution controls vs. post-execution observability?
- Where do you draw the boundary between adaptive orchestration and deterministic guarantees?
- What failure mode appears first in production: drift, cost, safety, or unreproducibility?
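To make the core idea concrete, here is a stripped-down sketch of the shape I mean. All names (`StepSpec`, `policy_gate`, `run_pipeline`, `trace_digest`) are illustrative, not ORCA's actual interfaces: explicit step boundaries, a pre-execution policy gate for high-risk steps, and a trace record you can hash for replay comparison.

```python
# Illustrative sketch only -- not ORCA's real API. Shows explicit step
# boundaries, a pre-execution policy gate, and an auditable trace.
from dataclasses import dataclass, asdict
from typing import Callable
import hashlib
import json
import time


@dataclass(frozen=True)
class StepSpec:
    """A step with an explicit boundary and a declared risk level."""
    name: str
    risk: str  # e.g. "low" | "high"
    fn: Callable[[dict], dict]  # typed input/output contract (dict -> dict)


@dataclass
class TraceEvent:
    step: str
    inputs: dict
    outputs: dict
    decision: str  # "allowed" | "denied"
    ts: float


def policy_gate(spec: StepSpec, inputs: dict, allow_high_risk: bool) -> bool:
    """Pre-execution control: the decision exists *before* the step runs.
    A real gate would also inspect `inputs`; this toy one only checks risk."""
    return spec.risk != "high" or allow_high_risk


def run_pipeline(steps: list[StepSpec], inputs: dict,
                 allow_high_risk: bool = False) -> tuple[dict, list[TraceEvent]]:
    trace: list[TraceEvent] = []
    data = inputs
    for spec in steps:
        if not policy_gate(spec, data, allow_high_risk):
            trace.append(TraceEvent(spec.name, data, {}, "denied", time.time()))
            break  # deterministic control flow: a denied gate halts the run
        out = spec.fn(data)
        trace.append(TraceEvent(spec.name, data, out, "allowed", time.time()))
        data = out
    return data, trace


def trace_digest(trace: list[TraceEvent]) -> str:
    """Content hash over the trace (timestamps zeroed), so two runs can be
    compared byte-for-byte when checking reproducibility."""
    blob = json.dumps([asdict(e) | {"ts": 0.0} for e in trace], sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


if __name__ == "__main__":
    steps = [
        StepSpec("summarize", "low", lambda d: {**d, "summary": d["text"][:20]}),
        StepSpec("send_email", "high", lambda d: {**d, "sent": True}),
    ]
    result, trace = run_pipeline(steps, {"text": "quarterly incident report"})
    # The high-risk step is denied by default; the denial is itself traced.
    print([(e.step, e.decision) for e in trace], trace_digest(trace))
```

The point of the digest is that "can we reproduce this result?" becomes a checkable property rather than a claim: re-run, re-hash, compare.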
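And a similarly hypothetical sketch of the discovery/production split: capabilities are freely callable in discovery mode, but production mode only exposes capabilities that passed an explicit, recorded promotion step.

```python
# Illustrative sketch only. Promotion is an explicit, checkable event,
# not implicit drift from "demo that worked" to "thing in production".
from dataclasses import dataclass
from typing import Callable


@dataclass
class Capability:
    name: str
    fn: Callable[[dict], dict]
    validated: bool = False


class Registry:
    def __init__(self) -> None:
        self._caps: dict[str, Capability] = {}

    def register(self, cap: Capability) -> None:
        # New capabilities land in discovery mode by default.
        self._caps[cap.name] = cap

    def promote(self, name: str, validators: list[Callable[[Capability], bool]]) -> bool:
        """Run every validator; the capability is production-eligible
        only if all of them pass."""
        cap = self._caps[name]
        cap.validated = all(v(cap) for v in validators)
        return cap.validated

    def get(self, name: str, mode: str = "production") -> Capability:
        cap = self._caps[name]
        if mode == "production" and not cap.validated:
            raise PermissionError(f"{name} has not been promoted to production")
        return cap
```

Making promotion a first-class event is what gives you the audit hook: you can ask not just "what ran?" but "who validated it, against which checks, before it was allowed to run?"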
Originally posted by u/gfernandf on r/ArtificialInteligence
