Sharing something we’ve been building: **Lumen**, a browser agent framework that takes a purely vision-based approach, drawing on SOTA techniques from browser-agent and VLA research. No DOM parsing, no CSS selectors, no accessibility trees. Just screenshots in, actions out.

GitHub: https://github.com/omxyz/lumen

**Prelim results:** We ran a 25-task WebVoyager subset (stratified across 15 sites, 3 trials each, LLM-as-judge scored), with all frameworks running Claude Sonnet 4.6.

**SOTA techniques we built on:**

- Pure vision loop building on WebVoyager (He et al., 2024) and Pix2Act (Shaw et al., 2023), but fully markerless: no Set-of-Mark overlays, just the model's native spatial reasoning.
- Two-tier history compression (screenshot dropping + LLM summarization at 80% context utilization), inspired by recent context engineering work from Manus and LangChain’s Deep Agents SDK, tuned for vision-heavy trajectories.
- Three-layer stuck detection with escalating nudges and checkpoint backtracking to break action loops.
- ModelVerifier termination gate: a separate model call verifies task completion against the screenshot before accepting "done," closing the hallucinated-completion failure mode.
- Child delegation for sub-tasks (similar to Agent-E's hierarchical split).
- SiteKB for domain-specific navigation hints (similar to Agent-E's skills harvesting).

Also supports multiple providers (Anthropic/Google/OpenAI/Ollama), various browser infras (Browserbase, Hyperbrowser, etc.), deterministic replays, session resumption, streaming events, safety primitives (domain allowlists, pre-action hooks), and action caching.

Example:

```typescript
import { Agent } from "@omxyz/lumen";

const result = await Agent.run({
  model: "anthropic/claude-sonnet-4-6",
  browser: { type: "local" },
  instruction: "Go to news.ycombinator.com and tell me the title of the top story.",
});
```

Would love feedback!

submitted by /u/kwk236
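For anyone curious how the two-tier history compression works, here's a rough sketch of the idea. All names, the token estimator, and the summarizer stub are illustrative, not Lumen's actual API: tier 1 drops old screenshots once estimated context passes the threshold; tier 2 collapses the older half of the trajectory into an LLM-written summary if that's still not enough.

```typescript
// Illustrative sketch of two-tier history compression (not Lumen's real API).

interface Step {
  action: string;
  screenshot?: string; // e.g. base64 image; dropped first under context pressure
}

// Stand-in for a real tokenizer: crude character/image-based estimate.
const estimateTokens = (steps: Step[]): number =>
  steps.reduce((n, s) => n + s.action.length / 4 + (s.screenshot ? 1500 : 0), 0);

// Stand-in for an LLM summarization call.
const summarize = (steps: Step[]): Step => ({
  action: `Summary of ${steps.length} earlier steps: ${steps
    .map((s) => s.action)
    .join("; ")}`,
});

function compressHistory(steps: Step[], budget: number, threshold = 0.8): Step[] {
  let history = [...steps];
  // Tier 1: drop screenshots from all but the most recent step.
  if (estimateTokens(history) > budget * threshold) {
    history = history.map((s, i) =>
      i < history.length - 1 ? { action: s.action } : s,
    );
  }
  // Tier 2: still over budget? Summarize the older half into one text step.
  if (estimateTokens(history) > budget * threshold && history.length > 2) {
    const cut = Math.floor(history.length / 2);
    history = [summarize(history.slice(0, cut)), ...history.slice(cut)];
  }
  return history;
}
```

The point of the two tiers: screenshots dominate token usage in vision-heavy trajectories, so dropping them recovers most of the budget cheaply before paying for a summarization call.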
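The stuck-detection layer can be pictured as loop detection plus an escalation ladder. A minimal sketch, with hypothetical names and a simple repeated-window heuristic standing in for whatever Lumen actually does:

```typescript
// Illustrative sketch: detect an action loop by comparing the most recent
// window of actions against the window before it.
function detectLoop(actions: string[], window = 3): boolean {
  if (actions.length < window * 2) return false;
  const recent = actions.slice(-window).join("|");
  const prior = actions.slice(-window * 2, -window).join("|");
  return recent === prior;
}

type Intervention = "nudge" | "strong-nudge" | "backtrack";

// Three layers: gentle hint, explicit redirection, then restore a
// checkpoint and retry from there.
function escalate(strikes: number): Intervention {
  if (strikes <= 1) return "nudge";
  if (strikes === 2) return "strong-nudge";
  return "backtrack";
}
```

Escalation matters because a single nudge often isn't enough: models that loop once tend to loop again, and checkpoint backtracking gives the agent a clean state instead of a poisoned context.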
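The ModelVerifier gate is conceptually simple: don't trust the agent's own "done"; have a second, independent model call check the claim against the final screenshot. A sketch with the verifier call stubbed out (the interface and names are assumptions for illustration):

```typescript
// Illustrative sketch of a verifier termination gate (not Lumen's real API).

interface VerifierInput {
  instruction: string;
  claimedAnswer: string;
  screenshot: string; // e.g. base64 of the final page
}

// Stand-in for a separate model call that judges whether the screenshot
// actually supports the claimed completion.
type Verifier = (input: VerifierInput) => Promise<{ ok: boolean; reason: string }>;

async function acceptDone(
  input: VerifierInput,
  verify: Verifier,
): Promise<"accepted" | "rejected"> {
  const verdict = await verify(input);
  // Only accept "done" if the verifier confirms it against the screenshot;
  // on rejection the agent loop continues, closing the hallucinated-completion hole.
  return verdict.ok ? "accepted" : "rejected";
}
```

On rejection, the verifier's reason can be fed back into the agent's context as a correction, which is usually cheaper than a full retry.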
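As a taste of the safety primitives, a domain allowlist can be implemented as a pre-action hook that vetoes navigation outside approved hosts. This is a guess at the shape, not Lumen's actual hook signature:

```typescript
// Illustrative sketch of a domain-allowlist pre-action hook.
type BrowserAction =
  | { type: "navigate"; url: string }
  | { type: "click"; x: number; y: number };

function makeAllowlistHook(allowed: string[]) {
  return (action: BrowserAction): boolean => {
    if (action.type !== "navigate") return true; // only gate navigation
    const host = new URL(action.url).hostname;
    // Permit exact matches and subdomains of allowlisted hosts.
    return allowed.some((d) => host === d || host.endsWith("." + d));
  };
}
```

Matching on the parsed hostname (rather than substring-matching the raw URL) avoids the classic `evil.com/?q=news.ycombinator.com` bypass.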
Originally posted by u/kwk236 on r/ArtificialInteligence

