Most computer-use demos work by feeding a screenshot to an LLM and asking it to output pixel coordinates. This works surprisingly often but fails in predictable ways: Retina scaling, window repositioning, UI density, and any layout change break it.

The approach I’ve been exploring in opendesk is: query the platform’s native accessibility API first (AppleScript on macOS, AT-SPI2 on Linux, UI Automation on Windows), get the actual interactive elements with their labels and bounding boxes, then draw numbered chips on those elements before the screenshot ever reaches the LLM. The model never guesses coordinates. It reasons about what to do and references elements by their mark number; the system already knows exactly where mark 7 is. Mouse coordinates become a fallback for elements with no accessible label: canvas areas, video players, games. (Rough sketch of the mark-drawing step at the end of the post.)

Another idea in the same vein: when replaying a recorded workflow, don’t replay coordinates. Store the trajectory as a sequence of events and screenshots, and at replay time feed that as context to the LLM, which re-executes it against the current screen state. This makes replay adaptive rather than brittle. (Sketch of this at the end too.)

Waiting for feedback from the community! 😃

GitHub: https://github.com/vitalops/opendesk
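
A simplified sketch of the mark-drawing step (Python with Pillow). The `AXElement` shape and function names here are illustrative, not the actual opendesk code; assume the platform accessibility layer has already returned the interactive elements with labels and screen-space bounding boxes:

```python
# Illustrative sketch, not opendesk's real code. Assumes the accessibility
# layer (System Events / AT-SPI2 / UI Automation) already returned the
# interactive elements with labels and screen-space bounding boxes.
from dataclasses import dataclass
from PIL import Image, ImageDraw

@dataclass
class AXElement:
    label: str                       # accessible name, e.g. "Submit"
    bbox: tuple[int, int, int, int]  # (x, y, width, height) in screenshot pixels

def draw_marks(screenshot: Image.Image,
               elements: list[AXElement]) -> dict[int, AXElement]:
    """Overlay a numbered chip on each element; return mark -> element."""
    draw = ImageDraw.Draw(screenshot)
    marks: dict[int, AXElement] = {}
    for i, el in enumerate(elements, start=1):
        x, y, w, h = el.bbox
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        # Numbered chip just above the element's top-left corner.
        draw.rectangle([x, max(0, y - 14), x + 10 * len(str(i)) + 6, y], fill="red")
        draw.text((x + 3, max(0, y - 13)), str(i), fill="white")
        marks[i] = el
    return marks
```

The model then gets the annotated screenshot plus a text index like `7: Submit (button)` and answers with something like `click(mark=7)`; the executor resolves the mark back to the stored bounding box, so pixel coordinates never pass through the model.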
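
And a sketch of the adaptive-replay idea: serialize the recorded trajectory (one event plus one screenshot per step) into the prompt and ask the model to reproduce the intent of each step against the live screen. The chat-message format below assumes OpenAI-style image content blocks; the names are again illustrative:

```python
# Illustrative sketch of adaptive replay. Assumes an OpenAI-style chat
# format with base64 image content blocks; not opendesk's real code.
import base64
import json
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    event: dict           # e.g. {"action": "click", "mark": 7, "label": "Submit"}
    screenshot_png: bytes

def _image_block(png: bytes) -> dict:
    """Wrap a PNG as a base64 data-URL image content block."""
    return {"type": "image_url", "image_url": {
        "url": "data:image/png;base64," + base64.b64encode(png).decode()}}

def build_replay_messages(trajectory: list[TrajectoryStep],
                          current_screenshot_png: bytes) -> list[dict]:
    """Give the model the recorded run as context plus the live screen."""
    content = [{"type": "text",
                "text": "This workflow was recorded earlier. Re-execute it "
                        "against the current screen, adapting to any layout "
                        "changes. Reference elements by mark number."}]
    for i, step in enumerate(trajectory, start=1):
        content.append({"type": "text",
                        "text": f"Recorded step {i}: {json.dumps(step.event)}"})
        content.append(_image_block(step.screenshot_png))
    content.append({"type": "text", "text": "Current screen:"})
    content.append(_image_block(current_screenshot_png))
    return [{"role": "user", "content": content}]
```

Because the current screenshot is annotated with marks just like a live run, the replayed actions come back as mark references too, and the same executor handles them.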
Originally posted by u/metalvendetta on r/ArtificialInteligence
