Title: I built a pharmacy training app almost entirely with AI (Gemini, ChatGPT, Claude Code). I didn’t write the code — my role was testing, bug-hunting, and process. Am I on the right track?
Upfront disclaimer: I did not write this code. Effectively none of it. The entire app is AI-assisted — I worked across Gemini, ChatGPT, and Claude Code over several weeks. I can read Python and I know exactly what I want the app to do, but the actual implementation came from the models. I’m posting because the part I do own is everything around the code — the testing, the bug-hunting, the documentation discipline, and refusing to let the AI mark things “done” before they’re verified. I want honest feedback on whether that’s a sound way to work, or whether I’m missing something an experienced engineer would consider obvious.
The app

A training and workflow tool for pharmacy technicians. It runs on Android via Pydroid 3 and on desktop. Roughly 2,600 lines of Python (Tkinter UI, SQLite storage). Features:
- Practice quizzes (brand/generic, look-alike/sound-alike drug pairs, red-flag scenarios)
- Drug lookup with interaction flags
- Clinical calculators: insulin days-supply, pediatric weight-based dosing, body surface area, Cockcroft-Gault creatinine clearance
- DEA registration number verification (real checksum algorithm)
- A partial-fill ledger, audit log, and PIN-protected admin panel
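
For readers unfamiliar with the DEA check digit mentioned in the feature list, here is a minimal sketch of the commonly documented rule: add the 1st, 3rd, and 5th digits, add twice the sum of the 2nd, 4th, and 6th digits, and compare the last digit of that total to the 7th digit. This is not the app's actual code. The function name `dea_checksum_ok` is hypothetical, the prefix check is deliberately loose, and the sample number is made up for illustration.

```python
import re


def dea_checksum_ok(dea_number: str) -> bool:
    """Return True if the seven-digit part passes the DEA check-digit rule.

    Hypothetical helper for illustration. It checks the arithmetic only and
    does not confirm that the registration actually exists.
    """
    value = dea_number.strip().upper()
    # Two leading characters (a letter, then a letter or '9'), then 7 digits.
    # Which first letters are valid changes over time, so that is not checked.
    match = re.fullmatch(r"[A-Z][A-Z9](\d{7})", value)
    if match is None:
        return False
    digits = [int(ch) for ch in match.group(1)]
    odd_sum = digits[0] + digits[2] + digits[4]    # 1st, 3rd, 5th digits
    even_sum = digits[1] + digits[3] + digits[5]   # 2nd, 4th, 6th digits
    total = odd_sum + 2 * even_sum
    # The last digit of the total must equal the 7th (check) digit.
    return total % 10 == digits[6]
```

With the made-up number `AB1234563`, the odd-position sum is 9 and the even-position sum is 12, so the total is 9 + 24 = 33 and the expected check digit is 3, which matches. Changing the last digit to 4 makes the check fail.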
How the workflow evolved

This was not a straight line. The progression, roughly:

Early on: I started with “AI, build me a thing.” I was already on something like my 13th iteration of a single giant Python file before I had any real process. I also did my own research: I pulled real reference material (my state board of pharmacy regulations, ISMP medication-safety sheets, the pharmacy adjudication/IC+ workflow) so the app wasn’t built purely on whatever the model invented.

The scaling problem: The app got rebuilt into an even larger single file. That’s when “ask AI, paste, run” stopped working: in every new chat session the AI had no memory of the project, so I’d burn half my time re-explaining. My fix was to start writing documents the AI could read to get itself up to speed: a behavior spec, an audit log, a “current state” doc, a known-bugs list, a decisions log. That set of docs became a real system.

A dedicated stress-testing pass: I spent a full day doing nothing but trying to break the app. I saved snapshot bundles as I went (I have builds literally named “crunch,” “post-audit,” “max-stress,” and “max-stress-final”). At one point my test file was larger than the app itself. This is also where I moved the heavy lifting to Claude Code.

Evidence labeling: I noticed the AI repeatedly told me things were “fixed” or “done” when they weren’t. So I started forcing every claim to carry a label: CONFIRMED (I ran it on the device and it works) vs UNVERIFIED (code is written, nobody has run it). “I fixed the bug” and “I wrote a fix I never tested” are completely different statements, and the models will state the first while meaning the second.

Modular refactor: Most recently, the single file got too large to test sanely. I had the AI split it into modules (config, theme, clinical data, pure logic, database layer, UI, entrypoint), primarily so the logic could be tested headlessly without launching a GUI window. The logic module now has a 29-case regression test that runs in about a second with no UI.

An honest wrinkle: my own documentation system is somewhat messy. I had the AI rebuild my “harness” of control docs a few times, and it left duplicate copies scattered across my Drive because the tooling can create files but can’t delete or rename them. So even my note-keeping has drift, and I’m learning to manage that.
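
To make the "pure logic, tested headlessly" idea from the modular refactor concrete, here is a rough sketch of what one such test could look like. It is not taken from the app. The function `body_surface_area_m2` and the test class are hypothetical names, and the Mosteller formula (the square root of height in cm times weight in kg divided by 3600) is an assumption, since the post does not say which body surface area formula the app uses.

```python
import math
import unittest


def body_surface_area_m2(height_cm: float, weight_kg: float) -> float:
    """Body surface area in square metres (Mosteller formula, assumed here)."""
    if height_cm <= 0 or weight_kg <= 0:
        raise ValueError("height and weight must be positive")
    return math.sqrt(height_cm * weight_kg / 3600)


class BodySurfaceAreaTests(unittest.TestCase):
    """Runs with no Tkinter import, so no GUI window is needed."""

    def test_known_value(self):
        # 180 cm and 80 kg: sqrt(14400 / 3600) = sqrt(4) = 2.0
        self.assertAlmostEqual(body_surface_area_m2(180, 80), 2.0)

    def test_rejects_non_positive_input(self):
        with self.assertRaises(ValueError):
            body_surface_area_m2(0, 80)
```

Saved as a file such as `test_logic.py` (a hypothetical name), this could be run with `python -m unittest test_logic`. The point of the design is that the function takes plain numbers and returns a plain number, so nothing in the test depends on the UI or the database.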
Things I found out (the testing is genuinely my contribution)

The AI writes something, it looks correct, the automated tests pass, and then I run it on real hardware and it breaks in ways nothing predicted. Two concrete examples:

PIN not surviving a restart. I set a new admin PIN, closed the app, reopened it, and couldn’t log in. The AI’s immediate assumption was “the database isn’t persisting.” I investigated and the database was saving correctly. The real cause: the PIN entry field is masked (dots, no echo) with no confirmation step, so a single typo silently locks you out of a PIN you can’t reproduce. No automated test would ever catch that. I only found it by being the ordinary user who fat-fingers a PIN. The fix was a confirmation re-entry dialog.

Lockout not persisting. The account lockout (3 failed attempts = locked) didn’t survive a force-quit. I found it by force-quitting the app while locked out and walking straight back in. The lockout timer was only held in memory; the fix was to persist it to the database.

Neither bug appeared in any automated test. I found both by trying to break the app the way a tired technician at 7pm would.

Treating AI-generated data as untrusted. The drug lists, law summaries, and vaccine information were all AI-generated, and I don’t trust any of it clinically. So every screen that displays that data carries a visible UNVERIFIED banner and a link to the authoritative source (CDC immunization schedule, state board of pharmacy). I would rather the app tell the user “verify this yourself” than quietly be wrong about a dose.
My actual question

I’m not a developer. The code is not mine. But the testing, the bug-hunting, the “no, you don’t get to call that done,” and the discipline of keeping clinical data honestly labeled: that’s the work I’m actually doing. I don’t know if that’s the right thing to be spending my effort on, or whether I’m overlooking something fundamental that a real engineer would prioritize first. Is this a reasonable, sustainable way to build with these tools, or am I building a house of cards? Honest feedback genuinely welcome. I’d rather hear the problems now.

submitted by /u/lostsoulfs
Originally posted by u/lostsoulfs on r/ClaudeCode
