The gap between "measured prompt performance" and "systematically improved prompt" is where most teams get stuck. PromptFoo gives you the measurement; AutoResearch gives you the iteration pattern. AutoPrompter combines both: I built an autonomous prompt optimization system that merges PromptFoo-style validation with AutoResearch-style iterative improvement.

The Optimizer LLM generates a synthetic dataset from the task description, evaluates the Target LLM against the current prompt, scores outputs on accuracy, F1, or semantic similarity, analyzes failure cases, and produces a refined prompt. A persistent ledger prevents duplicate experiments and maintains optimization history across iterations.

Usage example, optimizing a prompt for technical blog writing:

`python main.py --config config_blogging.yaml`

What this actually unlocks for serious work: prompt quality becomes a reproducible, traceable artifact. You validate near-optimality before deployment rather than discovering regressions in production.

Open source on GitHub: https://github.com/gauravvij/AutoPrompter

How it works in detail: the system operates in a continuous loop where an Optimizer LLM refines prompts for a Target LLM based on empirical performance data.

- **Dataset Generation**: The Optimizer LLM (Gemini 3.1 Flash; customizable through config.yaml) generates a synthetic dataset of input/output pairs based on the task description.
- **Iterative Improvement**: The Target LLM (Qwen 3.5 9b) is tested against the current prompt using the generated dataset. Performance is measured with a defined metric (accuracy, F1, semantic similarity, etc.), and the Optimizer LLM analyzes failures and successes to generate a refined prompt.
- **Experiment Ledger**: Every iteration is recorded in a persistent ledger to prevent duplicate experiments and track progress.
- **Context Management**: The system manages the history of experiments to provide the Optimizer LLM with relevant context without exceeding window limits.
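The loop described above can be sketched as a short driver with every LLM call stubbed out as a plain callable. All names here are illustrative, not the repo's actual API:

```python
from typing import Callable

def optimize_prompt(
    initial_prompt: str,
    generate_dataset: Callable[[str], list],   # task description -> [(input, expected), ...]
    run_target: Callable[[str, str], str],     # (prompt, input) -> model output
    score: Callable[[list, list], float],      # (outputs, expected) -> metric value
    refine: Callable[[str, float, list], str], # (prompt, score, failures) -> refined prompt
    task_description: str,
    iterations: int = 5,
):
    """Iteratively refine a prompt against a synthetic dataset.

    Returns the best (prompt, score) pair seen across all iterations.
    """
    dataset = generate_dataset(task_description)
    prompt = initial_prompt
    best_prompt, best_score = initial_prompt, float("-inf")
    for _ in range(iterations):
        outputs = [run_target(prompt, x) for x, _ in dataset]
        expected = [y for _, y in dataset]
        s = score(outputs, expected)
        if s > best_score:
            best_prompt, best_score = prompt, s
        # Collect (input, expected, actual) triples for the failed cases.
        failures = [(x, y, o) for (x, y), o in zip(dataset, outputs) if o != y]
        if not failures:
            break  # perfect score on the synthetic dataset; stop early
        prompt = refine(prompt, s, failures)
    return best_prompt, best_score
```

The key design point is that the Optimizer LLM only ever sees empirical evidence (score plus concrete failure triples), never its own guesses about quality.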
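For the scoring step, here is a minimal sketch of two of the named metrics, exact-match accuracy and token-level F1. These are my own simplified versions, not necessarily what the repo implements:

```python
from collections import Counter

def accuracy(outputs: list, expected: list) -> float:
    """Fraction of outputs that exactly match the expected answer."""
    return sum(o.strip() == e.strip() for o, e in zip(outputs, expected)) / len(expected)

def token_f1(output: str, expected: str) -> float:
    """Token-overlap F1, as commonly used in QA-style evaluation."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact-match accuracy suits classification-style tasks; token F1 is more forgiving for free-form answers. Semantic similarity would additionally need an embedding model, which is why it is omitted here.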
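The persistent experiment ledger could be as simple as an append-only JSONL file keyed by a prompt hash. This is a hypothetical sketch of the idea, not the repo's actual implementation:

```python
import hashlib
import json
from pathlib import Path

class ExperimentLedger:
    """Append-only ledger of (prompt, score) records, persisted as JSONL.

    Deduplicates by SHA-256 prompt hash so the same prompt is never
    evaluated twice, even across separate runs.
    """

    def __init__(self, path: str = "ledger.jsonl"):
        self.path = Path(path)
        self.seen = set()
        if self.path.exists():
            # Rebuild the dedup set from prior runs.
            for line in self.path.read_text().splitlines():
                self.seen.add(json.loads(line)["prompt_hash"])

    @staticmethod
    def _hash(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def already_tried(self, prompt: str) -> bool:
        return self._hash(prompt) in self.seen

    def record(self, prompt: str, score: float) -> None:
        h = self._hash(prompt)
        if h in self.seen:
            return  # skip duplicate experiments
        self.seen.add(h)
        with self.path.open("a") as f:
            f.write(json.dumps({"prompt_hash": h, "prompt": prompt, "score": score}) + "\n")
```

Because the file is append-only JSONL, the history doubles as an audit trail: you can replay the whole optimization trajectory after the fact.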
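For context management, one plausible policy is to always keep the best-scoring experiment and then backfill with the most recent ones until a budget is exhausted. Again, this is an illustrative sketch under my own assumptions, not the repo's code:

```python
def build_optimizer_context(history: list, budget_chars: int = 4000) -> str:
    """Select experiments that fit a character budget for the Optimizer LLM.

    `history` is a list of {"prompt": str, "score": float} dicts in
    chronological order. The best-scoring entry is always included first,
    then the most recent entries fill the remaining space.
    """
    if not history:
        return ""
    best = max(history, key=lambda h: h["score"])
    chosen = [best]
    for entry in reversed(history):  # most recent first
        if entry is not best:
            chosen.append(entry)
    lines, used = [], 0
    for entry in chosen:
        line = f"score={entry['score']:.3f} prompt={entry['prompt']}"
        if used + len(line) > budget_chars:
            break  # stop before exceeding the window budget
        lines.append(line)
        used += len(line)
    return "\n".join(lines)
```

A real implementation would budget in tokens rather than characters, but the selection policy (best first, then recency) is the part that matters.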
One open area for contribution: dataset quality depends heavily on the Optimizer LLM's capability. Curious how others working on automated prompt optimization are approaching this?
Originally posted by u/gvij on r/ArtificialInteligence
