Original Reddit post

I get asked this constantly. Here’s the actual answer instead of the tutorial answer. Prompt engineering is right when:

  • Task is general-purpose (support, summarisation, Q&A across varied topics)
  • Your underlying data changes frequently (news, live product data, user-generated content)
  • You have fewer than ~500 high-quality labelled pairs
  • You need to ship fast and iterate based on real usage, not assumptions
  • You haven’t yet measured your specific failure mode in production. This is the most important one.

Fine-tuning is right when:

  • Format or tone needs to be absolutely consistent, and prompting keeps drifting on edge cases
  • Domain is specialised enough that base models consistently miss terminology (regulatory, clinical, highly technical product docs)
  • You’re at 500K+ calls/month and want to distil behaviour into a smaller/cheaper model to cut inference costs
  • Hard latency constraint and prompts are getting long enough to hurt response times
  • You have 1,000+ trusted, high-quality labelled examples from real production data, not synthetic generation

The mistake I keep seeing: teams decide to fine-tune in week 2 of a project because “we know the domain is specialised.” Then they build a synthetic training dataset based on their assumptions about what the failure cases will look like. The problem: actual production usage differs from assumed usage, almost every time. The synthetic dataset doesn’t match the real distribution, and the fine-tuned model fails on exactly the patterns that mattered.

Our actual process:

  • Start with prompt engineering. Always.
  • Ship it.
  • Collect real failure cases from production interactions.
  • Identify the specific pattern that’s failing.
  • Fine-tune on that specific failure mode, using production data, with the examples that actually represent the problem.

Why the sequence matters (concrete example): a client saved $18K/month by fine-tuning GPT-3.5 on their classification task instead of calling GPT-4: same accuracy, roughly 1/8th the cost. But those training examples only existed after 3 months of production data. If they’d fine-tuned on synthetic examples in month 1, the training distribution would have been wrong, and the model would have been optimised for the wrong failure modes. The 3-month wait produced a model that actually worked; rushing to fine-tune would have produced technical debt.

At what call volume does fine-tuning become worth the overhead for you? Curious whether the 500K/month threshold matches others’ experience.
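The two checklists can be collapsed into a rough decision helper. This is a minimal sketch: the thresholds (1,000+ production examples, 500K calls/month, an unmeasured failure mode meaning "keep prompting") come from the post; the field names and the exact combination logic are my illustrative assumptions, not a real library.

```python
from dataclasses import dataclass

@dataclass
class ProjectState:
    labelled_production_examples: int  # trusted labelled pairs from real usage
    calls_per_month: int
    measured_failure_mode: bool   # identified a specific failing pattern in production?
    strict_format_or_tone: bool   # prompting keeps drifting on edge cases?

def recommend(state: ProjectState) -> str:
    """Rough encoding of the post's decision rules (illustrative only)."""
    if not state.measured_failure_mode:
        # The post's most important rule: no measured production failure yet
        # means you haven't earned the right to fine-tune.
        return "prompt engineering"
    enough_data = state.labelled_production_examples >= 1000
    worth_overhead = state.calls_per_month >= 500_000 or state.strict_format_or_tone
    if enough_data and worth_overhead:
        return "fine-tune on the measured failure mode"
    return "prompt engineering"
```

For example, `recommend(ProjectState(1500, 600_000, True, False))` suggests fine-tuning, while any project without a measured failure mode gets "prompt engineering" regardless of volume.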
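The "collect real failure cases" step usually amounts to filtering production logs down to human-corrected failures and writing them out in the chat-format JSONL that fine-tuning endpoints accept. A sketch under assumed log fields (`prompt`, `model_output`, `corrected_output`, `flagged` are hypothetical names, not a real schema):

```python
import json

# Hypothetical production log records; the field names are assumptions.
logs = [
    {"prompt": "Classify: 'refund not received'", "model_output": "billing",
     "corrected_output": "refunds", "flagged": True},
    {"prompt": "Classify: 'app crashes on login'", "model_output": "bugs",
     "corrected_output": None, "flagged": False},
]

def to_training_examples(records):
    """Keep only human-flagged failures that have a trusted correction."""
    examples = []
    for r in records:
        if r["flagged"] and r["corrected_output"]:
            examples.append({"messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["corrected_output"]},
            ]})
    return examples

def write_jsonl(examples, path):
    """One JSON object per line, the usual fine-tuning file format."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

The point of the filter is the post's point: only examples that represent a real, observed failure pattern make it into the training set, so the distribution is production's, not your assumptions'.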
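The cost claim can be sanity-checked with back-of-envelope arithmetic. The per-call prices and one-off overhead below are placeholders I made up for illustration; only the "roughly 1/8th the cost" ratio comes from the post.

```python
# Illustrative per-call costs (placeholders, not real price quotes).
cost_per_call_large = 0.004        # calling the large model directly
cost_per_call_finetuned = 0.0005   # ~1/8th of that, per the post's ratio

calls_per_month = 500_000
monthly_saving = calls_per_month * (cost_per_call_large - cost_per_call_finetuned)
# At these placeholder prices: $1,750/month saved at the 500K threshold.

# Break-even: how many calls pay back a one-off fine-tuning overhead
# (training runs plus engineering time)?
finetune_overhead = 2000.0  # assumed one-off cost
breakeven_calls = finetune_overhead / (cost_per_call_large - cost_per_call_finetuned)
```

The useful takeaway is the shape of the formula, not the numbers: break-even volume scales linearly with the fixed overhead and inversely with the per-call price gap, which is why the threshold is a per-team question rather than a universal constant.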

Originally posted by u/Individual-Bench4448 on r/ArtificialInteligence