Original Reddit post

I get asked this constantly. Here’s the actual answer instead of the tutorial answer. Prompt engineering is right when:

  • Task is general-purpose (support, summarisation, Q&A across varied topics)
  • Your underlying data changes frequently (news, live product data, user-generated content)
  • You have fewer than ~500 high-quality labelled pairs
  • You need to ship fast and iterate based on real usage, not assumptions
  • You haven’t yet measured your specific failure mode in production. This is the most important one.

Fine-tuning is right when:

  • Format or tone needs to be absolutely consistent, and prompting keeps drifting on edge cases
  • Domain is specialised enough that base models consistently miss terminology (regulatory, clinical, highly technical product docs)
  • You’re at 500K+ calls/month and want to distil behaviour into a smaller/cheaper model to cut inference costs
  • Hard latency constraint and prompts are getting long enough to hurt response times
  • You have 1,000+ trusted, high-quality labelled examples from real production data, not synthetic generation

The mistake I keep seeing: teams decide to fine-tune in week 2 of a project because “we know the domain is specialised.” Then they build a synthetic training dataset based on their assumptions about what the failure cases will look like. The problem: actual production usage differs from assumed usage, almost every time. The synthetic dataset doesn’t match the real distribution, and the fine-tuned model fails on exactly the patterns that mattered.

Our actual process:

  • Start with prompt engineering. Always.
  • Ship it.
  • Collect real failure cases from production interactions.
  • Identify the specific pattern that’s failing.
  • Fine-tune on that specific failure mode, using production data, with the examples that actually represent the problem.

Why the sequence matters (concrete example): a client saved $18K/month by fine-tuning GPT-3.5 on their classification task instead of calling GPT-4: same accuracy, roughly 1/8th the cost. But those training examples only existed after 3 months of production data. If they’d fine-tuned on synthetic examples in month 1, the training distribution would have been wrong, and the model would have been optimised for the wrong failure modes. The 3-month wait produced a model that actually worked; rushing to fine-tune would have produced technical debt.

At what call volume does fine-tuning become worth the overhead for you? Curious whether the 500K/month threshold matches others’ experience.
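The two checklists can be collapsed into a rough decision helper. This is a minimal sketch: the thresholds (1,000+ production examples, 500K calls/month, an unmeasured failure mode meaning "keep prompting") come from the post; the field names and the exact combination logic are my illustrative assumptions, not a real library.

```python
from dataclasses import dataclass

@dataclass
class ProjectState:
    labelled_production_examples: int  # trusted labelled pairs from real usage
    calls_per_month: int
    measured_failure_mode: bool   # identified a specific failing pattern in production?
    strict_format_or_tone: bool   # prompting keeps drifting on edge cases?

def recommend(state: ProjectState) -> str:
    """Rough encoding of the post's decision rules (illustrative only)."""
    if not state.measured_failure_mode:
        # The post's most important rule: no measured production failure yet
        # means you haven't earned the right to fine-tune.
        return "prompt engineering"
    enough_data = state.labelled_production_examples >= 1000
    worth_overhead = state.calls_per_month >= 500_000 or state.strict_format_or_tone
    if enough_data and worth_overhead:
        return "fine-tune on the measured failure mode"
    return "prompt engineering"
```

For example, `recommend(ProjectState(1500, 600_000, True, False))` suggests fine-tuning, while any project without a measured failure mode gets "prompt engineering" regardless of volume.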
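The "collect real failure cases" step usually amounts to filtering production logs down to human-corrected failures and writing them out in the chat-format JSONL that fine-tuning endpoints accept. A sketch under assumed log fields (`prompt`, `model_output`, `corrected_output`, `flagged` are hypothetical names, not a real schema):

```python
import json

# Hypothetical production log records; the field names are assumptions.
logs = [
    {"prompt": "Classify: 'refund not received'", "model_output": "billing",
     "corrected_output": "refunds", "flagged": True},
    {"prompt": "Classify: 'app crashes on login'", "model_output": "bugs",
     "corrected_output": None, "flagged": False},
]

def to_training_examples(records):
    """Keep only human-flagged failures that have a trusted correction."""
    examples = []
    for r in records:
        if r["flagged"] and r["corrected_output"]:
            examples.append({"messages": [
                {"role": "user", "content": r["prompt"]},
                {"role": "assistant", "content": r["corrected_output"]},
            ]})
    return examples

def write_jsonl(examples, path):
    """One JSON object per line, the usual fine-tuning file format."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
```

The point of the filter is the post's point: only examples that represent a real, observed failure pattern make it into the training set, so the distribution is production's, not your assumptions'.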
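The cost claim can be sanity-checked with back-of-envelope arithmetic. The per-call prices and one-off overhead below are placeholders I made up for illustration; only the "roughly 1/8th the cost" ratio comes from the post.

```python
# Illustrative per-call costs (placeholders, not real price quotes).
cost_per_call_large = 0.004        # calling the large model directly
cost_per_call_finetuned = 0.0005   # ~1/8th of that, per the post's ratio

calls_per_month = 500_000
monthly_saving = calls_per_month * (cost_per_call_large - cost_per_call_finetuned)
# At these placeholder prices: $1,750/month saved at the 500K threshold.

# Break-even: how many calls pay back a one-off fine-tuning overhead
# (training runs plus engineering time)?
finetune_overhead = 2000.0  # assumed one-off cost
breakeven_calls = finetune_overhead / (cost_per_call_large - cost_per_call_finetuned)
```

The useful takeaway is the shape of the formula, not the numbers: break-even volume scales linearly with the fixed overhead and inversely with the per-call price gap, which is why the threshold is a per-team question rather than a universal constant.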

Originally posted by u/Individual-Bench4448 on r/ArtificialInteligence