Original Reddit post

Good afternoon fellas,

Some friends and I are looking to fine-tune the qwen3-coder-next model for pentesting and cybersecurity, first with continued pre-training (CPT) and then with supervised fine-tuning (SFT).

The thing is, we are not sure how high-quality the CPT data needs to be. Does quantity matter more than quality in this case? Or should we clean it up to some degree before moving on to SFT? For SFT we generate synthetic data iteratively (for example, simulating conversations that would happen during a pentest engagement).

The end goal is to have the model run for long periods of time using its agentic capabilities (which are already built in).

Any feedback or insight is greatly appreciated.

Cheers!

Originally posted by u/cryptoviksant on r/ArtificialInteligence