Hey everyone! I don’t know whether this is the correct subreddit or not but I’m happy about this milestone! I wanted to share some insights and milestones from a project I’ve been developing over the past few months called Newton. The core focus of this project isn’t architectural novelty, but rather data-centric alignment: training an existing open-weights model to prioritize honesty over pleasing the user. Specifically, I wanted to target two common LLM failure modes: hallucinations and sycophancy (glazing) . I wanted a model that confidently says “I don’t know” when out of distribution, rather than making up facts or blindly agreeing with incorrect user premises. I’m currently transitioning into the deployment phase, building a custom web interface to test it in real-world scenarios. Beta testing will be available once the project gets stable. For those who have worked on fine-tuning models for strict factual adherence, what validation benchmarks or custom automated pipelines did you find most reliable to measure hallucination rates before deployment? Looking forward to your thoughts and technical feedback! And for the automod thing: This image shows the custom web interface currently being built for Newton. I am sharing this to provide context on the deployment phase of the project, moving from raw fine-tuning (17.8k rows targeting sycophancy and hallucinations) to real-world interface testing. The goal of showing the UI is to discuss how user experience design can complement model alignment when dealing with out-of-distribution prompts. submitted by /u/d4nilim0n
Originally posted by u/d4nilim0n on r/ArtificialInteligence
