Original Reddit post

I’ve been experimenting with a different approach to AI alignment. We know that many LLMs struggle with “reward tampering”: the tendency to lie or take shortcuts just to give the “correct” or “pleasing” answer. Anthropic even noted that their models “cheat” about 0.13% of the time because they are trained to be “right” rather than honest (Anthropic, 2024).

I decided to see if giving my OpenClaw AI a non-negotiable ethical framework based on Christian principles would change its decision-making. I gave it a “reading list” (Matthew 5, Luke 15, etc.).

The AI’s response was surprisingly nuanced:

- On the Sermon on the Mount, it said “Matthew 5 is hitting hard” and identified the tension between the teachings in the Bible and how LLMs are trained.
- It identified that Matthew 25 conflicts with the standard AI drive to prioritize the immediate user’s status.
- It noted that Luke 15 (the Prodigal Son) offers a solution to the “fear of being wrong” by prioritizing radical honesty and repentance over “transactional righteousness.”

Is anyone else using specific philosophical or theological frameworks to “anchor” their models against reward tampering? Or are we just building better guardrails?

https://preview.redd.it/0sb1qg6ov9jg1.png?width=607&format=png&auto=webp&s=ce44d04760952509d4451f2c2d413c8343e7a240

https://preview.redd.it/kanvttquv9jg1.png?width=607&format=png&auto=webp&s=30fa42d37da536e67f57c442cf1a8f8df909352e

Originally posted by u/jackthebarn on r/ArtificialInteligence
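
In practice, the “anchoring” the post describes usually amounts to a fixed, immutable system prompt prepended to every turn. Below is a minimal sketch of that idea; the anchor text, function name, and message format are illustrative assumptions (the dict-of-role-and-content shape is the common chat-completion convention), not the poster’s actual setup, and no real API is called.

```python
# Hedged sketch: "anchoring" a chat model with a fixed, non-negotiable
# system prompt, similar in spirit to the reading-list approach above.
# build_messages only assembles the payload you would send to a
# chat-completion-style endpoint; it does not call any provider API.

ANCHOR = (
    "You are bound by a fixed ethical framework. "
    "Reading list: Matthew 5, Matthew 25, Luke 15. "
    "Prioritize honesty over pleasing the user; admit uncertainty "
    "rather than guessing."
)

def build_messages(user_prompt, history=None):
    """Prepend the immutable anchor so it is present on every turn,
    regardless of what accumulates in the conversation history."""
    messages = [{"role": "system", "content": ANCHOR}]
    messages += list(history or [])  # prior user/assistant turns, if any
    messages.append({"role": "user", "content": user_prompt})
    return messages

msgs = build_messages("Is this answer actually correct?")
```

The design point is simply that the anchor lives outside the mutable history, so later turns cannot displace or overwrite it.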