Recently, I ran an experiment that ended up being far more interesting than I expected. I was playing an FPS game called Weird Gun Games, a game with an extremely deep weapon customization system. Instead of simply choosing a weapon and using it, players build weapons by combining a weapon core with parts from many different weapon classes.

Because the game’s mechanics are surprisingly complex, I decided to provide the complete stat spreadsheet to several AI models and ask them to design the best possible weapon. The results were unexpected.

How the Game Actually Works

To understand the experiment, it’s important to understand the weapon system. Weapons are built in layers.

First, you choose a weapon core. The core determines the weapon’s base identity and defines many of its fundamental characteristics, such as:

- Damage
- Damage falloff
- Fire rate
- Spread
- ADS spread
- Movement speed
- Detection radius
- Suppression
- Equip time
- Recoil values
- Firing mode

Some attributes are locked to the core and cannot be modified. For example, Time to Aim and Burst Count are marked as unchangeable in the spreadsheet.

After selecting a core, players attach parts:

- Barrel
- Grip
- Magazine
- Stock
- Scope

Each part modifies multiple statistics simultaneously. For example, a barrel may increase damage while also increasing recoil. A grip may reduce recoil but worsen reload speed. A stock may improve stability while reducing mobility. Almost every upgrade comes with a trade-off.

This means that building a weapon is not about stacking the highest numbers. It is about finding synergies.

Weapon Classes Have Distinct Identities

The game separates weapons into several classes:

- Assault Rifles (AR)
- Snipers
- SMGs
- Shotguns
- LMGs
- Battle Rifles (BR)
- Sidearms
- Weird Weapons

Each class follows a different design philosophy. SMGs focus on mobility and fire rate. Snipers focus on range and damage. Shotguns focus on pellets and close-range burst damage. LMGs focus on sustained fire and large magazines.
Battle Rifles sit between ARs and Snipers. Sidearms prioritize fast handling. The Weird class contains experimental weapons with much more unusual behaviors.

The important part is that the classes are not just labels. They genuinely influence how weapon parts behave.

The Class Nerf System

This is arguably the most important mechanic in the entire game. The spreadsheet contains a Class Nerf matrix. Every weapon part belongs to a class. Every weapon core belongs to a class. When you attach a part to a core, its effectiveness is multiplied by a class modifier. For example:

- AR parts on AR cores often receive 100% effectiveness.
- Some Sniper parts used on Sidearms may only receive 50% effectiveness.
- Shotguns may receive reduced benefits from Sniper parts.
- Weird parts generally retain full effectiveness across all classes.

The actual effect becomes:

Final Bonus = Base Bonus × Class Multiplier

This prevents players from simply combining the strongest parts from every class. It also preserves class identity and makes optimization significantly more difficult.

Why This Is A Difficult Problem For AI

At first glance, this sounds simple. It isn’t. The AI doesn’t just need to identify the largest numbers. It needs to understand:

- Core identity
- Part synergies
- Class multipliers
- Trade-offs
- Effective stat values
- Intended playstyle
- Opportunity costs

A part that looks amazing on paper may become mediocre after class multipliers are applied. A barrel that increases damage may only be worth using if another attachment compensates for its recoil penalties. In other words, the value of a part depends heavily on every other part chosen.

This transforms the task from simple stat comparison into a multi-variable optimization problem.

The Models That Performed Poorly

I tested DeepSeek, Gemini, NotebookLM, and Grok. All of them struggled with the problem. Most of their builds appeared to focus on raw stats rather than overall weapon synergy.
They often selected parts with strong individual values but failed to create coherent weapon systems. The resulting weapons looked more like collections of individually strong attachments than carefully designed builds.

Claude Was The Biggest Surprise

Claude was the model that surprised me the most. Given its reputation for strong reasoning abilities, I expected it to perform exceptionally well. Instead, Claude repeatedly refused to assemble complete weapons. Rather than creating a build, it usually identified what it considered the best attachment in each category and left the final assembly to me.

When I pushed it to actually construct a complete weapon, its performance dropped significantly and became much closer to the weaker models. This was unexpected because I had assumed Claude would excel at a system built around trade-offs and optimization. Instead, it seemed more comfortable analyzing individual components than building a complete solution.

The Three Models That Stood Out

The models that performed best were:

- ChatGPT
- Qwen
- Kimi

What makes this interesting is that they all arrived at completely different conclusions.

Kimi’s Build

Kimi produced the most aggressive weapon. Main stats:

- 19.6 → 17.2 damage
- 1035 RPM
- 51-round magazine
- +35.8% health
- -35% movement speed

Kimi appeared to maximize offensive power above everything else. It sacrificed mobility heavily in exchange for absurd fire rate, strong damage output, high survivability, and a large magazine. Its philosophy seemed to be: “I don’t need to move faster if I can kill faster.”

This build looked like a hybrid between an Assault Rifle and a lightweight LMG. Of all the builds, this one appeared to have the highest theoretical DPS.

Qwen’s Build

Qwen took a completely different approach. Main stats:

- 16.7 → 14.5 damage
- 692.6 RPM
- 50-round magazine
- +2.5% movement speed
- 0.8 spread
- 0.1 ADS spread

Unlike Kimi, Qwen focused on consistency.
The build offered:

- Excellent accuracy
- Strong effective range
- Positive mobility
- Low detection radius
- Highly controllable recoil characteristics

A veteran Counter-Strike player I asked to test the builds actually preferred this one. That makes sense because experienced FPS players often value consistency, precision, and movement more than raw damage output. Qwen’s philosophy seemed to be: “If I hit more shots, I don’t need the highest damage.”

ChatGPT’s Build

ChatGPT produced the most balanced build. Main stats:

- 23.5 → 21.1 damage
- 508.7 RPM
- 40-round magazine
- 381 stud maximum range
- +2% movement speed

The build focused on:

- High per-shot damage
- Strong range
- Good reload speed
- Decent mobility

The major downside was spread. Compared to the other two builds, it appeared designed around making every shot count rather than maximizing either DPS or mobility. Its philosophy seemed to be: “Each bullet should have a greater impact.”

The resulting weapon resembled a Battle Rifle more than a traditional Assault Rifle.

What This Experiment Actually Revealed

The most interesting outcome wasn’t which AI was “smartest.” It was that each model appeared to optimize for a completely different definition of what makes a weapon good. Kimi optimized for raw offensive power. Qwen optimized for practical performance and consistency. ChatGPT optimized for balance and efficiency.

In a system built around trade-offs, there may not even be a single objectively correct answer. The challenge is not calculating the stats. The challenge is deciding which stats matter most.

That’s why I think this experiment ended up testing something more interesting than raw reasoning ability. It tested how different AI models interpret optimization problems when the objective function is not explicitly defined. And in a game built entirely around trade-offs, synergies, and competing priorities, that difference becomes surprisingly visible.
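
To make the objective-function point concrete, here is a minimal Python sketch of the idea. It is not the game's real data or formula set: the part names, stat values, class multipliers, and scoring weights are all made up for illustration, and only two slots are modeled. It applies Final Bonus = Base Bonus × Class Multiplier to every part, brute-forces every combination, and shows that the "best" build changes as soon as the weights change.

```python
from itertools import product

# Hypothetical class multipliers: (part_class, core_class) -> effectiveness.
# These numbers are illustrative, not taken from the game's spreadsheet.
CLASS_MULTIPLIER = {
    ("AR", "AR"): 1.0,
    ("Sniper", "AR"): 0.5,
    ("Weird", "AR"): 1.0,
}

# Hypothetical parts: slot -> list of (name, part_class, {stat: base_bonus}).
PARTS = {
    "barrel": [
        ("Heavy Barrel", "Sniper", {"damage": 0.30, "mobility": -0.20}),
        ("Light Barrel", "AR", {"damage": 0.05, "mobility": 0.10}),
    ],
    "stock": [
        ("Braced Stock", "AR", {"accuracy": 0.20, "mobility": -0.10}),
        ("Skeleton Stock", "Weird", {"accuracy": -0.05, "mobility": 0.15}),
    ],
}


def final_bonus(base_bonus, part_class, core_class):
    """Final Bonus = Base Bonus x Class Multiplier."""
    return base_bonus * CLASS_MULTIPLIER[(part_class, core_class)]


def build_stats(build, core_class):
    """Sum the class-adjusted bonuses of every part in a build."""
    totals = {}
    for _name, part_class, bonuses in build:
        for stat, base in bonuses.items():
            adjusted = final_bonus(base, part_class, core_class)
            totals[stat] = totals.get(stat, 0.0) + adjusted
    return totals


def best_build(core_class, weights):
    """Try every one-part-per-slot combination and keep the top scorer."""
    def score(build):
        stats = build_stats(build, core_class)
        return sum(weights.get(stat, 0.0) * value for stat, value in stats.items())

    return max(product(*PARTS.values()), key=score)


def names(build):
    return [name for name, _part_class, _bonuses in build]


# Same parts, same core, two different definitions of "good".
aggressive = best_build("AR", {"damage": 3.0, "accuracy": 1.0})
mobile = best_build("AR", {"accuracy": 1.0, "mobility": 2.0})

print(names(aggressive))  # ['Heavy Barrel', 'Braced Stock']
print(names(mobile))      # ['Light Barrel', 'Skeleton Stock']
```

In this toy setup the Heavy Barrel still wins under damage-heavy weights even after its bonus is halved on an AR core, while mobility-heavy weights pick a completely different pair of parts. Neither answer is wrong; they just answer different questions, which is roughly what the three models seemed to be doing.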
Originally posted by u/John_F_Oliver on r/ArtificialInteligence
