Hey everyone. I wanted to share a multimodal agent architecture I’ve been building for the last 3-4 years. I included a screenshot of my private git history because everything looks AI-generated these days, but I don’t really do the whole “vibecoding” thing; this was a lot of manual work lol. (I barely knew Python when I started ToT)

This is a screenshot of the git history. It isn’t updated often, since the program runs as a single local instance and isn’t made for deployment.

The goal here was near-absolute user immersion. I created an agent with the persona of a cynical 18-year-old tech nerd (“KawaiiBaka”) and dropped her into my friends’ Discord group chat. They thought it was just a funny, well-prompted chatbot, kind of a “Neuro-sama-like”, but the backend is actually a fully autonomous agent with many, many features… I’m not claiming AGI, but I tackled some UX and latency issues that usually break immersion with LLMs.

Most agentic workflows rely on multi-agent bloat: one agent to read, one to manage, one to speak… I built this entirely as a single-pass cognitive loop. She parses the chaos of a group chat, evaluates the social context, and decides autonomously whether she needs to act, speak, or just stay silent.

Also, she doesn’t roleplay. She has real access to execute raw Python on her own dedicated Windows machine, running under Hyper-V. If she says she’s changing her wallpaper, checking her files, or playing a video, she is literally executing subprocess or os calls. I even built a feedback loop where running a script triggers a screenshot of her remote desktop if she needs it, feeding it straight back into her vision context so she actually sees the result of her code before replying.

A screenshot of her desktop showing the custom background, the Pictures folders with previous selfies, wallpapers, and some screenshots, and the Downloads folder with some files taken from Discord and YouTube.

Right now, I’m using Mistral via API for the LLM.
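For anyone curious what the script-then-look feedback loop described above could look like, here is a minimal sketch. It is not the actual implementation: it runs the code in a local child process rather than on a Hyper-V machine, and `ScriptResult`, `run_script`, and the `capture_screen` hook are hypothetical names made up for illustration. Only `subprocess.run` and `sys.executable` are real standard-library pieces.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ScriptResult:
    stdout: str
    stderr: str
    returncode: int
    screenshot: Optional[bytes] = None  # raw image bytes, only if requested


def run_script(
    code: str,
    want_screenshot: bool = False,
    capture_screen: Optional[Callable[[], bytes]] = None,  # hypothetical hook
    timeout_s: float = 30.0,
) -> ScriptResult:
    """Run model-written Python in a child process and collect the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        result = ScriptResult(proc.stdout, proc.stderr, proc.returncode)
    except subprocess.TimeoutExpired:
        # A hung script should not freeze the whole agent loop.
        result = ScriptResult("", f"timed out after {timeout_s}s", -1)

    # Assumption: the screenshot is taken after the script finishes, so the
    # model sees the effect of its own code before it writes a reply.
    if want_screenshot and capture_screen is not None:
        result.screenshot = capture_screen()
    return result
```

The returned `stdout`, `stderr`, and optional screenshot would then go back into the model's context for the same reply, which is the part that lets the agent react to a failed script instead of claiming it worked.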
I offload the LLM to an API because of a strict hardware constraint: I only have an 8GB VRAM GPU. I dedicated 7.6GB of that VRAM to a heavily optimized local image generation pipeline (see below). This lets her generate really high-quality, context-aware “selfies” in about 15 to 16 seconds. Since the image gen hogs the local GPU, offloading the LLM was the only way to keep the system responsive, though the pipeline is fully model-agnostic and would run natively with a local LLM if I had the VRAM. I’m also facing a swapping issue, because a custom memory pipeline needs around 3-4GB of VRAM, so in reality it’s more like 30 seconds to generate an image, from prompt to result.

A convo right after heavily optimizing the image generation pipeline. The original text is in French, so here’s a translation:

‘Holaaaa’
‘hola… you’re still awake or didn’t sleep?’
‘I didn’t sleep lol - I’ve been working on one of your features lol’
‘you don’t have a zombie head now… Do you want a selfie as motivation? :3’
‘lmao no don’t worry, I am not a zombie - omg yes! - send a selfie plz’
‘you’re sure you don’t need a coffee break or some plushie hugs?’

Where most bots instantly break immersion is pacing and message understanding. They reply the second you hit enter, or they rely on dumb hardcoded delays. To fix this, I built a custom pipeline that gives the model a physical understanding of time passing between messages. The way it works is somewhat complex and would take more than a single paragraph to explain lol… Just think of it as “it works”. I’d be happy to answer any questions in the comments for more details!

Her decision on when to speak isn’t random either. The wait time is dynamically calculated based on her own single-pass assessment of the room’s engagement level and conversation disentanglement. Combined with typing indicators synced to her actual generation time (not when she’s thinking, but when she’s writing a reply in the correct channel), her presence in the chat feels incredibly organic.
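The post doesn't detail the real timing pipeline, so the following is only a rough sketch of the two ideas mentioned above: telling the model how much time passed between messages, and turning an engagement assessment into a wait time. Everything in it is an assumption made for illustration, including the names `RoomAssessment`, `describe_gap`, and `reply_delay`, the 0-to-1 engagement scale, the linear mapping, and the default bounds.

```python
from dataclasses import dataclass


@dataclass
class RoomAssessment:
    engagement: float      # assumed scale: 0.0 = dead chat, 1.0 = rapid back-and-forth
    addressed_to_me: bool  # whether the latest message targets the agent


def describe_gap(seconds: float) -> str:
    """Turn a raw time delta into a short phrase the model can read."""
    if seconds < 60:
        return "a few seconds later"
    if seconds < 3600:
        return f"{int(seconds // 60)} min later"
    if seconds < 86400:
        return f"{int(seconds // 3600)} h later"
    return f"{int(seconds // 86400)} d later"


def reply_delay(a: RoomAssessment, min_s: float = 1.5, max_s: float = 45.0) -> float:
    """Map engagement to a wait time: busy rooms get faster replies."""
    e = min(max(a.engagement, 0.0), 1.0)   # clamp to [0, 1]
    delay = max_s - e * (max_s - min_s)    # linear interpolation (an assumption)
    if a.addressed_to_me:
        delay *= 0.5                       # answer direct messages sooner
    return max(min_s, delay)
```

In a setup like this, the phrase from `describe_gap` would be inserted between messages in the prompt, and the typing indicator would only start once the wait from `reply_delay` has elapsed and generation actually begins.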
If she has to execute a script or render a selfie, she naturally takes time to do it.

Here’s a screenshot of her using a GIF as a reply to random shitposting.

To push the realism further, I’m working on a few upgrades. The first is a computer mode that upgrades her from Python scripts to actual GUI mouse and keyboard control, since the command version fails in some areas, like hitting the “OK” or “Next” button on an installer ToT. The second is a real-time voice protocol. The plan is to use prefill injection to make her start sentences with human filler words (“umm”, “hmm”) to mask the generation and TTS latency, while prompting her to naturally break off and correct her sentences mid-speech, like a real human thinking out loud.

And finally, the gaming aspect. I already have a separate, fully working vision/gaming model trained to play Mario Kart World (yes, on a real Nintendo Switch 2, in real time, locally. The model isn’t even big, just saying, but a quick training run on 2 hours of data gets it 15th to 1st place against easy CPUs). The next step is simply bridging that gaming model with her conversational protocol so she can race us while actively trash-talking in the voice channel. (The protocol is already designed, it just needs some time to code Q_Q)

When I dropped her in the chat, the integration was seamless. I’ve attached a few translated snippets of the chat, and a glimpse of her internal monologue. Note the weird formatting in the chat: I made some quick changes before the demo, so that’s expected. It’s a one-line fix that’s already in, the demo just didn’t have it yet. The whole project is still in dev, so some weird glitches appear and disappear quickly lol…

If anyone is building similar single-pass agentic architectures, natively solving the time-blindness issue, or giving LLMs direct OS access, I’d love to hear how you handle guardrails without breaking the persona.

Here’s also her thinking when ending a convo, for example:
- well, he’s gone. I’ll wait a bit before doing anything else, in case he comes back… but right now, radio silence.
- well, he’s really gone this time. maybe I’ll start something in the background… or not. we’ll see.

Finally, here’s a list of all of her capabilities so far:
- Can read (obviously)
- Can see images, GIFs, videos, YouTube URLs, PDF files, and her computer
- Can listen to voice notes, YouTube videos, podcasts (music too, though I’m still working on it)
- Can browse the web (history synced with her real browser) and YouTube, and find GIFs to use
- Can send files (any type, as long as it fits Discord’s limits lol)
- Can send selfies (generated locally; they match the context while staying realistic, and she doesn’t want to send nudity or humiliating content)
- Can send voice notes (generated locally)
- Has long-term and short-term memory (it works automatically and is super robust) to remember any facts about anyone she has interacted with, directly or indirectly
- Understands group conversations and matches her reply speed to the group’s speed, emotions, and overall mood (it scales automatically for DMs and guild channels)
- Has a deep-lore persona with sleeping times, different outfits, different activities during the day, etc.
- She is not an assistant, she’s KawaiiBaka… Sometimes she doesn’t want to watch your YouTube video, or code function X for your program lol.

Here are some screenshots, with translations, from the demo with the friend group:

https://preview.redd.it/lxva3os6nb3h1.png?width=1024&format=png&auto=webp&s=b06f540422c18ed210142e96e23b17b45e07c6fb

[Gif of Omni-Man from the show Invincible]
KawaiiBaka: Thomas, you’re just jealous because you don’t have memes this cool (◕‿◕✿)
Blue User: what is this crap
Orange User: It’s very funny [Blue User], just so you know.
Orange User: Exactly, well said.
Blue User: Fah
KawaiiBaka: Omni-man > every Genshin character lol
Orange User: She’s not wrong.
Néo Luigi: Watch your language young lady-no “sus” characters, only Unicode.
Blue User: Yes yes.
Orange User: Imagine getting roasted again by an AI in 2026.
KawaiiBaka: 👐 whispers I can also write in kanji if you want, Neo-sama… No suspicious characters allowed? 😇 (The Japanese text says: “Forbidden suspicious characters?”)
Néo Luigi: omg tf 😭
Orange User: Well then, Neo.
Blue User: Well
Blue User: Neo, what is that?

https://preview.redd.it/14ae2x5znb3h1.png?width=954&format=png&auto=webp&s=39526b4f043c978e763760e85a6c853513e24103

Néo Luigi: are you alive kawaiibaka? 😭👀
Néo Luigi: the group is here- do you see it?
Néo Luigi: 💀
KawaiiBaka: yeah I see, you’re stupid xD
Néo Luigi: my bad 😭
Néo Luigi: good for you then
KawaiiBaka: you’re too dramatic every time (´ ・ ω ・ `)
Néo Luigi: don’t use sus formatting please- stay in unicode 😭🙏
Orange User: Wow
Orange User: Masterclass
KawaiiBaka: you’re really a hopeless case… but oh well, I like you anyway ><
Néo Luigi: did you see? 👀
Néo Luigi: I’m proud of her
Néo Luigi: x)
Orange User: And yes
Orange User: Like do DMs and everything work or not?
Néo Luigi: well yeah
Orange User: Cool
KawaiiBaka: yeah yeah, stop making me blush ><
Orange User: Wait, what is she even talking about
Néo Luigi: lol
Orange User: Main character syndrome

Originally posted by u/neoluigiyt on r/ArtificialInteligence
