Original Reddit post

Wanted to understand how the core transformer papers actually connect at the concept level: not just "Paper B cites Paper A," but which specific methods, systems, and ideas flow between them. I ran 12 foundational papers (Attention Is All You Need, BERT, GPT-2, GPT-3, Scaling Laws, ViT, LoRA, Chain-of-Thought, FlashAttention, InstructGPT, LLaMA, DPO) through https://github.com/juanceresa/sift-kg, an open-source CLI: point it at a folder of documents plus any LLM and you get a knowledge graph. The result was a 435-entity knowledge graph with 593 relationships for ~$0.72 in API calls (GPT-4o-mini). Interactive graph, runs in the browser: https://juanceresa.github.io/sift-kg/transformers/graph.html. Some interesting structural patterns:

  • GPT-2 is the most connected node, the hub everything flows through: BERT extends it, FlashAttention speeds it up, LoRA compresses it, InstructGPT fine-tunes it with RLHF
  • The graph splits into 9 natural communities. “Human Feedback and Reinforcement Learning” is the largest (24 entities), which tracks with how much of recent progress is RLHF-shaped
  • Chain-of-Thought Prompting bridges the reasoning cluster to the few-shot learning cluster — it’s structurally a connector between two different research threads
  • Common Crawl and BooksCorpus show up as shared infrastructure nodes connecting multiple model lineages

Fully explorable: focus view on any node highlights its connections, and you can traverse with the arrow keys. Enter selects the next node to start a trail!
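The "most connected node" and "shared infrastructure" observations above are standard degree-based graph measures. A minimal sketch of that analysis in plain Python, using a tiny adjacency dict with a few illustrative edges taken from the bullet points (not the actual 435-entity sift-kg output):

```python
# Toy version of the structural analysis: the edge list below is an
# illustrative stand-in for the real graph, not sift-kg's actual output.
from collections import defaultdict

edges = [
    ("GPT-2", "BERT"), ("GPT-2", "FlashAttention"),
    ("GPT-2", "LoRA"), ("GPT-2", "InstructGPT"),
    ("GPT-2", "Common Crawl"), ("GPT-3", "Common Crawl"),
    ("GPT-3", "BooksCorpus"),
    ("Chain-of-Thought", "GPT-3"), ("Chain-of-Thought", "InstructGPT"),
]

# Build an undirected adjacency map: node -> set of neighbors.
adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# "Most connected node" = highest degree (number of neighbors).
hub = max(adjacency, key=lambda n: len(adjacency[n]))
print(hub, len(adjacency[hub]))  # GPT-2 5
```

On this toy graph, GPT-2 comes out as the hub, mirroring the first bullet; the community split and bridge detection in the real graph would additionally need modularity-based clustering and betweenness centrality (e.g. via networkx), which the sift-kg output appears to expose directly in the browser view.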

Originally posted by u/garagebandj on r/ArtificialInteligence