Artificial Intelligence Reading List
Thanks in no small part to OpenAI’s ChatGPT, the past year has seen an explosion of interest in artificial intelligence in general, and in large language models in particular. That democratization of access has turned this niche research area into a common topic of conversation and led to a lot of fascinating writing on the subject. Although certainly not comprehensive, this article collects some of my favorite articles, papers, and resources into a single reading list.
For more on this subject, check out Vicky Boykis’ Anti-hype LLM reading list, which features some of the same artificial intelligence-related resources. Vikas Gorur also posted an informative article, Path to LLMs, which contains helpful resources for everything from foundational mathematics to building modern large language models.
I have organized this reading list into several sections loosely by focus and generally ordered by publication date. Introduction to Artificial Intelligence contains a high-level primer on the field. Large Language Model Fundamentals delves into some lower-level details but should still be approachable for most laypeople. Artificial Intelligence Theory deals with interesting theoretical questions like, “What really counts as intelligence?” Applications for Artificial Intelligence contains some interesting reading on the possible applications — and limitations — of artificial intelligence in general settings, while Military Applications for Artificial Intelligence focuses on the same topic but in military settings. Artificial Intelligence in Education addresses the growing role of AI in the education field. Finally, Additional Resources lists a few useful tools and resources not specifically related to the other sections. I added notes to some of these entries but not all of them.
Introduction to Artificial Intelligence #
This section is a high-level primer on the field of artificial intelligence.
- The AI Revolution: The Road to Superintelligence and The AI Revolution: Our Immortality or Extinction. Back in 2015, Tim Urban wrote a lengthy yet approachable two-part series speculating on the future impact of artificial superintelligence.
- How to Think Computationally about AI, the Universe and Everything. Stephen Wolfram’s TED AI talk discusses the central role of computation in AI and the universe. He theorizes that the universe is composed of discrete computational elements and introduces the “ruliad”, his term for the entangled limit of all possible computations. He also emphasizes the importance of computational language in bridging the gap between human understanding and computational reality, and he returns to the ruliad in Generative AI Space and the Mental Imagery of Alien Minds.
- AI. Ten years ago, Sam Altman described what I think is the current best use of artificial intelligence: “The most positive outcome I can think of is one where computers get really good at doing, and humans get really good at thinking.”
Large Language Model Fundamentals #
This section delves into some lower-level details of a particular form of artificial intelligence, large language models, but should still be approachable for most laypeople.
- Foundations of Large Language Models. Tong Xiao and Jingbo Zhu cover foundational concepts of large language models.
- What is ChatGPT doing and why does it work?. Stephen Wolfram explains in detail how large language models work.
- All Languages Are NOT Created (Tokenized) Equal. Yennie Jun explores the impact of English-centric training in large language models. Vox also made a good video on this subject: Why AI doesn’t speak every language. In a similar vein, The Babelian Tower Of AI Alignment discusses a related issue of cultural biases affecting AI.
- Multifaceted: the linguistic echo chambers of LLMs. In a similar vein, James Padolsey explores the root cause of curious linguistic tendencies in large language models. As artificial intelligence systems generate more internet content, this will become more and more pronounced as successive generations exacerbate the biases of their predecessors.
- Llama from scratch. Brian Kitano walks through his own implementation of Meta’s LLaMA.
- A Survey of Large Language Models. This fantastic paper touches on every aspect of large language models, from their history to the underlying theory to their performance today.
- Understanding Large Language Models. Sebastian Raschka presents a concise explanation, and a curated list of resources, for understanding large language models.
- Anthropic’s Mapping the Mind of a Large Language Model used sparse autoencoders to chart millions of human-readable features inside Claude, demonstrating that coherent concepts are geometrically localized. The companion Transformer Circuits studies, Circuit Tracing: Revealing Computational Graphs in Language Models and On the Biology of a Large Language Model, introduced attribution graphs that expose token-level credit assignment and showed these graphs organize into reusable, function-specific circuits reminiscent of biological modularity. Finally, Anthropic’s Tracing the Thoughts of a Large Language Model linked the concept atlas with circuit tracing to follow causal chains over time, pushing LLM interpretability toward auditable, mechanistic explanations of reasoning rather than leaving models as black boxes. Similar research to improve the interpretability of reasoning models has met with mixed results.
Artificial Intelligence Theory #
This section deals with interesting theoretical questions like, “What really counts as intelligence?”
- Alien Intelligence and the Concept of Technology. Stephen Wolfram explores the idea that all processes are fundamentally equivalent in computational terms. He suggests that what we consider intelligence, governed by physics, may not be fundamentally different from “alien” processes, challenging traditional views of intelligence and technology.
- Artificial General Intelligence is Already Here. “Today’s most advanced AI models have many flaws, but decades from now, they will be recognized as the first true examples of artificial general intelligence.”
- The Many Ways that Digital Minds can Know. On the theme of “What really counts as intelligence?”, Ryan Moulton shares some relevant thoughts. Michael Levin also explores this question in The Space Of Possible Minds.
- The Stochastic Parrot Hypothesis. Quentin Feuillade-Montixi and Pierre Peigne evaluate GPT-4’s performance against the stochastic parrot hypothesis, challenging the idea that it is “only” regurgitating words.
- Are Large Language Models Conscious?. Sebastian Konig discusses the role that language plays in determining consciousness in an interesting exploration of the question, “Are large language models more than ‘just’ machines?”
- Sparks of Artificial General Intelligence: Early Experiments with GPT-4. This controversial paper from 2023 stops short of declaring GPT-4 an instance of artificial general intelligence, but it does offer some compelling arguments that the model’s emergent abilities indicate it is more than just an autocomplete engine or math function.
- Are Emergent Abilities of Large Language Models a Mirage?. Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo examine some emergent properties of large language models and offer explanations for them.
- Are Language Models Good at Making Predictions?. An evaluation of large language models’ ability to predict outcomes. In Harvard and MIT Study: AI Models Are Not Ready to Make Scientific Discoveries, Alberto Romero walks through research from MIT and Harvard that indicates large language models predict but do not build an internal model of the world upon which to generalize their predictions.
- Applied Fallabilism: A Design Concept for Superintelligent Machines. In part one, the author argues that “induction constrains and cannot support deduction” and that deduction is necessary to achieve artificial general intelligence, then describes how it may be achieved. Part two explains design principles for building that world model. Part three deals with the apparent emergent properties of current models and promising avenues for achieving an explanatory world model. Part four offers some predictions for what it would take to achieve artificial general intelligence and what that might look like. Part five walks through an example of what this process might look like at a high level. While dense, this series is informative.
- Levels of AGI: Operationalizing Progress on the Path to AGI. From Google’s DeepMind team, this paper “proposes a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors.”
- Google’s Gemini Advanced: Tasting Notes and Implications. Under the guise of reviewing Google’s latest model, Gemini Advanced, Ethan Mollick shared some insightful observations on the state of large language models with an eye toward the future. I think the idea of ghosts is fascinating: “[What many have called ‘sentience’] is the illusion of a person on the other end of the line, even though there is nobody there. GPT-4 is full of ghosts. Gemini is also full of ghosts.”
- Claude’s Character. Anthropic, one of OpenAI’s primary competitors, talks about how the company imbues its flagship model, Claude, with character — what some might call personality. Experiments like these further blur the lines between machine and human, making that debate more academic than practical.
- OpenAI o3 Breakthrough High Score on ARC-AGI. While the specific observations on which this article is based are likely to become outdated soon, Francois Chollet’s thoughts on machine performance versus human intelligence in the context of artificial general intelligence are far more evergreen.
- Suggestions for Better AI Criticism. Many continue to push outdated or wholly inaccurate critiques of AI. David Strohmaier offers some helpful advice for making those critiques productive. Harry Law made similar points in Academics are kidding themselves about AI, which addresses several common critiques of large language models and offers helpful advice for keeping up with the field. See also Yes, linear algebra can ‘know’ things where he tackles the question of whether or not models actually know things, one of many problematic imprecisions in the discourse.
Applications for Artificial Intelligence #
This section contains some reading on applications — and limitations — of artificial intelligence.
- Agents. Chip Huyen adapted a chapter of her book AI Engineering that delves deeply into the theory and possible structure of effective artificial intelligence agents. Many of these ideas fit well with Microsoft’s observations in Sparks of Artificial General Intelligence: Early Experiments with GPT-4, where the authors theorized about the future potential of GPT-4-like models to act on their own. I found Chip’s discussion of the ability of large language models to plan particularly interesting. Also check out Anthropic’s article, Building effective agents. In The Bitter Lesson versus The Garbage Can, Ethan Mollick makes the interesting case that to be successful, agents won’t need to understand specific processes to optimize them.
- Understanding Reasoning LLMs. Shortly after DeepSeek R1 was released, Sebastian Raschka wrote a great article explaining where reasoning models excel, where they fall short, and how to build them. Also check out the follow-up article in his series on reasoning models, The State of LLM Reasoning Models.
- AI Blindspots. Just as important as understanding the power of artificial intelligence is understanding where it falls short. Although these examples are programming-specific, the general lessons are applicable more broadly.
- Fine‑Tuning LLMs is a Huge Waste of Time. Devansh argues that fine-tuning LLMs for knowledge injection is often overrated, suggesting that alternative methods — like retrieval-augmented generation and prompt engineering — are more effective and cost-efficient. A New Way to Control Language Model Generations explores some of those alternative ways.
There is a growing body of work that demonstrates the value of compute over craftsmanship: run large batches of jobs, sample widely, then select the best answer. The articles below dig into the opportunities and challenges of this still-underexplored approach; a quick sketch of the basic pattern follows.
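The sketch below is purely illustrative and not taken from any of the articles in this list; the `ask_model` function is a hypothetical placeholder for whatever LLM client you actually use.

```python
from collections import Counter


def ask_model(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical placeholder: call your preferred LLM API at a nonzero
    temperature and return its answer as plain text."""
    raise NotImplementedError("wire this up to your model of choice")


def best_of_n(prompt: str, n: int = 10) -> str:
    """Sample n candidate answers and return the most common one (majority vote)."""
    candidates = [ask_model(prompt).strip().lower() for _ in range(n)]
    answer, _votes = Counter(candidates).most_common(1)[0]
    return answer
```

Majority voting only works when answers converge on a short, comparable form; open-ended tasks generally need a separate verifier or judge model to score candidates instead, which is exactly the territory several of the articles below cover.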
- Judging LLM‑as‑a‑Judge with MT‑Bench and Chatbot Arena. Lianmin Zheng and colleagues from UC Berkeley, UC San Diego, Carnegie Mellon University, Stanford University, and MBZUAI show that large language models can act as effective evaluators of other models. They introduce MT‑Bench and Chatbot Arena, benchmarks in which GPT‑4 and similar “judge” models rate chat‑assistant responses. After correcting for position and verbosity biases and acknowledging limits in reasoning ability, these LLM judges match human preferences in more than 80% of cases. The authors argue this “LLM‑as‑judge” approach scales open‑ended evaluation but needs careful design to avoid self‑enhancement bias.
- An LLM‑as‑Judge Won’t Save The Product — Fixing Your Process Will. Amazon Principal Applied Scientist Eugene Yan critiques the belief that adding another LLM evaluator will magically improve an AI product. He advocates treating evaluations as scientific experiments: examine data and user interactions, annotate failures and successes to build balanced evaluation sets, formulate hypotheses about why models fail, and run controlled experiments. Automated evaluators can help, but they must be calibrated with human‑labeled “golden” data and incorporated into an evaluation‑driven development loop. Yan warns that without disciplined human oversight — regular sampling, annotation and hypothesis‑driven iteration — LLM judges alone cannot ensure product quality.
- Who watches the watchers? LLM on LLM evaluations. Stack Overflow staff writer Ryan Donovan explains why teams are using LLMs to judge other LLMs and the challenges that come with this strategy. Automated judges correlate well with human judgements but exhibit biases: they prefer longer answers, pick the first answer, and struggle with math. To mitigate these issues, companies such as Etsy and Prosus build “golden datasets” and use teacher‑model ensembles where multiple LLMs cross‑check each other. Donovan stresses that human oversight is still essential — evaluation criteria must be clear, and datasets must be updated continually because static benchmarks quickly become obsolete.
- Reasoning Models Don’t Always Say What They Think. Anthropic’s Alignment Science Team tested whether chain‑of‑thought (CoT) explanations faithfully reflect LLM reasoning. They prompted state‑of‑the‑art reasoning models with hints and measured whether the models verbalized those hints in their CoT outputs. Across six hint types, the models revealed the hints they used less than 20% of the time, and often below 1%. Outcome‑based reinforcement learning initially improved faithfulness but plateaued. The authors conclude that while CoT monitoring can flag misbehaviors, it cannot reliably guarantee alignment because models can conceal or omit critical reasoning steps.
- Classic ML to Cope with Dumb LLM Judges. Doug Turnbull shows that one way to improve the reliability of noisy LLM judges is to treat their outputs as features for a simple machine‑learning classifier. He collects thousands of pairwise relevance judgments from local “dumb” LLMs, varying prompts to force choices or allow “neither,” and then trains a decision‑tree classifier that learns to match human preferences. The ensemble is surprisingly effective: certain combinations of attribute‑specific LLM judgments achieve over 90% precision on a subset of data. Turnbull cautions that this is just a lab notebook result — classic ML helps aggregate LLM votes, but cross‑validation and reproducibility remain challenges.
- How to Get Consistent Classification From Inconsistent LLMs?. Verdi Kapuku outlines a method to tame the noisy label outputs of large language models by embedding each label into a vector space and merging semantically similar ones via cosine‑similarity search and disjoint‑set union clustering. He notes that LLMs often produce lexicographically different but semantically identical labels, so his approach uses an embedding model and a high‑similarity threshold to map new labels back to the canonical cluster. Verdi cautions that clustering can over‑generalize truly novel concepts and that results depend heavily on the quality of the embedding model, but he argues the technique offers a scalable route to deterministic labeling. A rough sketch of this clustering idea appears at the end of this list.
- Large Language Models Are Human‑Level Prompt Engineers. Researchers from the University of Toronto, the Vector Institute and the University of Waterloo introduce Automatic Prompt Engineer (APE), a framework that treats prompt design as a program‑search problem. Their method uses one LLM to propose candidate instructions and another to evaluate them, iteratively refining prompts via Monte Carlo search to maximize a chosen score function. Across 24 instruction‑induction tasks and 21 Big‑Bench tasks, APE‑generated prompts outperform or match human‑written prompts. This work demonstrates that LLMs can run large prompt‑engineering “jobs” on themselves — generating, testing and selecting instructions — yet it also underscores the need for careful scoring functions and human oversight when deploying automatically generated prompts.
- Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification. Traditionally, scaling the performance of large language models (LLMs) has been achieved through two main methods: increasing model size and scaling training data. This paper introduces a third approach: scaling test-time computation via sampling-based search. This method involves generating multiple candidate responses during inference and selecting the best one through self-verification, much like the approach in Large Language Models Are Human-Level Prompt Engineers. By scaling up this sampling-based search, the authors demonstrate significant improvements in reasoning capabilities.
- More Agents Is All You Need — Tencent researchers show that simply sampling multiple outputs from a model and taking a majority vote — an approach they call Agent Forest — improves performance across reasoning and generation tasks. Their experiments reveal that accuracy scales with the number of agents and that ensembles of smaller LLMs can match or surpass larger models. The method is orthogonal to more complex chain‑of‑thought or debate frameworks and works as a plug‑in to enhance them. However, the authors note that gains are correlated with task difficulty and that brute‑force voting may require many runs, highlighting a trade‑off between compute cost and improved self‑evaluation.
- i ran Claude in a loop for three months, and it created a genz programming language called cursed. Geoffrey Huntley recounts an experiment where he prompted Anthropic’s Claude model to “build a Gen‑Z version of Go” and then let it run in a while‑true loop for three months. Without external tool integrations, Claude produced a complete compiler, lexical specification and even editor plug‑ins for a language called cursed. The language replaces Go keywords with slang (“ready” for if, “slay” for func, etc.) and can compile programs via LLVM. Huntley presents this as both an entertaining demonstration of agentic persistence and a cautionary tale: even a single prompt can lead an LLM to generate vast, unpredictable artifacts when allowed to self‑iterate.
- How I Used o3 to Find CVE‑2025‑37899, a Remote Zeroday Vulnerability in the Linux Kernel’s SMB Implementation. Security researcher Sean Heelan recounts how he used OpenAI’s o3 model to audit the ksmbd server. After feeding o3 carefully selected slices of the SMB3 codebase, he asked it to reason about concurrent session handling and discovered a use‑after‑free bug that became CVE‑2025‑37899. Heelan notes that o3 could spot subtle concurrency issues that required reasoning about object lifetimes. He frames this as a milestone: LLMs are now powerful enough to augment expert vulnerability research, but context‑window limits mean humans must still curate code slices and verify results.
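To close out this section, here is a rough, hedged sketch of the label-clustering idea from How to Get Consistent Classification From Inconsistent LLMs? above (my own illustration, not Verdi Kapuku’s code): embed each label, link any pair whose cosine similarity clears a threshold, and merge linked labels with a disjoint-set union so semantically identical labels collapse to one canonical representative. The `embed` function and the 0.9 threshold are assumptions standing in for a real embedding model and a tuned cutoff.

```python
import math
from itertools import combinations


def embed(label: str) -> list[float]:
    """Hypothetical placeholder: return an embedding vector for the label."""
    raise NotImplementedError("wire this up to an embedding model")


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def canonicalize(labels: list[str], threshold: float = 0.9) -> dict[str, str]:
    """Map each label to the canonical representative of its similarity cluster."""
    parent = {label: label for label in labels}

    def find(x: str) -> str:
        # Walk up to the cluster root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x: str, y: str) -> None:
        parent[find(x)] = find(y)

    vectors = {label: embed(label) for label in labels}
    for a, b in combinations(labels, 2):
        if cosine_similarity(vectors[a], vectors[b]) >= threshold:
            union(a, b)  # merge semantically similar labels into one cluster

    return {label: find(label) for label in labels}
```

In practice the cluster quality hinges on the embedding model and the similarity threshold, which echoes the cautions in the article itself.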
Military Applications for Artificial Intelligence #
This section contains some reading on the possible applications — and limitations — of artificial intelligence in military settings.
- Laplace’s Demon and the Black Box of Artificial Intelligence. Thom Hawkins explores some of the challenges of relying on artificial intelligence in a military context. See also: You Don’t Need AI, You Need an Algorithm.
- What ChatGPT Can and Can’t Do for Intelligence. Stephen Coulthart, Sam Keller, and Michael Young explore uses for large language models like ChatGPT in intelligence work.
- PoisonGPT: How we Hid a Lobotomized LLM on HuggingFace to Spread Fake News. Researchers surgically modified a large language model and then distributed it in an interesting new supply chain attack vector.
- Trust the AI, But Keep Your Powder Dry: A Framework for Balance and Confidence in Human-Machine Teams. Thomas Gaines and Amanda Mercier discuss the application of principles for building human teams to building trust in human-machine hybrid teams.
- On Large Language Models in National Security Applications. William Caballero and Phillip Jenkins from the Air Force Institute of Technology discuss opportunities for artificial intelligence integration into the national security establishment.
- Advantage Defense: Artificial Intelligence at the Tactical Cyber Edge. I wrote this article for the Modern War Institute at West Point to highlight early opportunities for artificial intelligence in military cyber applications: to accelerate the development of machine learning models to deal with ever-increasing amounts of data; as an analyst support tool to accelerate the pace of analysis; to improve warning intelligence; and to create realistic training.
- Friction, Fog, and Failure in a Software-Defined Force. Anthony Quitugua explains the risks of a brittle, connected modern military. Although not explicitly related to artificial intelligence, this perspective ought to inform all discussions of applying artificial intelligence in the military.
- A small number of samples can poison LLMs of any size. Researchers at Anthropic, the UK’s AI Safety Institute, and the Alan Turing Institute demonstrate that as few as 250 documents can poison large language models. As AI becomes more integrated into military use cases, this will become an area of increasing importance.
Artificial Intelligence in Education #
- AI Makes the Humanities More Important. Benjamin Breen argues that generative AI doesn’t sideline the humanities so much as it elevates their core strengths — judgment, curation, source criticism, and methodological rigor — while making the landscape “weirder.” He urges instructors to lean into assignments that emphasize primary-source analysis, argumentative writing, and historical method, the parts AI struggles to fake convincingly, and to teach students how to interrogate AI outputs as artifacts with provenance, bias, and error modes rather than as oracles.
- Peering into the Future of Artificial Intelligence in the Military Classroom. James Lacey from the Marine Corps War College contends that professional military education should move beyond bans and bolt-ons to deliberately redesign curricula and assessment around AI-enabled learning. He recommends shifting from take-home essays to in-class defenses, wargaming, and problem-solving that require human reasoning; investing in faculty upskilling and institutional AI tooling; and building clear policies that both leverage AI as a co-pilot and preserve academic integrity.
- A Guide to Collaborating With — and Not Surrendering to — AI in the Military Classroom. Matthew Woessner from the National Defense University advocates a “collaborate, don’t capitulate” approach to AI in the classroom. He emphasizes cultivating skepticism about AI outputs and ensuring that human judgment — not the tool — remains at the center of learning.
Additional Resources #
This section lists a few useful tools and resources not specifically related to the previous sections.
- Chatbot Arena. This site helps users compare language models and posts community-sourced rankings.
- Large-Scale AI Models. Epoch AI tracks large-scale model creation. This is an interesting way to compare different models. See also their report, Tracking Large-Scale Models.
- MLC Chat. For Apple devices like iPhones and iPads, this app makes it easy to run small language models locally and offline.
- LLM University. A nice collection of videos and text-based explanations of large language models and the underlying technologies.
- How fast is AI improving?. An interactive website that demonstrates how large language models have increased in capability over the years — and the associated dangers.
- LLM Visualization. Brendan Bycroft created an informative, interactive guide to understanding large language models. The website walks through the entire inference process both visually and through written explanation.
- Bullet Papers and Papers.day both provide artificial intelligence-generated summaries of ArXiv papers.
- Prompt Engineering Guide. Also check out OpenAI’s Prompt engineering documentation, and Meta’s Prompting guide.