Proofreading showdown: AI vs Humans
The last step in preparing a manuscript for a book like Inference Engineering is proofreading. Proofreading is an editorial phase that focuses on catching objective errors and inconsistencies across spelling, grammar, usage, capitalization, numbering, formatting, and so forth.
This seems like the sort of thing that LLMs should be exceptionally good at.
Inference Engineering was proofread in three steps:
- By a custom-built agentic system using multiple models and scripts.
- By a professional editor with 30 years of experience.
- By me, the author, in an 8-hour read-aloud session.
This was a cascading process. The professional editor only saw the manuscript after the AI-identified errors were fixed, and I only saw the manuscript after the professional editor's fixes were applied. This means that the AI had the advantage of taking the first shot at any low-hanging fruit.
The best agentic system I could design in December 2025 caught less than 15% of proofreading errors.
I was shocked by how many errors the AI system missed. I would have guessed that the agentic proofreader would have caught at least 80% of the errors.
Either I designed a bad harness (very plausible), or proofreading, especially long-context proofreading, is an economically valuable, non-saturated benchmark worth developing for AI model evaluation.
Agentic Proofreader Design
The biggest limitation of LLMs for proofreading is their context window. Rather than dumping an entire 250-page book into a chatbox, I split the book into chunks of a few thousand tokens each, along with a prompt telling the model that it is a professional editor proofreading a manuscript.
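The chunking step can be sketched as follows. This is a minimal illustration, not the actual pipeline: whitespace words stand in as a crude proxy for model tokens, and the chunk size, overlap, and prompt wording are all assumptions.

```python
# Split a manuscript into overlapping chunks of roughly `max_tokens`
# "tokens", using whitespace-separated words as a crude stand-in for
# real tokenizer tokens. Sizes here are illustrative.

def chunk_manuscript(text: str, max_tokens: int = 3000, overlap: int = 200) -> list[str]:
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# Hypothetical prompt; the real wording is not reproduced here.
PROMPT = (
    "You are a professional editor proofreading a manuscript. "
    "List every objective error (spelling, grammar, usage, capitalization, "
    "numbering, formatting) in the following excerpt:\n\n{chunk}"
)
```

The overlap between adjacent chunks is there so that an error straddling a chunk boundary is still seen whole by at least one call.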
I ran these prompts through three different models: the latest Claude, GPT, and Composer models at the time. Each model created a markdown file whenever it discovered an error in a chunk. Unfortunately, I don't have data on which models caught which errors, but I remember that all three caught most of the same errors, and each one caught maybe one or two errors that the others missed.
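The fan-out across models might look like the sketch below. Everything here is illustrative: `proofread_chunk`, the prompt wording, and the file-naming scheme are hypothetical, and `models` maps a model name to any callable (in practice, a wrapper around each provider's API) so the example avoids inventing real SDK calls.

```python
import pathlib

# Send one chunk to several proofreading models and write one markdown
# report per (chunk, model) pair that surfaced errors. Each value in
# `models` is a callable taking a prompt string and returning the
# model's findings, or "NO ERRORS" / "" if the chunk looks clean.

def proofread_chunk(chunk_id: int, chunk: str, models: dict, out_dir: str = "findings") -> list[str]:
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    prompt = ("You are a professional editor proofreading a manuscript. "
              "List every objective error in this excerpt, or reply with "
              "NO ERRORS if it is clean:\n\n" + chunk)
    written = []
    for name, call in models.items():
        findings = call(prompt).strip()
        if findings and findings != "NO ERRORS":
            path = out / f"chunk{chunk_id:04d}-{name}.md"
            path.write_text(f"# Chunk {chunk_id} ({name})\n\n{findings}\n")
            written.append(str(path))
    return written
```

Writing one file per model makes it easy to see afterward which model flagged what, and to diff the three models' findings for the same chunk.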
I also used a long-context Gemini model to review the manuscript as a whole in search of issues that would span multiple chunks, like inconsistent capitalization or mistakes in figure numbering sequences.
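Some of these global checks are also amenable to plain scripts rather than a model. As one illustrative example (the "Figure N" convention and regex are assumptions, not the book's actual tooling), a sketch that flags figure numbers whose first mentions appear out of sequence:

```python
import re

# Scan a manuscript for references like "Figure 3" and flag gaps in the
# order of first mentions (expected: Figure 1, Figure 2, ... in sequence).
# The "Figure N" pattern is an assumed convention; adapt the regex to the
# manuscript's actual style.

def check_figure_numbers(text: str) -> list[str]:
    first_mentions: list[int] = []
    for n in (int(m) for m in re.findall(r"\bFigure (\d+)\b", text)):
        if n not in first_mentions:
            first_mentions.append(n)
    return [
        f"expected Figure {i}, found Figure {n}"
        for i, n in enumerate(first_mentions, start=1)
        if n != i
    ]
```

A deterministic check like this never hallucinates, but it only catches the one pattern it was written for, which is where a long-context model pass complements it.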
The agentic proofreader had a low false positive rate: almost everything it flagged genuinely was an error. But as it turned out, many of the chunks it certified as correct actually contained errors.
Human Review
Proofreading is a collaborative process that benefits from multiple readers. My editor and I each did a proofreading pass following industry best practices. She read on paper, while I did a full read-aloud edit, speaking each word of the book out loud to catch issues that can remain invisible on the page. It's amazing how many issues these tricks catch even for a manuscript that has been read on the screen by dozens of people multiple times.
While many of the changes we made were somewhat subjective, we fixed a large number of objective errors and inconsistencies that the proofreading system missed. No proofreader is perfect, and I imagine that a typo here or there made it to print (if you find one, please don't tell me until I start on the second edition).
I doubt that human proofreaders will retain such an advantage over LLMs for long. I imagine that by the time I write my next book, the ratio will have flipped, with the agent catching the vast majority of the errors and the human editor catching the remaining errors and making the subjective changes.