The score is the easy part
Previously, we built a benchmark engine for structured questions. There, getting the mark right is straightforward: either the score matches the teacher's or it doesn't.
Trying to benchmark essay feedback is harder. A piece of feedback can be accurate but useless ("the argument is weak"), or genuinely helpful even if it doesn't use the same words as the teacher. You need to know: did the AI catch the same issues the teacher caught? Did it explain them clearly? Did it land on the right score?
These are three different questions, and they do not always move together. An AI that scores perfectly can still write vague, unhelpful comments. One that writes beautifully detailed feedback can consistently under-mark.
That is why we built a benchmark specifically for essay feedback, one that measures all three independently - and why its design matters as much as the grading engine itself.
Why this matters: more than just a number
Most claims about AI quality are hard to verify. "Our model is accurate" is easy to say and hard to challenge. A benchmark makes the claim falsifiable.
Tracking genuine improvement. Every time we change something in the grading engine - a prompt adjustment, a new model version, a different feedback structure - we run the same essays through the same benchmark. If the score goes up, the change helped. If it goes down, we revert. Without this, "our AI got better" is an assertion. With this, it's a measurement.
Detecting hallucination. This is one of the most serious failure modes in AI grading: the model generates fluent, confident feedback that has nothing to do with the student's actual essay. Our benchmark catches this through the AI judge, which collapses to near zero even when other components show coincidental partial matches. Generic educational phrasing like "the argument is unclear" can accidentally appear relevant across essays, but it doesn't fool an evaluator reading the actual content.
Aligning with human-level quality. The benchmark measures alignment with actual teacher judgement. A high overall score means the AI's feedback is close to what an experienced teacher would have written - not as a claim, but as a measurement.
Three questions, one score
Essay feedback quality comes down to three questions a teacher would ask:
- Did it catch the right things? → Issue Recall
- Was the feedback actually useful? → AI Judge Score
- Did it get the grade right? → Score Accuracy
Each is weighted in that order - recall matters most, score accuracy least. The ordering is deliberate. Students receive AI feedback before a teacher reviews it. A missed structural problem - a thesis that contradicts itself, a central claim left unsupported - does lasting damage that no score correction can undo. Generic feedback does too. A wrong mark, by contrast, can be fixed in seconds.
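As a rough sketch, the combination can be expressed in code. The weights below are hypothetical placeholders - the design above specifies only their ordering (recall highest, score accuracy lowest), not these exact values:

```python
# Illustrative combination of the three benchmark components.
# The weights are hypothetical - only their ordering (recall weighted
# most, score accuracy least) comes from the design described above.
WEIGHTS = {"issue_recall": 0.5, "ai_judge": 0.3, "score_accuracy": 0.2}

def overall_score(issue_recall: float, ai_judge: float, score_accuracy: float) -> float:
    """Each component is on a 0-100 scale; returns a 0-100 overall score."""
    components = {
        "issue_recall": issue_recall,
        "ai_judge": ai_judge,
        "score_accuracy": score_accuracy,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

Because recall carries the largest weight, a ten-point gain in recall moves the overall score more than the same gain in score accuracy - which is the intended incentive.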
Issue Recall: catching what the teacher caught
Issue recall measures whether the AI identified the same problems the teacher flagged. As with most recall-focused metrics, we penalise misses, not extras. The ideal AI catches everything the teacher did, and possibly more - so extra AI observations should not hurt the score; only missed teacher comments should.
In addition, not all comments carry equal weight. A missed observation about a student's thesis or central argument costs more than a missed note about a comma. Comments touching on structure, argument, evidence, and conclusion are weighted more heavily, while surface corrections like typos and punctuation weigh less. This reflects a deliberate belief: if the AI misses what matters most, strong recall on minor issues shouldn't disguise that.
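The weighted recall can be sketched as follows. The category weights here are hypothetical placeholders - the benchmark states only that structural categories outweigh surface ones:

```python
# Hypothetical severity weights: structural issues count more than
# surface corrections. The exact values are placeholders.
CATEGORY_WEIGHTS = {
    "structure": 3.0, "argument": 3.0, "evidence": 2.0,
    "conclusion": 2.0, "word_choice": 1.0, "punctuation": 0.5,
}

def issue_recall(teacher_issues, matched_ids):
    """Weighted recall over teacher issues: misses cost their weight,
    extra AI observations cost nothing.

    teacher_issues: list of (issue_id, category) pairs.
    matched_ids: set of teacher issue ids the AI feedback covered.
    """
    total = sum(CATEGORY_WEIGHTS[cat] for _, cat in teacher_issues)
    caught = sum(CATEGORY_WEIGHTS[cat]
                 for iid, cat in teacher_issues if iid in matched_ids)
    return caught / total if total else 1.0
```

Under this weighting, missing one argument-level comment costs more than missing several punctuation notes, matching the belief stated above.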
AI Judge: what numbers can't capture
The most important question - "would this feedback actually help the student?" - is inherently qualitative.
We address this with an AI judge: a separate LLM that reads both the teacher's marking and the AI's feedback and independently evaluates issue identification, feedback quality, and missed issues. As with recall, the judge is explicitly instructed not to penalise extra AI comments - only missed ones. It is a one-sided bar: do at least as much as the teacher; do more if you can.
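A minimal sketch of the judge call is below. The prompt wording and the `call_llm` helper are hypothetical - only the one-sided rule (penalise misses, never extras) comes from the design described above:

```python
# Sketch of the judge step. JUDGE_PROMPT and call_llm are hypothetical
# stand-ins; the one-sided instruction is the part taken from the design.
JUDGE_PROMPT = """You are evaluating AI essay feedback against a teacher's marking.
Score three dimensions from 0-100: issue identification, feedback quality,
and missed issues. Do NOT penalise the AI for raising valid issues the
teacher did not mention - only penalise teacher issues the AI missed.

Teacher marking:
{teacher_feedback}

AI feedback:
{ai_feedback}
"""

def judge(teacher_feedback: str, ai_feedback: str, call_llm) -> dict:
    """Runs a separate LLM over both texts; expects the three sub-scores back."""
    prompt = JUDGE_PROMPT.format(teacher_feedback=teacher_feedback,
                                 ai_feedback=ai_feedback)
    return call_llm(prompt)
```

Passing `call_llm` in as a parameter keeps the judge model swappable and makes the step easy to stub out in tests.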
Score Accuracy: why grade-band errors matter disproportionately
A 2-mark error on a 25-mark essay is a rounding issue. A 6-mark error might flip a student from one grade band to another. These are not proportionally different problems - the second is categorically worse.
We apply exponential decay once errors exceed a small tolerance. Being consistently close is strongly rewarded; occasional large divergences are penalised sharply:
| AI Score (teacher = 18/25) | Score Accuracy |
|---------------------------|---------------|
| 17.75 (1% off) | 99/100 |
| 18.5 (2% off) | 98/100 |
| 20.5 (10% off) | 67/100 |
| 22.5 (18% off) | 28/100 |
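One way to sketch such a curve: near-full credit inside a small tolerance, exponential decay beyond it. The `tolerance` and `decay` constants below are illustrative placeholders that reproduce the shape of the table above, not the benchmark's actual parameters or exact values:

```python
import math

def score_accuracy(ai_score: float, teacher_score: float, max_marks: float,
                   tolerance: float = 0.02, decay: float = 7.0) -> float:
    """Illustrative score-accuracy curve (0-100). Constants are placeholders:
    small relative errors lose almost nothing; beyond the tolerance the
    score decays exponentially, so large divergences are punished sharply."""
    error = abs(ai_score - teacher_score) / max_marks  # relative error
    if error <= tolerance:
        return 100.0 * (1 - error)  # near-perfect inside the tolerance
    return 100.0 * (1 - tolerance) * math.exp(-decay * (error - tolerance))
```

The key property is the asymmetry of the shape: a 2% error costs a couple of points, while a 18% error costs most of the score.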
Why recall is harder than it sounds
Our first benchmark measured similarity at the comment level: match each AI comment to the closest teacher comment using semantic embeddings, and score accordingly. (Embeddings are numerical representations of text that capture meaning rather than exact wording, so "unclear thesis" and "the argument lacks a central claim" score as similar even though they share no words.) It worked in simple cases. But it systematically under-counted recall whenever the AI - as it tends to - wrote one detailed paragraph covering what a teacher had flagged across three separate annotations.
The arithmetic is punishing. Match that paragraph to its best-fitting teacher comment, and it can't be matched to the others. Two of three issues go undetected even though the AI addressed all of them - giving 33% recall on a response that was substantively correct. Length normalisation doesn't rescue this. Once a comment is matched, it's consumed. You can't apportion a single AI comment across multiple teacher points by adjusting for word count - so the problem is structural, not metric.
Our fix is to decompose comments into atomic issues before matching. A teacher annotation like "the thesis is too hedged - commit to a position, and explain why the constructivist framing is necessary" contains two distinct points. An AI comment that addresses both as part of a longer passage can now match both, rather than being consumed by the first. This is how the current benchmark handles the gap between terse teacher annotations and verbose AI elaboration.
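The consumption problem and the decomposition fix can be seen in a toy example. The `similarity` function here is a deliberately crude word-overlap stand-in for the embedding comparison, and the thresholds are arbitrary - everything is illustrative:

```python
# Toy illustration of the consumption problem. similarity() is a crude
# word-overlap stand-in for semantic embeddings; thresholds are arbitrary.
def similarity(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def recall_one_to_one(teacher_comments, ai_comments, threshold=0.1):
    """Greedy one-to-one matching: an AI comment, once matched, is consumed."""
    available = list(ai_comments)
    hits = 0
    for tc in teacher_comments:
        best = max(available, key=lambda ac: similarity(tc, ac), default=None)
        if best is not None and similarity(tc, best) >= threshold:
            hits += 1
            available.remove(best)  # consumed - cannot match again
    return hits / len(teacher_comments)

def recall_decomposed(atomic_issues, ai_comments, threshold=0.1):
    """Each atomic teacher issue may match any AI comment, reused freely."""
    hits = sum(
        1 for issue in atomic_issues
        if any(similarity(issue, ac) >= threshold for ac in ai_comments)
    )
    return hits / len(atomic_issues)
```

With one verbose AI paragraph covering three separate teacher points, the one-to-one version reports 33% recall; the decomposed version correctly reports 100%.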
We also carry forward a design decision from the original benchmark: an LLM verification step for borderline matches, where embedding similarity falls in a grey zone. Rather than setting a hard threshold and dropping legitimate matches, a lightweight LLM call confirms whether two comments are addressing the same underlying issue. On the essay below, this resolved three additional matches that embeddings alone would have missed.
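The decision logic for that step is simple to sketch. The two thresholds and the `verify_with_llm` helper are hypothetical placeholders:

```python
# Sketch of the grey-zone check. HIGH/LOW thresholds and verify_with_llm
# are hypothetical placeholders for the real values and LLM call.
HIGH, LOW = 0.80, 0.55

def is_match(similarity: float, teacher_issue: str, ai_comment: str,
             verify_with_llm) -> bool:
    if similarity >= HIGH:
        return True   # confident embedding match
    if similarity < LOW:
        return False  # confident non-match
    # Grey zone: ask a lightweight LLM whether both address the same issue.
    return verify_with_llm(teacher_issue, ai_comment)
```

The LLM is only called in the grey zone, so the extra cost is bounded by how many borderline pairs actually occur.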
A final adjustment: a lookahead window that searches adjacent paragraphs for unmatched teacher comments. Teachers sometimes annotate the beginning of a paragraph for an issue that spans into the next one; the AI may address it a paragraph later. Without lookahead, that's a miss. With it, it's a match.
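The lookahead itself amounts to a short search. The window size and function names below are hypothetical:

```python
# Sketch of the lookahead: an unmatched teacher comment on paragraph i may
# still be satisfied by AI comments on paragraphs i+1..i+window. The
# window size and names here are hypothetical.
def match_with_lookahead(teacher_para: int, ai_comments_by_para: dict,
                         matches_fn, window: int = 1):
    """Try the annotated paragraph first, then up to `window` paragraphs ahead."""
    for para in range(teacher_para, teacher_para + window + 1):
        for comment in ai_comments_by_para.get(para, []):
            if matches_fn(comment):
                return comment
    return None
```

With `window=0` this degenerates to the strict same-paragraph rule; `window=1` is enough to catch the spanning-issue case described above.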
In practice: a real essay
The essay is a JC Knowledge and Inquiry paper on moral epistemology: "We can know what is right for us individually, but we can never know what is right for the whole of humanity." The student takes a constructivist position - rejecting moral realism but arguing that collective, pragmatic knowledge of right and wrong is still possible.
The teacher left 25 comments across 12 paragraphs: notes on extreme framing in the introduction, requests for reasoning on specific claims, word choice corrections, and a structural concern that the constructivist argument - the essay's central contribution - was too brief relative to everything that preceded it.
The AI matched 19 of those 25 comments and added 55 observations the teacher didn't make, including:
- Philosophical enrichment: flagging that the divine command section should engage with the Euthyphro dilemma - a canonical counter-argument in ethics that the teacher didn't annotate
- Argumentation critique: a more detailed treatment of Moore's open question argument and where the student's use of it falls short
- Internal contradiction: noting a tension in the student's constructivist framing - that claiming collective pragmatic knowledge of right and wrong risks collapsing into majoritarianism, and that this needs to be addressed
- Structural signposting: suggesting the student more clearly distinguish feasibility critiques from adequacy critiques across the essay's middle sections
The six missed comments were mostly specific to the teacher's reading: concerns about "yardsticks for progress" framing and over-extreme claims in the introduction, the teacher's pointed note that the essay's central constructivist argument was structurally underdeveloped, and a handful of targeted word choices the teacher had circled.
The AI judge's assessment:
The AI provides detailed, actionable philosophical and language feedback and often aligns with the teacher's concerns - over-extremity, need to explain reasoning, clarify terms. However, it misses several teacher-specific points: the teacher repeatedly flags claims as too extreme, asks for specific clarifications and attributions, and emphasises that the constructivist solution is the main focus and is underdeveloped. Overall, AI feedback is higher-quality and broader than the teacher's, but does not capture all of the teacher's idiosyncratic corrections and emphasis on proportion and focus.
Overall score: 74/100. Issue recall: 76%. AI judge: 71/100. The AI under-marked by 3 points (22 vs. 25 out of 30) - a 10% gap on score accuracy. Three of the 19 matched pairs were borderline embedding cases resolved by the LLM verification step.
How we guarantee the metric itself is honest
A benchmark is only useful if it reliably distinguishes good grading from bad. We verify this with sanity checks that run automatically on every release. The test is deliberately extreme: a worst case using AI comments from a completely different essay, and a best case using the teacher's own marking as AI output.
| | Worst Case | Best Case |
|---|-----------|-----------|
| Issue Recall | 16.7% | 94.4% |
| Score Accuracy | 33.4/100 | 98.3/100 |
| AI Judge Score | 2.0/100 | 97.0/100 |
| Overall Score | 15.6/100 | 96.2/100 |
A few of these numbers deserve explanation.
Worst-case recall is 16.7%, not zero. Comments lifted from a completely different essay contain occasional generic educational phrasing that can accidentally clear an embedding threshold - "the argument is unclear", "more evidence needed" - producing coincidental matches. This is a known limitation of any embedding-based matching system, and part of why the benchmark uses three components rather than relying on any one.
Worst-case score accuracy is 33.4/100, not zero. The AI grading the wrong essay still assigned a numerical score, which happened to land within a certain distance of the teacher's mark - by coincidence, not by comprehension. Exponential decay penalises this to 33.4, but it doesn't collapse entirely.
The AI judge is what catches both. At 2.0/100 on completely irrelevant feedback, this is the metric that cannot be fooled by surface coincidences. When recall and score accuracy can be gamed by chance, the judge reads the actual content and renders a verdict. The 80-point overall gap - 15.6 worst to 96.2 best - is the number that matters.
Still improving
The benchmark described here is not the final version. This is but one step in an ongoing process, and publishing it is part of that - making the measurement transparent so it can be challenged and refined.
Multi-teacher calibration. A student essay marked by two teachers rarely produces identical feedback. The current benchmark measures against a single teacher's standard, which conflates genuine AI errors with normal inter-teacher variation. The richer version compares against multiple teachers to separate the two.
Cross-subject expansion. The current benchmark focuses on argumentative essays. Narrative writing, scientific reports, and structured exam responses each have different quality dimensions. Expanding the test set is the next frontier.
Longitudinal quality tracking. Does the AI maintain the same quality at its 500th essay as its first? Consistency over scale is a different question from accuracy on a sample - and an important one for schools deploying at volume.
The deeper goal behind all of this is never to build an AI that replaces teacher judgement. It is to build one that extends it - catching what a teacher would catch, and sometimes catching what they didn't have time to. The benchmark is how we keep ourselves honest about how well we're doing that. This is one improvement among many we are working on, and we will continue to share them as they mature.
Interested in how Ren approaches essay grading at your school? Get in touch to learn more.