The promise and the pitfall
Large language models can now evaluate student writing with surprising accuracy. But accuracy on benchmarks doesn't mean reliability in a classroom. When we started building Ren, we found that even the best models occasionally hallucinate feedback - confidently telling a student their answer is wrong when it's actually correct.
That's why every piece of feedback Ren generates is designed to be reviewed, edited, and approved by a teacher before it reaches a student.
Where models fail
Through our testing across thousands of student responses, we've identified three recurring failure modes:
1. Context blindness
Models don't know what was taught in class last week. A student who uses unconventional terminology picked up from a teacher's explanation may be penalised for "incorrect language" when they're actually demonstrating understanding.
2. Rubric drift
Without careful prompt engineering and grounding, models tend to apply their own implicit standards rather than the specific rubric a teacher intended. This is especially problematic in subjects like English Literature where marking criteria vary significantly between exam boards.
3. Confidence without calibration
Models rarely say "I'm not sure." They'll assign a mark with the same tone whether they're highly confident or essentially guessing. This false confidence can mislead teachers who don't have time to double-check every response.
Our approach
Ren treats AI as a first-pass assistant, not a final arbiter. Here's what that means in practice:
- Every piece of feedback is presented as a draft for teacher review
- Teachers can edit, approve, or reject individual feedback points
- We build explainability in - e.g. for structured questions, AI grading shows a visual indicator of which part of the answer scheme each judgement references
The goal isn't to replace teacher judgement - it's to give teachers a head start so they can spend their time on the feedback that matters most.
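The draft-review workflow above can be sketched as a simple data model. This is purely illustrative - the class, field, and status names are our own, not Ren's actual schema:

```python
# A minimal sketch of AI-drafted feedback that a teacher reviews before release.
# All names here are hypothetical illustrations, not Ren's real data model.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    DRAFT = "draft"        # AI-generated, not yet reviewed by a teacher
    APPROVED = "approved"  # teacher signed off as-is
    EDITED = "edited"      # teacher changed the wording or mark
    REJECTED = "rejected"  # teacher discarded this point

@dataclass
class FeedbackPoint:
    comment: str
    mark_scheme_ref: str   # which part of the answer scheme this point cites
    status: Status = Status.DRAFT

    def approve(self):
        self.status = Status.APPROVED

    def edit(self, new_comment: str):
        self.comment = new_comment
        self.status = Status.EDITED

def releasable(points):
    """Only teacher-reviewed points ever reach a student; drafts never do."""
    return [p for p in points if p.status in (Status.APPROVED, Status.EDITED)]
```

The invariant lives in `releasable`: a point with status `DRAFT` can never be shown to a student, which is exactly the teacher-in-the-loop guarantee described above.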
"But doesn't review add more work?"
A common objection: if teachers still have to review every piece of AI-generated feedback, does AI actually save time - or does it just shift the bottleneck?
This is the same question the software industry is grappling with. Anthropic's research on AI productivity gains found that AI can deliver significant time savings on tasks like writing code, documentation, and data manipulation. But the study also acknowledges a key limitation: it doesn't fully account for the time humans spend reviewing, editing, and validating AI outputs afterward.
In software engineering, AI hasn't eliminated the need for human judgement - it has shifted where that judgement is applied. Developers spend less time writing boilerplate and more time on code review, architecture decisions, and catching edge cases. The role evolves from creator to curator.
The same shift applies to teaching
When a teacher grades 30 scripts from scratch, most of their time goes to the mechanical work: reading, scoring against the rubric, writing similar feedback for the tenth time. The high-value work - identifying misconceptions, personalising guidance, planning interventions - gets squeezed into whatever time is left.
AI flips this ratio. Ren handles the first pass, and the teacher's time shifts to reviewing drafts and refining feedback - the higher-level work that actually impacts student outcomes.
What this looks like in practice
We tested this with a partner school. A teacher split 30 submissions into two groups of 15:
- With Ren: AI graded in 9m 32s, then the teacher reviewed at ~2m 52s per script. Total: 52m 32s.
- Without Ren: The teacher graded from scratch at ~10m 23s per script. Total: 155m 45s.
That's a 66% reduction in time - and the teacher spent their time on review and refinement rather than mechanical scoring.
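As a sanity check on those numbers, the arithmetic from the 15-script trial works out as follows (timings taken directly from the figures above):

```python
# Verify the grading-time comparison from the 15-script trial.
def to_seconds(minutes: int, seconds: int) -> int:
    return minutes * 60 + seconds

scripts = 15

# With Ren: one AI batch pass, then per-script teacher review.
ai_pass = to_seconds(9, 32)
review_per_script = to_seconds(2, 52)
with_ren = ai_pass + scripts * review_per_script    # 3152 s = 52m 32s

# Without Ren: the teacher grades each script from scratch.
grade_per_script = to_seconds(10, 23)
without_ren = scripts * grade_per_script            # 9345 s = 155m 45s

reduction = 1 - with_ren / without_ren
print(f"{with_ren}s vs {without_ren}s -> {reduction:.0%} time saved")
# -> 3152s vs 9345s -> 66% time saved
```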
The parallel to software is striking. Just as developers using AI spend less time writing code and more time reviewing it, teachers using Ren spend less time on initial grading and more time on the feedback that matters. The total hours go down, but the proportion of time spent on high-value work goes up.
Looking ahead
We're investing in better calibration techniques, rubric-grounded generation, and explainability features that show teachers why the model gave a particular mark. Our team works closely with teachers to refine the product, so that we leverage AI without losing human expertise.
Want to see how Ren handles feedback review? Get in touch to book a demo.