Why we benchmark
Claims about AI accuracy are easy to make and hard to verify. That's why we run rigorous benchmarks comparing Ren's marking against experienced human markers - and we publish the results.
Methodology
Dataset
- Mix of short-answer (1-2 marks), medium response (3-4 marks), and extended response (6+ marks) questions
- Covering 8 different topics across the chemistry O-Level syllabus
Marking process
Each response was marked by:
- Human markers (experienced GCSE teachers)
- Ren's AI marking system (with answer scheme uploaded)
We used the teachers' scores as ground truth and measured Ren's agreement against this gold standard. Note that this benchmark measures how closely our AI marking matches human marking; mismatches can arise from several factors - human error, AI marking errors, and vagueness in the answer scheme.
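To make the measurement concrete, here is a minimal sketch of how agreement figures like these can be computed from paired marks. The marks and variable names below are illustrative only, not our actual evaluation code or data:

```python
# Minimal sketch: agreement metrics over paired integer marks.
# The marks below are made up for illustration.
human = [3, 1, 4, 2, 6, 0, 3, 5]  # gold-standard teacher marks
ai    = [3, 1, 3, 2, 6, 1, 3, 5]  # AI marks for the same responses

n = len(human)
exact      = sum(h == a for h, a in zip(human, ai)) / n
within_one = sum(abs(h - a) <= 1 for h, a in zip(human, ai)) / n
mae        = sum(abs(h - a) for h, a in zip(human, ai)) / n

print(f"Exact agreement: {exact:.1%}")
print(f"Within 1 mark:   {within_one:.1%}")
print(f"Mean abs. error: {mae:.2f} marks")
```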
Results
| Metric | Ren Score |
|--------|-----------|
| Exact mark agreement | 82.4% |
| Within 1 mark | 96.1% |
| Cohen's Kappa (vs consensus) | 0.79 |
| Mean absolute error | 0.31 marks |
For context, the exact agreement between any two individual human markers typically falls in the same range. This means Ren performs within the range of normal human marker variation.
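The Cohen's kappa figure corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the chance agreement implied by each marker's marginal mark distribution. Here is a from-scratch sketch of that calculation (illustrative only):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters marking the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical marks.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal mark distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[m] * freq_b[m] for m in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: 4/6 raw agreement yields kappa = 0.6 after chance correction.
print(cohens_kappa([3, 1, 4, 2, 6, 0], [3, 1, 3, 2, 6, 1]))
```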
By question type
| Question Type | Exact Agreement | Within 1 Mark |
|---------------|-----------------|---------------|
| Short answer (1-2 marks) | 91.2% | 99.1% |
| Medium response (3-4 marks) | 79.8% | 95.3% |
| Extended response (6+ marks) | 68.4% | 89.2% |
As expected, accuracy is highest on structured short-answer questions and decreases for longer, more subjective responses. This is consistent with human marker behaviour - inter-marker reliability also drops on extended responses.
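The per-type rows above are the same metrics computed over subsets of the dataset. A sketch of how such a breakdown might be produced (the records and type labels are illustrative):

```python
from collections import defaultdict

# Illustrative records: (question_type, human_mark, ai_mark).
records = [
    ("short",    2, 2), ("short",    1, 1),
    ("medium",   3, 4), ("medium",   4, 4),
    ("extended", 6, 5), ("extended", 7, 7),
]

# Group paired marks by question type, then score each group.
by_type = defaultdict(list)
for qtype, h, a in records:
    by_type[qtype].append((h, a))

for qtype, pairs in by_type.items():
    n = len(pairs)
    exact  = sum(h == a for h, a in pairs) / n
    within = sum(abs(h - a) <= 1 for h, a in pairs) / n
    print(f"{qtype:>8}: exact {exact:.1%}, within 1 mark {within:.1%}")
```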
Key takeaways
What Ren does well
- Factual accuracy checking: Ren is excellent at identifying whether a student has included required scientific facts and key terms
- Structure recognition: The model reliably identifies whether responses follow expected structures (e.g., "describe and explain" format)
- Consistency: Unlike human markers, Ren doesn't experience fatigue effects - the 500th paper is marked with the same attention as the first
Where Ren needs teacher review
- Borderline cases: Responses that sit right on a grade boundary benefit from human judgement
- Creative or unusual answers: Students who demonstrate understanding through unconventional approaches sometimes need a teacher to recognise the validity of their reasoning, especially if this is not reflected in the answer scheme
- Handwriting artefacts: When working with scanned responses, very poor handwriting or unclear drawings can reduce accuracy
Continuous improvement
We re-run benchmarks quarterly and publish updated results. Our research and engineering team is continuously improving our grading engine through:
- Expanding training data from partner schools
- Improving rubric grounding techniques (sketched below)
- Working with teachers to better handle edge cases identified in previous benchmarks
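To make "rubric grounding" concrete: the marking model sees the uploaded answer scheme alongside each response, so marks are anchored to the scheme rather than the model's general knowledge. Below is a heavily simplified, hypothetical sketch of that idea; the function and prompt wording are our illustration, not Ren's internal API:

```python
def build_marking_prompt(question: str, mark_scheme: str, response: str) -> str:
    """Hypothetical prompt assembly: anchor the marker to the rubric.

    Real systems add worked examples, structured output formats, and
    validation; this only shows the grounding idea itself.
    """
    return (
        "You are marking a chemistry O-Level answer.\n\n"
        f"Question:\n{question}\n\n"
        f"Mark scheme (award marks ONLY for points listed here):\n{mark_scheme}\n\n"
        f"Student response:\n{response}\n\n"
        "State the marks awarded, citing the mark-scheme point for each."
    )
```

Grounding the model in the scheme is one reason vague answer schemes show up as mismatches in the benchmark above: the marker can only be as precise as the rubric it is given.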
Coming on board
For our early partners, our team will run benchmarks on your own marking data and make sure the product works for your use cases.
Ready to run your own benchmark? Get in touch and we'll set it up.