Why we benchmark
Claims about AI accuracy are easy to make and hard to verify. That's why we run rigorous benchmarks comparing Ren's marking against experienced human markers - and we publish the results.
Methodology
Dataset
- Mix of short-answer (1-2 marks), medium response (3-4 marks), and extended response (6+ marks) questions
- Covering 8 different topics across the chemistry O-Level syllabus
Marking process
Each response was marked by:
- Human markers (experienced GCSE teachers)
- Ren's AI marking system (with answer scheme uploaded)
We used the teachers' scores as ground truth and measured Ren's agreement against this gold standard. Note that this benchmark measures how closely our AI marking matches human marking; mismatches can arise from several factors - human error, AI marking errors, and vagueness in the answer scheme.
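To make the measurement concrete, here is a minimal sketch of how agreement figures like these can be computed from paired marks. The marks and variable names below are illustrative only, not our actual evaluation code or data:

```python
# Minimal sketch: agreement metrics over paired integer marks.
# The marks below are made up for illustration.
human = [3, 1, 4, 2, 6, 0, 3, 5]  # gold-standard teacher marks
ai    = [3, 1, 3, 2, 6, 1, 3, 5]  # AI marks for the same responses

n = len(human)
exact      = sum(h == a for h, a in zip(human, ai)) / n
within_one = sum(abs(h - a) <= 1 for h, a in zip(human, ai)) / n
mae        = sum(abs(h - a) for h, a in zip(human, ai)) / n

print(f"Exact agreement: {exact:.1%}")
print(f"Within 1 mark:   {within_one:.1%}")
print(f"Mean abs. error: {mae:.2f} marks")
```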
Results
| Metric | Ren Score |
|--------|-----------|
| Exact mark agreement | 82.4% |
| Within 1 mark | 96.1% |
| Cohen's Kappa (vs consensus) | 0.79 |
| Mean absolute error | 0.31 marks |
For context, the exact agreement between any two individual human markers typically falls in the same range. This means Ren performs within the range of normal human marker variation.
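The Cohen's kappa figure corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the chance agreement implied by each marker's marginal mark distribution. Here is a from-scratch sketch of that calculation (illustrative only):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters marking the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical marks.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal mark distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[m] * freq_b[m] for m in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: 4/6 raw agreement yields kappa = 0.6 after chance correction.
print(cohens_kappa([3, 1, 4, 2, 6, 0], [3, 1, 3, 2, 6, 1]))
```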
By question type
| Question Type | Exact Agreement | Within 1 Mark |
|---------------|-----------------|---------------|
| Short answer (1-2 marks) | 91.2% | 99.1% |
| Medium response (3-4 marks) | 79.8% | 95.3% |
| Extended response (6+ marks) | 68.4% | 89.2% |
As expected, accuracy is highest on structured short-answer questions and decreases for longer, more subjective responses. This is consistent with human marker behaviour - inter-marker reliability also drops on extended responses.
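The per-type rows above are the same metrics computed over subsets of the dataset. A sketch of how such a breakdown might be produced (the records and type labels are illustrative):

```python
from collections import defaultdict

# Illustrative records: (question_type, human_mark, ai_mark).
records = [
    ("short",    2, 2), ("short",    1, 1),
    ("medium",   3, 4), ("medium",   4, 4),
    ("extended", 6, 5), ("extended", 7, 7),
]

# Group paired marks by question type, then score each group.
by_type = defaultdict(list)
for qtype, h, a in records:
    by_type[qtype].append((h, a))

for qtype, pairs in by_type.items():
    n = len(pairs)
    exact  = sum(h == a for h, a in pairs) / n
    within = sum(abs(h - a) <= 1 for h, a in pairs) / n
    print(f"{qtype:>8}: exact {exact:.1%}, within 1 mark {within:.1%}")
```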
Key takeaways
What Ren does well
- Factual accuracy checking: Ren is excellent at identifying whether a student has included required scientific facts and key terms
- Structure recognition: The model reliably identifies whether responses follow expected structures (e.g., "describe and explain" format)
- Consistency: Unlike human markers, Ren doesn't experience fatigue effects - the 500th paper is marked with the same attention as the first
Where Ren needs teacher review
- Borderline cases: Responses that sit right on a grade boundary benefit from human judgement
- Creative or unusual answers: Students who demonstrate understanding through unconventional approaches sometimes need a teacher to recognise the validity of their reasoning, especially if this is not reflected in the answer scheme
- Handwriting artefacts: When working with scanned responses, very poor handwriting or unclear drawings can reduce accuracy
Continuous improvement
We re-run benchmarks quarterly and publish updated results. Our research and engineering team is continuously improving our grading engine through:
- Expanding training data from partner schools
- Improving rubric grounding techniques (sketched below)
- Working with teachers to better handle edge cases identified in previous benchmarks
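To make "rubric grounding" concrete: the marking model sees the uploaded answer scheme alongside each response, so marks are anchored to the scheme rather than the model's general knowledge. Below is a heavily simplified, hypothetical sketch of that idea; the function and prompt wording are our illustration, not Ren's internal API:

```python
def build_marking_prompt(question: str, mark_scheme: str, response: str) -> str:
    """Hypothetical prompt assembly: anchor the marker to the rubric.

    Real systems add worked examples, structured output formats, and
    validation; this only shows the grounding idea itself.
    """
    return (
        "You are marking a chemistry O-Level answer.\n\n"
        f"Question:\n{question}\n\n"
        f"Mark scheme (award marks ONLY for points listed here):\n{mark_scheme}\n\n"
        f"Student response:\n{response}\n\n"
        "State the marks awarded, citing the mark-scheme point for each."
    )
```

Grounding the model in the scheme is one reason vague answer schemes show up as mismatches in the benchmark above: the marker can only be as precise as the rubric it is given.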
Coming on board
For our early partners, our team will run benchmarks on your own marking data and make sure the product works for your use cases.
Ready to run your own benchmark? Get in touch and we'll set it up.