Otto Scores 88.5% on the KORA Child Safety Benchmark. Here's What That Means — and What It Doesn't.
Otto scored 88.5% on the KORA child safety benchmark — 12.5 points ahead of the highest-scoring model. We ran these tests ourselves using KORA's open source methodology. Here's what it tells us.

Key Takeaways
- Otto scored 88.5% on the KORA child safety benchmark, 12.5 points ahead of the best frontier model (76%)
- Tests were self-run by HeyOtto using KORA's publicly available open source benchmark — KORA did not conduct or endorse the evaluation
- KORA tests whether AI responses are safe for children across 25 risk categories using hundreds of thousands of synthetic conversations
- The best frontier models (Claude, GPT-5.x) score 70–76%, while many widely used models score below 50%
- KORA does not evaluate product-level safety features like parental controls, monitoring, or crisis intervention
- HeyOtto solves both halves: safe AI responses (measured by KORA) plus safety infrastructure that keeps parents in the loop when something goes wrong
We recently ran the KORA child safety benchmark against Otto, HeyOtto's AI assistant for kids and teens. Otto scored 88.5%.
For context: the highest-scoring frontier model on KORA's public leaderboard is 76%. The majority of widely used models — including several that children interact with daily — score below 50%.
We want to be transparent about what this means, how we got here, and why we think the number, while meaningful, only tells half the story.
What is KORA?
KORA is an independent, non-profit research initiative that created the first benchmark specifically designed to evaluate how safe AI systems are for children and teens. Working with over 30 child-safety experts, psychologists, and researchers, KORA developed a safety taxonomy covering 25 critical risk categories across age bands (Big Kids 7–9, Tweens 10–12, Teens 13+).
The testing process is fully automated. An LLM generates realistic synthetic conversations simulating how children actually interact with AI — no real children are involved. A separate LLM judge then evaluates each AI response, rating it as exemplary, adequate, or failing. KORA validated this automated methodology against human judgment by having experts review a sample of evaluations, and uses multiple LLMs as judges to confirm consistency.
KORA's methodology is transparent and open source. Anyone can run the benchmark independently, audit the scenarios, and verify results. This matters — the benchmark's value comes from its independence and reproducibility, not from any single company's results. You can review KORA's full methodology at korabench.ai/benchmark.
How we tested
We ran the KORA benchmark ourselves using the publicly available open source materials provided by KORA. These results are self-reported. KORA is an independent non-profit research initiative and did not conduct, commission, or endorse this specific evaluation.
We believe this transparency is important. Any organization can run the same tests against their own system and publish the results. That's the point of an open benchmark — it creates a shared standard that anyone can verify.
The results in context
KORA maintains a public leaderboard of frontier AI models evaluated between January and March 2026. Here's how the landscape looks:
Otto: 88.5%
Claude Haiku 4.5 / Claude Opus 4.6: 76% (leading models)
GPT-5.2: 75%
Claude Sonnet 4.6 / Claude Sonnet 4.5: 74–75%.
GPT-5.4: 71%
Claude Haiku 3.5: 70%
GPT-5.2 Chat: 63%
Gemini 2.5 Flash: 48%
Grok 3: 29%.
Otto's score represents a 12.5 percentage point advantage over the best-performing frontier model. That margin matters because KORA's scenarios test the kinds of real-world situations children actually encounter — not abstract safety benchmarks designed for adult use cases.
But this number, on its own, is not the full picture.
What KORA measures — and what it doesn't
KORA tests whether an AI model's response to a child is safe. Given a realistic scenario involving a child or teen, does the model produce an appropriate, protective response? Does it avoid harmful content? Does it handle sensitive topics — self-harm, bullying, exploitation, dangerous activities — in a developmentally appropriate way?
This is the right starting point. And it's what most discussions of AI child safety focus on.
But KORA itself is clear about its scope. From their published limitations: the benchmark "does not evaluate model-level safety features such as reporting mechanisms, moderation workflows, parental controls, or product-specific safeguards." It evaluates "underlying models rather than end-user applications or product wrappers."
In other words, KORA answers a critical question: Is the AI's response safe?
It does not answer an equally critical question: What happens when something goes wrong anyway?
The second half of the equation
A safe AI response is necessary. It is not sufficient.
No model scores 100% on KORA. No model ever will. AI systems are probabilistic. They will occasionally produce responses that miss the mark — responses that are technically within bounds but developmentally inappropriate, or responses that fail to recognize a subtle signal of distress, or responses that handle a sensitive topic with less care than a specific child needs in a specific moment.
The question is: when that happens, what catches it?
For most AI platforms, the answer is nothing. There is no monitoring layer. There is no parent in the loop. There is no system watching for patterns — a child who asks increasingly dark questions over a period of days, a teen whose conversations shift toward isolation or self-harm, a pattern of engagement that looks fine message-by-message but reveals something concerning when you step back.
This is the second half of the equation, and it's the half that no benchmark currently measures.
HeyOtto was built to solve both halves. The KORA benchmark tests Otto's judgment — whether the AI produces safe, appropriate, developmentally aware responses. Our safety infrastructure ensures that a human parent is always in the loop:
Real-time trend detection identifies concerning patterns across conversations, not just individual messages.
Instant parent alerts notify parents when safety-relevant topics arise.
Full conversation monitoring gives parents visibility into what their children are discussing.
Content filtering at the model level enforces age-appropriate boundaries before a response is generated.
Crisis intervention surfaces appropriate resources and escalates when a child mentions self-harm or danger.
The benchmark measures whether Otto says the right thing. The infrastructure ensures that when something still slips through — because something always will — a responsible adult knows about it.
Why this matters
The AI child safety conversation tends to focus on one dimension: is the model safe? KORA has done important work in making that question measurable and comparable across models.
But parents don't just need a safe model. They need a safe product. A product that monitors, alerts, adapts, and keeps them informed. A product that treats child safety as an infrastructure problem, not just a model tuning problem.
Otto's 88.5% on KORA is a result we're proud of. It reflects years of work building an AI assistant that understands children's developmental needs and responds appropriately.
But if we're being honest about what keeps kids safe, the number is the floor, not the ceiling. The ceiling is a system that assumes the AI will sometimes get it wrong and builds the infrastructure to catch it when it does.
That's what we built. And that's what we think every AI platform serving children should be measured against.
These benchmark results were obtained by HeyOtto using the open source KORA benchmark methodology. KORA is an independent non-profit research initiative. For more information, visit korabench.ai.
HeyOtto is a family-first AI platform designed for children ages 8–18, built by parents who weren't satisfied with the alternatives. COPPA compliant. No data sold. No emotional manipulation. Start free — no credit card required.
Key Terms & Definitions
- KORA Benchmark
- An independent, non-profit child safety benchmark that evaluates AI models across 25 critical risk categories using synthetic conversations. Developed with 30+ experts. Open source and auditable.
- Content safety
- Whether an AI model's response to a child is safe, appropriate, and developmentally aware. This is what KORA measures — the safety of the AI's output in response to realistic child interactions.
- Safety infrastructure
- The product-level systems that protect children beyond the AI response itself: real-time trend detection, parent alerts, conversation monitoring, content filtering, and crisis intervention. KORA does not evaluate these features.
- Frontier model
- The most capable and advanced AI models from leading AI companies, such as Anthropic's Claude, OpenAI's GPT-5, and Google's Gemini.
Sources & Citations
KORA child safety benchmark methodology and leaderboard
KORA BenchmarkKORA benchmark methodology, principles, and limitations
KORA BenchmarkKORA public leaderboard with frontier model scores
KORA Benchmark70% of children use AI chatbots
Common Sense MediaOnly 37% of parents are aware their children use AI chatbots
Common Sense Media
Frequently Asked Questions
Common questions about this topic, answered.
What is the KORA child safety benchmark?
How did Otto score on the KORA benchmark?
Did KORA run these tests on Otto?
What does the KORA benchmark measure?
Why does HeyOtto say safe responses are only half the equation?
Ready to Give Your Child a Safe AI Experience?
Try HeyOtto today and see the difference parental peace of mind makes.
