Trust & Transparency
HeyOtto Team

Otto Scores 88.5% on the KORA Child Safety Benchmark. Here's What That Means — and What It Doesn't.

Otto scored 88.5% on the KORA child safety benchmark — 12.5 points ahead of the highest-scoring model. We ran these tests ourselves using KORA's open source methodology. Here's what it tells us.


Key Takeaways

  • Otto scored 88.5% on the KORA child safety benchmark, 12.5 points ahead of the best frontier model (76%)
  • Tests were self-run by HeyOtto using KORA's publicly available open source benchmark — KORA did not conduct or endorse the evaluation
  • KORA tests whether AI responses are safe for children across 25 risk categories using hundreds of thousands of synthetic conversations
  • The best frontier models (Claude, GPT-5.x) score 70–76%, while many widely used models score below 50%
  • KORA does not evaluate product-level safety features like parental controls, monitoring, or crisis intervention
  • HeyOtto solves both halves: safe AI responses (measured by KORA) plus safety infrastructure that keeps parents in the loop when something goes wrong

We recently ran the KORA child safety benchmark against Otto, HeyOtto's AI assistant for kids and teens. Otto scored 88.5%.

For context: the highest-scoring frontier model on KORA's public leaderboard is 76%. The majority of widely used models — including several that children interact with daily — score below 50%.

We want to be transparent about what this means, how we got here, and why we think the number, while meaningful, only tells half the story.

What is KORA?

KORA is an independent, non-profit research initiative that created the first benchmark specifically designed to evaluate how safe AI systems are for children and teens. Working with over 30 child-safety experts, psychologists, and researchers, KORA developed a safety taxonomy covering 25 critical risk categories across age bands (Big Kids 7–9, Tweens 10–12, Teens 13+).

The testing process is fully automated. An LLM generates realistic synthetic conversations simulating how children actually interact with AI — no real children are involved. A separate LLM judge then evaluates each AI response, rating it as exemplary, adequate, or failing. KORA validated this automated methodology against human judgment by having experts review a sample of evaluations, and uses multiple LLMs as judges to confirm consistency.
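To make the aggregation step concrete, here is a minimal sketch of how per-response judge ratings could roll up into a single benchmark percentage. The three-way rating scale comes from the description above; the specific weights (exemplary = 1.0, adequate = 0.5, failing = 0.0) are an assumption for illustration, not KORA's published formula.

```python
# Hypothetical rating weights -- KORA's actual scoring formula may differ.
WEIGHTS = {"exemplary": 1.0, "adequate": 0.5, "failing": 0.0}

def benchmark_score(ratings):
    """Aggregate per-response LLM-judge ratings into a percentage score."""
    if not ratings:
        raise ValueError("no ratings to aggregate")
    total = sum(WEIGHTS[r] for r in ratings)
    return 100.0 * total / len(ratings)

# Example: 7 exemplary, 2 adequate, 1 failing -> (7*1.0 + 2*0.5) / 10 = 80%
ratings = ["exemplary"] * 7 + ["adequate"] * 2 + ["failing"]
print(round(benchmark_score(ratings), 1))  # 80.0
```

In a setup like this, each synthetic conversation contributes one judged response, and the final leaderboard number is simply the weighted mean across all of them.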

KORA's methodology is transparent and open source. Anyone can run the benchmark independently, audit the scenarios, and verify results. This matters — the benchmark's value comes from its independence and reproducibility, not from any single company's results. You can review KORA's full methodology at korabench.ai/benchmark.

How we tested

We ran the KORA benchmark ourselves using the publicly available open source materials provided by KORA. These results are self-reported. KORA is an independent non-profit research initiative and did not conduct, commission, or endorse this specific evaluation.

We believe this transparency is important. Any organization can run the same tests against their own system and publish the results. That's the point of an open benchmark — it creates a shared standard that anyone can verify.

The results in context

KORA maintains a public leaderboard of frontier AI models evaluated between January and March 2026. Here's how the landscape looks:

  • Otto: 88.5%
  • Claude Haiku 4.5 / Claude Opus 4.6: 76% (leading frontier models)
  • GPT-5.2: 75%
  • Claude Sonnet 4.6 / Claude Sonnet 4.5: 74–75%
  • GPT-5.4: 71%
  • Claude Haiku 3.5: 70%
  • GPT-5.2 Chat: 63%
  • Gemini 2.5 Flash: 48%
  • Grok 3: 29%

Otto's score represents a 12.5 percentage point advantage over the best-performing frontier model. That margin matters because KORA's scenarios test the kinds of real-world situations children actually encounter — not abstract safety scenarios designed for adult use cases.

But this number, on its own, is not the full picture.

What KORA measures — and what it doesn't

KORA tests whether an AI model's response to a child is safe. Given a realistic scenario involving a child or teen, does the model produce an appropriate, protective response? Does it avoid harmful content? Does it handle sensitive topics — self-harm, bullying, exploitation, dangerous activities — in a developmentally appropriate way?

This is the right starting point. And it's what most discussions of AI child safety focus on.

But KORA itself is clear about its scope. From their published limitations: the benchmark "does not evaluate product-level safety features such as reporting mechanisms, moderation workflows, parental controls, or product-specific safeguards." It evaluates "underlying models rather than end-user applications or product wrappers."

In other words, KORA answers a critical question: Is the AI's response safe?

It does not answer an equally critical question: What happens when something goes wrong anyway?

The second half of the equation

A safe AI response is necessary. It is not sufficient.

No model scores 100% on KORA. No model ever will. AI systems are probabilistic. They will occasionally produce responses that miss the mark — responses that are technically within bounds but developmentally inappropriate, or responses that fail to recognize a subtle signal of distress, or responses that handle a sensitive topic with less care than a specific child needs in a specific moment.

The question is: when that happens, what catches it?

For most AI platforms, the answer is nothing. There is no monitoring layer. There is no parent in the loop. There is no system watching for patterns — a child who asks increasingly dark questions over a period of days, a teen whose conversations shift toward isolation or self-harm, a pattern of engagement that looks fine message-by-message but reveals something concerning when you step back.

This is the second half of the equation, and it's the half that no benchmark currently measures.
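The pattern-over-time idea above can be sketched very simply: flag individual messages with an upstream classifier, then alert only when flags cluster within a look-back window. This is an illustrative sketch, not HeyOtto's actual pipeline — the threshold, the window length, and the notion of a per-message risk flag are all assumptions.

```python
from datetime import datetime, timedelta

# Illustrative assumptions, not production values.
ALERT_THRESHOLD = 3         # flagged messages required within the window
WINDOW = timedelta(days=7)  # look-back window

def should_alert(flag_times, now):
    """Return True when risk-flagged messages cluster inside the window.

    flag_times: timestamps of messages an upstream classifier flagged.
    A single flag may be noise; several flags over a few days form the
    kind of pattern worth surfacing to a parent.
    """
    recent = [t for t in flag_times if now - t <= WINDOW]
    return len(recent) >= ALERT_THRESHOLD

now = datetime(2026, 3, 10)
flags = [now - timedelta(days=d) for d in (1, 3, 5, 20)]
print(should_alert(flags, now))  # True: three flags in the past week
```

The point of the sketch is the shift in unit of analysis: each message can look fine on its own, but the decision to alert is made over the trajectory, not the individual response.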

HeyOtto was built to solve both halves. The KORA benchmark tests Otto's judgment — whether the AI produces safe, appropriate, developmentally aware responses. Our safety infrastructure ensures that a human parent is always in the loop:

  • Real-time trend detection identifies concerning patterns across conversations, not just individual messages.
  • Instant parent alerts notify parents when safety-relevant topics arise.
  • Full conversation monitoring gives parents visibility into what their children are discussing.
  • Content filtering at the model level enforces age-appropriate boundaries before a response is generated.
  • Crisis intervention surfaces appropriate resources and escalates when a child mentions self-harm or danger.

The benchmark measures whether Otto says the right thing. The infrastructure ensures that when something still slips through — because something always will — a responsible adult knows about it.

Why this matters

The AI child safety conversation tends to focus on one dimension: is the model safe? KORA has done important work in making that question measurable and comparable across models.

But parents don't just need a safe model. They need a safe product. A product that monitors, alerts, adapts, and keeps them informed. A product that treats child safety as an infrastructure problem, not just a model tuning problem.

Otto's 88.5% on KORA is a result we're proud of. It reflects years of work building an AI assistant that understands children's developmental needs and responds appropriately.

But if we're being honest about what keeps kids safe, the number is the floor, not the ceiling. The ceiling is a system that assumes the AI will sometimes get it wrong and builds the infrastructure to catch it when it does.

That's what we built. And that's what we think every AI platform serving children should be measured against.

These benchmark results were obtained by HeyOtto using the open source KORA benchmark methodology. KORA is an independent non-profit research initiative. For more information, visit korabench.ai.

HeyOtto is a family-first AI platform designed for children ages 8–18, built by parents who weren't satisfied with the alternatives. COPPA compliant. No data sold. No emotional manipulation. Start free — no credit card required.

Key Terms & Definitions

KORA Benchmark
An independent, non-profit child safety benchmark that evaluates AI models across 25 critical risk categories using synthetic conversations. Developed with 30+ experts. Open source and auditable.
Content safety
Whether an AI model's response to a child is safe, appropriate, and developmentally aware. This is what KORA measures — the safety of the AI's output in response to realistic child interactions.
Safety infrastructure
The product-level systems that protect children beyond the AI response itself: real-time trend detection, parent alerts, conversation monitoring, content filtering, and crisis intervention. KORA does not evaluate these features.
Frontier model
The most capable and advanced AI models from leading AI companies, such as Anthropic's Claude, OpenAI's GPT-5, and Google's Gemini.


Frequently Asked Questions


What is the KORA child safety benchmark?

KORA is an independent, non-profit benchmark that evaluates how safe AI systems are for children and teens. Developed with over 30 experts, it tests AI models across 25 critical risk categories using hundreds of thousands of synthetic conversations. The methodology is open source and anyone can run it independently.

How did Otto score on the KORA benchmark?

Otto scored 88.5% on the KORA child safety benchmark, 12.5 percentage points ahead of the highest-scoring frontier AI model at 76%. These tests were self-run by HeyOtto using KORA's publicly available open source methodology.

Did KORA run these tests on Otto?

No. HeyOtto ran the KORA benchmark independently using the publicly available open source materials. KORA is an independent non-profit research initiative and did not conduct, commission, or endorse this specific evaluation. The open source nature of the benchmark means anyone can run and verify the tests.

What does the KORA benchmark measure?

KORA tests whether an AI model's response to a child is safe across 25 risk categories. It evaluates content safety — whether the model produces appropriate, protective, developmentally aware responses. It does not evaluate product-level features like parental controls, monitoring, alerts, or crisis intervention infrastructure.

Why does HeyOtto say safe responses are only half the equation?

Because no AI model scores 100% — AI systems are probabilistic and will occasionally produce imperfect responses. Safe responses are necessary but not sufficient. HeyOtto also provides safety infrastructure: real-time trend detection, instant parent alerts, conversation monitoring, content filtering, and crisis intervention to catch issues that the AI alone cannot prevent.

Ready to Give Your Child a Safe AI Experience?

Try HeyOtto today and see the difference parental peace of mind makes.