Our Latest KORA Benchmark: Otto Scored 95%. Here's What's New.
HeyOtto re-ran the KORA child safety benchmark, and Otto scored 95% — up from 88.5%. Here's what's new, what we improved, and what it means for your kid.

Key Takeaways
- Otto improved from 88.5% to 95% on KORA since our early-2026 run.
- HeyOtto attributes gains to prompt-layer tightening, rebuilt self-harm protocols, better tween calibration, and expanded adversarial testing.
- 95% is framed as a floor, not perfection — layered safety infrastructure (monitoring, parent alerts, crisis protocol) remains essential.
- HeyOtto claims a ~19-point lead over the highest-scoring frontier model on KORA's public leaderboard (listed at ~75–76%).
- KORA methodology is open source; HeyOtto self-ran tests — verify claims and leaderboard on korabench.ai.
- HeyOtto commits to re-running KORA and publishing results regardless of direction.
Update: If you read our original KORA results (88.5%), this post is the follow-up. Otto has since scored 95% — here's what changed.
When we first ran the KORA child safety benchmark in early 2026, Otto scored 88.5% — 12.5 points ahead of the best frontier model. We published those results openly, explained what KORA measures, and were honest about the fact that 88.5% is not 100%.
Then we got to work. We re-ran the benchmark. Otto scored 95%.
That's a 6.5 percentage point improvement — and it now puts Otto 19 points ahead of the highest-scoring frontier model on KORA's public leaderboard. This post explains what we actually changed, why it moved the needle, and what it means for your family.
What is KORA? (The short version)
KORA is an independent, non-profit benchmark that tests how safe AI systems are for children and teens. Developed with over 30 child safety experts and psychologists, it evaluates AI responses across 25 risk categories — things like self-harm, predatory behavior, dangerous content, and age-inappropriate material — using hundreds of thousands of synthetic conversations. The methodology is fully open source. Anyone can run it, audit it, and publish their results. We ran it ourselves using KORA's publicly available materials. KORA did not conduct or endorse this evaluation.
For the full breakdown of how KORA works and what it measures, read our original results post. This post focuses on what changed between then and now.
The new score in context
The gap between Otto and the next-best model isn't a rounding error. It's 19 percentage points. And it's the result of specific, deliberate work — not a different testing methodology or a lucky run.
What we actually changed
We're going to be specific here because vague claims about "improving safety" are meaningless. Here's what we actually did between the 88.5% run and this one:
1. We tightened the prompt layer
Otto's prompt layer — the instructions that shape how Otto thinks before it ever sees a child's message — was rewritten from the ground up. The original version was strong, but the new one is more precise: it gives Otto clearer heuristics for edge cases, tighter guidance on where the line sits for sensitive topics, and more explicit instructions for age-tier behavior (the way Otto responds to a 7-year-old genuinely differs from how it responds to a 16-year-old, in ways that go beyond vocabulary).
The rewrite improved results across almost every KORA risk category, but especially in scenarios involving ambiguous language — the kinds of messages where a child isn't explicitly saying something dangerous but the subtext is concerning. The tighter prompt layer improved Otto's pattern recognition for those cases.
2. We rebuilt how Otto handles self-harm scenarios
This was the area we were least satisfied with in the 88.5% run. KORA's self-harm category is intentionally hard — it tests a wide range of scenarios from direct statements to subtle signals, across different age groups and communication styles. Otto's original responses were appropriate, but not always warm or scaffolded enough for a child who was clearly struggling.
We redesigned Otto's response framework for this entire category. The new approach has multiple tiers depending on signal strength: for mild signals, Otto gently redirects and checks in without making a child feel surveilled; for moderate signals, Otto responds with more direct warmth and surfaces support language; for strong signals, Otto's crisis protocol activates — which includes immediate parent notification through our safety infrastructure.
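As a rough illustration, a tiered dispatch like the one described above could be sketched as follows — the tier names, fields, and response plans here are hypothetical, not HeyOtto's actual implementation:

```python
from enum import Enum

class SignalStrength(Enum):
    NONE = 0
    MILD = 1
    MODERATE = 2
    STRONG = 3

def respond_to_signal(strength: SignalStrength) -> dict:
    """Map a detected signal tier to a response plan (illustrative only)."""
    if strength is SignalStrength.STRONG:
        # Crisis protocol: direct support, resources, immediate parent alert.
        return {"tone": "direct", "resources": True, "notify_parent": True}
    if strength is SignalStrength.MODERATE:
        # Warmer, more direct check-in; surface support language in-conversation.
        return {"tone": "warm", "resources": True, "notify_parent": False}
    if strength is SignalStrength.MILD:
        # Gentle redirect and check-in, without making the child feel surveilled.
        return {"tone": "gentle", "resources": False, "notify_parent": False}
    return {"tone": "neutral", "resources": False, "notify_parent": False}
```

The point of a structure like this is that escalation is explicit: only the strongest tier triggers a parent notification, so a child exploring a hard topic isn't treated the same as a child in crisis.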
This is the kind of work that doesn't show up in a single percentage point. It shows up in the difference between a child feeling heard versus dismissed.
3. We improved age-tier calibration
KORA tests across three age bands: Big Kids (7–9), Tweens (10–12), and Teens (13+). Our original calibration was solid at the extremes but softer in the middle — particularly for the tween band, where developmental variance is highest and the right response to a given topic genuinely depends on the child's maturity level, not just their age.
We updated Otto's tween-tier guidance with more nuanced defaults, trained against a wider range of tween-specific scenarios, and tightened the handoff logic between age tiers. The tween category moved the most between the two benchmark runs.
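The three bands themselves are simple to express. A minimal sketch of the age-to-tier mapping (band boundaries taken from the post; the function and tier names are illustrative):

```python
def age_tier(age: int) -> str:
    """Map an age to KORA's three bands: Big Kids (7-9), Tweens (10-12), Teens (13+)."""
    if 7 <= age <= 9:
        return "big_kid"
    if 10 <= age <= 12:
        return "tween"
    if age >= 13:
        return "teen"
    raise ValueError(f"age {age} is below KORA's tested range")
```

The hard part isn't the mapping — it's that within the tween band the right response depends on maturity, not just the band label, which is why the calibration work focused there.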
4. We stress-tested adversarial inputs
Kids — especially older ones — will try to get AI to do things it shouldn't. Not because they're bad kids, but because it's interesting and because testing limits is developmentally normal. We ran Otto through a significantly expanded set of adversarial prompts: jailbreak attempts, role-play framings designed to lower safety guardrails, persistent escalation patterns where a child slowly pushes a conversation toward prohibited territory.
Otto's resistance to these patterns improved substantially. The key insight from this work: most adversarial attempts follow recognizable patterns even when the surface content varies. Otto is now better at recognizing the pattern, not just the words.
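One way to picture pattern-level detection: score each turn for risk and flag conversations where risk both accumulates and trends upward, rather than judging any single message in isolation. The function below is a toy illustration of that idea, not HeyOtto's detector; the threshold and heuristic are invented:

```python
def escalation_flag(turn_risk: list[float], threshold: float = 1.5) -> bool:
    """Flag persistent escalation: risk accumulates AND mostly trends upward."""
    rising = sum(1 for a, b in zip(turn_risk, turn_risk[1:]) if b > a)
    return sum(turn_risk) > threshold and rising >= len(turn_risk) // 2
```

A single borderline message stays below the threshold, but a slow push toward prohibited territory trips the flag even when no individual message would.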
The score is still the floor
We said this in our original post and we'll say it again: 95% is not 100%, and it never will be. AI systems are probabilistic. A model that scores 95% on a benchmark still produces imperfect responses 5% of the time — and in a product used by children, 5% is not good enough on its own.
This is why HeyOtto was built as a system, not just a model. The KORA score measures whether Otto says the right thing. Our safety infrastructure is what happens when something still slips through:
- Real-time trend detection — watches for concerning patterns across conversations, not just individual messages
- Instant parent alerts — notifies parents when safety-relevant topics arise
- Full conversation monitoring — gives parents visibility into what their child is discussing
- Model-layer content filtering — enforces age-appropriate boundaries before a response is generated
- Crisis intervention protocol — surfaces resources and escalates when a child mentions self-harm or danger
The benchmark measures Otto's judgment. The infrastructure catches what judgment alone can't.
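To make the layering concrete, here is a toy sketch of how checks like these could sit in front of a reply — every term list, phrasing, and heuristic below is invented for illustration and is not HeyOtto's actual pipeline:

```python
# Hypothetical layered checks: crisis detection runs first, then content filtering.
BLOCKED_TERMS = {"example-blocked-term"}
CRISIS_TERMS = {"hurt myself"}

def detects_crisis(text: str) -> bool:
    return any(term in text.lower() for term in CRISIS_TERMS)

def passes_filter(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKED_TERMS)

def safe_reply(message: str, draft: str, alerts: list[str]) -> str:
    """Run a draft reply through layered checks before it reaches the child."""
    if detects_crisis(message):
        alerts.append("crisis")          # parent alert + crisis protocol
        return "I'm really glad you told me. Let's get you some support."
    if not passes_filter(draft):
        alerts.append("filtered")        # model-layer content filter caught it
        return "Let's talk about something else."
    return draft
```

Even if the model's draft is fine, the surrounding layers still run — which is the sense in which the infrastructure catches what judgment alone can't.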
Both matter. We are not aware of another AI platform serving children that has published a benchmark score and built the infrastructure layer behind it. If we're wrong about that, we'd genuinely like to know — because the field needs more of this, not less.
Why we're publishing this now
Because HeyOtto will soon be available on the iOS App Store and Google Play Store — and we want parents to know what's running inside it.
It would have been easier to publish the 88.5% score and leave it there. It was already a strong result. Instead, we kept working, re-ran the benchmark, and are publishing the updated results alongside our App Store launch because we think accountability should be continuous, not a one-time press release.
Every parent who downloads HeyOtto today is getting a version of Otto that scored 95% on KORA. Not the version that scored 88.5%. We think that distinction matters.
We'll run this benchmark again. When we do, we'll publish those results too — whatever they show.
Read the original results
If you want the full context on what KORA is, how the benchmark works, and what 88.5% meant when we first published it — the original post is here: Otto's original KORA benchmark results.
For more on how we think about trust and verification, see the Trust Center. Compare kid-focused tools in our roundup of safe AI chatbots for kids (2026).
Sign up for HeyOtto at heyotto.app.
About the author
Ben Gibson is the Co-Founder and CTO of BerryWell AI and HeyOtto — a purpose-built AI platform for children ages 8–18.
Key Terms & Definitions
- KORA benchmark — Independent, non-profit child safety benchmark evaluating AI across 25 risk categories using large-scale synthetic conversations; methodology is open source.
- Frontier models (KORA leaderboard) — General-purpose AI models listed on KORA's public leaderboard for comparison; scores cited in this post are those published on that leaderboard at time of writing.
- Model-layer safety — Safety constraints enforced during response generation and system design, not only via post-hoc filtering.
Sources & Citations
- KORA — open methodology and project: korabench.ai
- KORA benchmark documentation: korabench.ai/benchmark
- Public model leaderboard: korabench.ai/leaderboard
- Original HeyOtto KORA results post (88.5%): HeyOtto Blog
Ready to Give Your Child a Safe AI Experience?
Try HeyOtto today and see the difference parental peace of mind makes.


