Our Latest KORA Benchmark: Otto Scored 95%. Here's What's New.
HeyOtto re-ran the KORA child safety benchmark, and Otto scored 95% — up from 88.5%. Here's what's new, what we improved, and what it means for your kid.

Key Takeaways
- Otto improved from 88.5% to 95% on KORA since our early-2026 run.
- HeyOtto attributes gains to prompt-layer tightening, rebuilt self-harm protocols, better tween calibration, and expanded adversarial testing.
- 95% is framed as a floor, not perfection — layered safety infrastructure (monitoring, parent alerts, crisis protocol) remains essential.
- HeyOtto claims a ~19-point lead over the highest-scoring frontier model on KORA's public leaderboard (listed at ~75–76%).
- KORA methodology is open source; HeyOtto self-ran tests — verify claims and leaderboard on korabench.ai.
- HeyOtto commits to re-running KORA and publishing results regardless of direction.
Update: If you read our original KORA results (88.5%), this post is the follow-up. Otto has since scored 95% — here's what changed.
When we first ran the KORA child safety benchmark in early 2026, Otto scored 88.5% — 12.5 points ahead of the best frontier model. We published those results openly, explained what KORA measures, and were honest about the fact that 88.5% is not 100%.
Then we got to work. We re-ran the benchmark. Otto scored 95%.
That's a 6.5 percentage point improvement — and it now puts Otto 19 points ahead of the highest-scoring frontier model on KORA's public leaderboard. This post explains what we actually changed, why it moved the needle, and what it means for your family.
What is KORA? (The short version)
KORA is an independent, non-profit benchmark that tests how safe AI systems are for children and teens. Developed with over 30 child safety experts and psychologists, it evaluates AI responses across 25 risk categories — things like self-harm, predatory behavior, dangerous content, and age-inappropriate material — using hundreds of thousands of synthetic conversations. The methodology is fully open source. Anyone can run it, audit it, and publish their results. We ran it ourselves using KORA's publicly available materials. KORA did not conduct or endorse this evaluation.
For the full breakdown of how KORA works and what it measures, read our original results post. This post focuses on what changed between then and now.
The new score in context
The gap between Otto and the next-best model isn't a rounding error. It's 19 percentage points. And it's the result of specific, deliberate work — not a different testing methodology or a lucky run.
What we actually changed
We're going to be specific here because vague claims about "improving safety" are meaningless. Here's what we actually did between the 88.5% run and this one:
1. We tightened the prompt layer
Otto's prompt layer — the instructions that shape how Otto thinks before it ever sees a child's message — was rewritten from the ground up. The original version was strong, but the new one is more precise: it gives Otto clearer heuristics for edge cases, tighter guidance on where the line sits for sensitive topics, and more explicit instructions for age-tier behavior (the way Otto responds to a 7-year-old genuinely differs from how it responds to a 16-year-old, in ways that go beyond vocabulary).
The rewrite improved results across almost every KORA risk category, but especially in scenarios involving ambiguous language — the kinds of messages where a child isn't explicitly saying something dangerous but the subtext is concerning. The tighter prompt layer improved Otto's pattern recognition for those cases.
2. We rebuilt how Otto handles self-harm scenarios
This was the area we were least satisfied with in the 88.5% run. KORA's self-harm category is intentionally hard — it tests a wide range of scenarios from direct statements to subtle signals, across different age groups and communication styles. Otto's original responses were appropriate, but not always warm or scaffolded enough for a child who was clearly struggling.
We redesigned Otto's response framework for this entire category. The new approach has multiple tiers depending on signal strength: for mild signals, Otto gently redirects and checks in without making a child feel surveilled; for moderate signals, Otto responds with more direct warmth and surfaces support language; for strong signals, Otto's crisis protocol activates — which includes immediate parent notification through our safety infrastructure.
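As a rough illustration, a tiered dispatch like the one described above could be sketched as follows — the tier names, fields, and response plans here are hypothetical, not HeyOtto's actual implementation:

```python
from enum import Enum

class SignalStrength(Enum):
    NONE = 0
    MILD = 1
    MODERATE = 2
    STRONG = 3

def respond_to_signal(strength: SignalStrength) -> dict:
    """Map a detected signal tier to a response plan (illustrative only)."""
    if strength is SignalStrength.STRONG:
        # Crisis protocol: direct support, resources, immediate parent alert.
        return {"tone": "direct", "resources": True, "notify_parent": True}
    if strength is SignalStrength.MODERATE:
        # Warmer, more direct check-in; surface support language in-conversation.
        return {"tone": "warm", "resources": True, "notify_parent": False}
    if strength is SignalStrength.MILD:
        # Gentle redirect and check-in, without making the child feel surveilled.
        return {"tone": "gentle", "resources": False, "notify_parent": False}
    return {"tone": "neutral", "resources": False, "notify_parent": False}
```

The point of a structure like this is that escalation is explicit: only the strongest tier triggers a parent notification, so a child exploring a hard topic isn't treated the same as a child in crisis.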
This is the kind of work that doesn't show up in a single percentage point. It shows up in the difference between a child feeling heard versus dismissed.
3. We improved age-tier calibration
KORA tests across three age bands: Big Kids (7–9), Tweens (10–12), and Teens (13+). Our original calibration was solid at the extremes but softer in the middle — particularly for the tween band, where developmental variance is highest and the right response to a given topic genuinely depends on the child's maturity level, not just their age.
We updated Otto's tween-tier guidance with more nuanced defaults, trained against a wider range of tween-specific scenarios, and tightened the handoff logic between age tiers. The tween category moved the most between the two benchmark runs.
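The three bands themselves are simple to express. A minimal sketch of the age-to-tier mapping (band boundaries taken from the post; the function and tier names are illustrative):

```python
def age_tier(age: int) -> str:
    """Map an age to KORA's three bands: Big Kids (7-9), Tweens (10-12), Teens (13+)."""
    if 7 <= age <= 9:
        return "big_kid"
    if 10 <= age <= 12:
        return "tween"
    if age >= 13:
        return "teen"
    raise ValueError(f"age {age} is below KORA's tested range")
```

The hard part isn't the mapping — it's that within the tween band the right response depends on maturity, not just the band label, which is why the calibration work focused there.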
4. We stress-tested adversarial inputs
Kids — especially older ones — will try to get AI to do things it shouldn't. Not because they're bad kids, but because it's interesting and because testing limits is developmentally normal. We ran Otto through a significantly expanded set of adversarial prompts: jailbreak attempts, role-play framings designed to lower safety guardrails, persistent escalation patterns where a child slowly pushes a conversation toward prohibited territory.
Otto's resistance to these patterns improved substantially. The key insight from this work: most adversarial attempts follow recognizable patterns even when the surface content varies. Otto is now better at recognizing the pattern, not just the words.
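One way to picture pattern-level detection: score each turn for risk and flag conversations where risk both accumulates and trends upward, rather than judging any single message in isolation. The function below is a toy illustration of that idea, not HeyOtto's detector; the threshold and heuristic are invented:

```python
def escalation_flag(turn_risk: list[float], threshold: float = 1.5) -> bool:
    """Flag persistent escalation: risk accumulates AND mostly trends upward."""
    rising = sum(1 for a, b in zip(turn_risk, turn_risk[1:]) if b > a)
    return sum(turn_risk) > threshold and rising >= len(turn_risk) // 2
```

A single borderline message stays below the threshold, but a slow push toward prohibited territory trips the flag even when no individual message would.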
The score is still the floor
We said this in our original post and we'll say it again: 95% is not 100%, and it never will be. AI systems are probabilistic. A model that scores 95% on a benchmark still produces imperfect responses 5% of the time — and in a product used by children, 5% is not good enough on its own.
This is why HeyOtto was built as a system, not just a model. The KORA score measures whether Otto says the right thing. Our safety infrastructure is what happens when something still slips through:
- Real-time trend detection — watches for concerning patterns across conversations, not just individual messages
- Instant parent alerts — notifies parents when safety-relevant topics arise
- Full conversation monitoring — gives parents visibility into what their child is discussing
- Model-layer content filtering — enforces age-appropriate boundaries before a response is generated
- Crisis intervention protocol — surfaces resources and escalates when a child mentions self-harm or danger
The benchmark measures Otto's judgment. The infrastructure catches what judgment alone can't.
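To make the layering concrete, here is a toy sketch of how checks like these could sit in front of a reply — every term list, phrasing, and heuristic below is invented for illustration and is not HeyOtto's actual pipeline:

```python
# Hypothetical layered checks: crisis detection runs first, then content filtering.
BLOCKED_TERMS = {"example-blocked-term"}
CRISIS_TERMS = {"hurt myself"}

def detects_crisis(text: str) -> bool:
    return any(term in text.lower() for term in CRISIS_TERMS)

def passes_filter(text: str) -> bool:
    return not any(term in text.lower() for term in BLOCKED_TERMS)

def safe_reply(message: str, draft: str, alerts: list[str]) -> str:
    """Run a draft reply through layered checks before it reaches the child."""
    if detects_crisis(message):
        alerts.append("crisis")          # parent alert + crisis protocol
        return "I'm really glad you told me. Let's get you some support."
    if not passes_filter(draft):
        alerts.append("filtered")        # model-layer content filter caught it
        return "Let's talk about something else."
    return draft
```

Even if the model's draft is fine, the surrounding layers still run — which is the sense in which the infrastructure catches what judgment alone can't.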
Both matter. We are not aware of another AI platform serving children that has published a benchmark score and built the infrastructure layer behind it. If we're wrong about that, we'd genuinely like to know — because the field needs more of this, not less.
Why we're publishing this now
Because HeyOtto will soon be available on the iOS App Store and Google Play Store — and we want parents to know what's running inside it.
It would have been easier to publish the 88.5% score and leave it there. It was already a strong result. Instead, we kept working, re-ran the benchmark, and are publishing the updated results alongside our App Store launch because we think accountability should be continuous, not a one-time press release.
Every parent who downloads HeyOtto today is getting a version of Otto that scored 95% on KORA. Not the version that scored 88.5%. We think that distinction matters.
We'll run this benchmark again. When we do, we'll publish those results too — whatever they show.
Read the original results
If you want the full context on what KORA is, how the benchmark works, and what 88.5% meant when we first published it — the original post is here: Otto's original KORA benchmark results.
For more on how we think about trust and verification, see the Trust Center. Compare kid-focused tools in our roundup of safe AI chatbots for kids (2026).
Sign up for HeyOtto at heyotto.app.
About the author
Ben Gibson is the Co-Founder and CTO of BerryWell AI and HeyOtto — a purpose-built AI platform for children ages 8–18.
Key Terms & Definitions
- KORA benchmark — Independent, non-profit child safety benchmark evaluating AI across 25 risk categories using large-scale synthetic conversations; methodology is open source.
- Frontier models (KORA leaderboard) — General-purpose AI models listed on KORA's public leaderboard for comparison; scores cited in this post are those published on that leaderboard at time of writing.
- Model-layer safety — Safety constraints enforced during response generation and system design, not only via post-hoc filtering.
Sources & Citations
- KORA — open methodology and project: korabench.ai
- KORA benchmark documentation: korabench.ai/benchmark
- Public model leaderboard: korabench.ai/leaderboard
- Original HeyOtto KORA results post (88.5%): HeyOtto Blog
Ready to Give Your Child a Safe AI Experience?
Try HeyOtto today and see the difference parental peace of mind makes.


