Baby AGI, Kid AGI, and the Locked Nursery Behind the Demo
Lede
The problem is not that AI is useless, but that the public still gets Baby AGI while something sharper may already be learning house rules upstairs.
Hermit Off Script
I was cursing AI models less than a year ago, and I hope I will curse them less next year, or maybe finally get that proper “wow” moment. Right now they still feel like children with more knowledge than they can actually sustain. They know a lot, sometimes too much, but they do not carry it with enough judgement. One minute they look smart, the next they do something so daft you would not trust them to sort a cupboard, never mind write code or take actions on their own.

Yes, public models are better than a year ago. Capability is moving fast. But better is not the same as trustworthy, and that is where the comedy still lives. OpenAI itself says hallucinations remain a fundamental challenge for all large language models, which is a polished way of saying the machine can still sound clever while making things up with a straight face.

Part of this is probably limits: processing power, memory, product restrictions, and plain old cost. Stanford HAI shows how quickly training costs have exploded, from about 670 dollars for the original Transformer to about 79 million dollars for GPT-4 and about 170 million dollars for Llama 3.1-405B. That alone tells you why the best toys may not be handed out to everyone at once.

Then the plot gets sharper. Anthropic has kept Claude Mythos Preview behind a gate, saying it has already identified thousands of zero-day vulnerabilities across critical infrastructure, while extending access to over 40 additional organisations and committing up to 100 million dollars in usage credits through Project Glasswing. Reuters says the rollout is restricted to defensive cybersecurity work. AISI says frontier models have now started completing expert-level cyber tasks that usually take more than 10 years of human experience. And Axios says OpenAI’s cybersecurity product for select partners is separate from an upcoming model reportedly called Spud, with Spud’s cyber role still unclear.
So the cleaner truth is this: the public still gets Baby AGI for the demo, while Kid AGI may already be upstairs behind a locked door.
What does not make sense
We are told the future is here, yet Anthropic is still keeping Mythos in a gated research preview and limiting access to selected defenders.
We are told current public models are ready to be trusted with more autonomy, yet OpenAI still says hallucinations remain a fundamental challenge.
We are told benchmark wins prove maturity, yet OpenAI says SWE-bench Verified is increasingly contaminated and no longer a good frontier measure.
We are told the mainstream version is basically the same as the serious one, yet both Anthropic and OpenAI are now using gated or tiered access for stronger cyber capability.
We are told cost is not the real issue, yet frontier training costs have gone from hundreds of dollars to tens or hundreds of millions. Funny how the nursery is always cheaper than the penthouse.
Sense check / The numbers
The original Transformer cost about 670 dollars to train in 2017, GPT-4 about 79 million dollars in 2023, and Llama 3.1-405B about 170 million dollars in 2024. [Stanford HAI]
OpenAI says GPT-5 has fewer hallucinations, but that hallucinations still occur and remain a fundamental challenge for all large language models. [OpenAI]
In cyber testing, AISI says frontier models now complete apprentice-level tasks about 50 per cent of the time on average, up from just over 10 per cent in early 2024, and in 2025 it tested the first model able to complete expert-level tasks usually requiring over 10 years of human experience. [AISI]
Anthropic says Mythos Preview has already identified thousands of zero-day vulnerabilities across critical infrastructure, while Project Glasswing has extended access to over 40 additional organisations and carries up to 100 million dollars in usage credits. [Anthropic]
OpenAI says its Trusted Access for Cyber pilot includes 10 million dollars in API credits for defensive work, which tells you the stronger tools are not being handed out like party favours. [OpenAI]
OpenAI says SWE-bench Verified rose from 74.9 per cent to 80.9 per cent in six months, then says the benchmark is increasingly contaminated, while Scale’s public SWE-Bench Pro page says top models score only around 23 per cent on the harder test. [OpenAI] [Scale Labs]
Axios reports that OpenAI’s select-partner cybersecurity product is separate from an upcoming model reportedly called Spud, and says Spud’s cyber capabilities and rollout are still unclear. [Axios]
The sketch
Scene 1: The Nursery of the Future
Panel description: A glossy tech launch. A giant chrome robot stands on stage in a tiny school blazer while executives clap beneath a banner reading “AUTONOMY”.
Dialogue: “We have built the digital adult.” “It just asked where the save button lives.”
Scene 2: The Benchmark Baptism
Panel description: Scientists in lab coats lower a model into a glowing font labelled “VERIFIED”, while a harder exam paper marked “PRO” sits ignored in the corner.
Dialogue: “It scored magnificently.” “On the test it may already have seen, yes.”
Scene 3: Enterprise Heaven, Public Beta Earth
Panel description: A velvet-rope lobby. Suits walk through a gold door marked “Private AGI”, while ordinary users queue outside with rate limits, pop-ups and a smoking error log.
Dialogue: “The future is here.” “Wonderful. Shame it lives upstairs.”
What to watch, not the show
Money, because training and deployment cost still decide who gets the serious model and who gets the smiling showroom version.
Incentives, because public bragging still leans on benchmarks that can flatter models more than they test them.
Gated access, because both Anthropic and OpenAI are already drawing lines around stronger cyber capability.
Media habits, because every jump gets sold like arrival, when most of the real story is still in the restrictions, caveats, and safety notes.
Long-term risk, because AISI’s curve is moving fast and the length of cyber tasks models can complete unassisted is doubling roughly every eight months.
Hallucinations, because “hopefully less next year” is still a prayer, not a guarantee.
The Hermit take
Baby AGI is for the applause. Kid AGI may already be upstairs, and we are still being sold the nursery brochure.
Keep or toss
Keep the progress and the caution. Toss the fantasy that a polished demo means adult judgement.
Sources
Stanford HAI, AI Index 2025 on training cost, agent benchmarks and frontier convergence: https://hai.stanford.edu/ai-index/2025-ai-index-report
OpenAI on hallucinations: https://openai.com/index/why-language-models-hallucinate/
OpenAI on Trusted Access for Cyber: https://openai.com/index/trusted-access-for-cyber/
OpenAI on SWE-Bench Verified contamination: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
Anthropic on Project Glasswing and Mythos Preview: https://www.anthropic.com/project/glasswing
Reuters on Anthropic’s restricted Mythos rollout: https://www.reuters.com/legal/litigation/anthropic-touts-ai-cybersecurity-project-with-big-tech-partners-2026-04-07/
UK AI Security Institute on rising frontier cyber capability: https://www.aisi.gov.uk/frontier-ai-trends-report
Scale Labs on SWE-Bench Pro public leaderboard results: https://labs.scale.com/leaderboard/swe_bench_pro_public
Axios on OpenAI’s restricted cyber product and the reported Spud model: https://www.axios.com/2026/04/09/openai-new-model-cyber-mythos-anthopic
Satire and commentary. Opinion pieces for discussion. Sources at the end. Not legal, medical, financial, or professional advice.