Since its introduction, Anthropic's Mythos AI model has been making headlines. It reportedly matches cybersecurity experts at finding security vulnerabilities in software such as browsers, operating systems, and even financial software. For now, the model is offered only to select organizations, but we already have a concrete idea of its performance.
For example, using this AI, the Mozilla Foundation discovered and fixed more than 400 vulnerabilities in Firefox. And METR, which has developed a method for evaluating the abilities of artificial intelligence models, says Mythos performs so well that METR will be forced to update that method in order to evaluate the model accurately.
AI too powerful for evaluation
This evaluation is based on the "duration" of tasks. In essence, METR estimates how long a human expert needs to complete each task (for example, a few seconds to answer a question, less than 6 minutes to check information on the web, or less than 16 hours to "reduce the size of a language model as much as possible"). The more capable an AI is, the longer the tasks it can complete with a 50% (or 80%) chance of success.
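The idea behind this metric can be illustrated with a toy model. The sketch below is an assumption for illustration only, not METR's actual methodology: it treats a model's success probability as a logistic curve over log task duration, so the "time horizon" is the task length at which success drops to 50%. The `slope` parameter is hypothetical.

```python
import math

def success_prob(task_minutes, horizon_minutes, slope=1.0):
    """Toy logistic success curve: exactly 50% at the horizon,
    higher for shorter tasks, lower for longer ones."""
    x = slope * (math.log(horizon_minutes) - math.log(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))

# A hypothetical model with a 16-hour (960-minute) 50% time horizon:
print(success_prob(6, 960))    # a ~6-minute task: high success chance
print(success_prob(960, 960))  # a task at the horizon: 0.5
```

Under this toy curve, raising a model's horizon shifts the whole success curve toward longer tasks, which is why METR reports a single duration per model rather than a per-task score.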
Having gained early access to Claude Mythos, METR evaluated the model using this method. All that can be said with certainty is that the AI can complete tasks requiring more than 16 hours with a 50% chance of success. Beyond that, the picture blurs: to measure Claude Mythos's abilities more precisely, the evaluation suite would need new, even harder tasks. "Measurements longer than 16 hours are not reliable with our current task set," METR writes on its website.
"Of the 228 tasks in our set, only 5 are estimated to take more than 16 hours, making measurements in this range unstable and less meaningful than in ranges where task coverage is better. Therefore, we do not highlight exact estimates for models that exceed 16 hours as measured with our current task set," METR adds on X.
— METR (@METR_Evals) May 8, 2026