
Gemini 3 is Google’s latest AI model (Image: VCG via Getty Images)
Google’s latest chatbot, Gemini 3, has made significant leaps on a raft of benchmarks designed to measure AI progress, according to the company. These achievements may be enough to allay fears of an AI bubble bursting, at least for the moment, but it is unclear how well the scores translate into real-world capabilities.
What’s more, the persistent factual inaccuracies and hallucinations that have become a hallmark of large language models show no signs of being ironed out, which could prove problematic for any use where reliability is vital.
In a blog post announcing the new model, Google bosses Sundar Pichai, Demis Hassabis and Koray Kavukcuoglu write that Gemini 3 has “PhD-level reasoning”, a phrase that competitor OpenAI also used when it announced its GPT-5 model. As evidence, they list scores on several tests designed to assess “graduate-level” knowledge, such as Humanity’s Last Exam, a set of 2500 research-level questions from maths, science and the humanities. Gemini 3 scored 37.5 per cent on this test, outclassing the previous record holder, a version of OpenAI’s GPT-5, which scored 26.5 per cent.
Jumps like this can indicate that a model has become more capable in certain respects, says Luc Rocher at the University of Oxford, but we need to be careful about how we interpret these results. “If a model goes from 80 per cent to 90 per cent on a benchmark, what does it mean? Does it mean that a model was 80 per cent PhD level and now is 90 per cent PhD level? I think it’s quite difficult to understand,” they say. “There is no number that we can put on whether an AI model has reasoning, because this is a very subjective notion.”
Benchmark tests have many limitations, such as requiring a single answer or multiple choice answers for which models don’t need to show their working. “It’s very easy to use multiple choice questions to grade [the models],” says Rocher, “but if you go to a doctor, the doctor will not assess you with a multiple choice. If you ask a lawyer, a lawyer will not give you legal advice with multiple choice answers.” There is also a risk that the answers to such tests were hoovered up in the training data of the AI models being tested, effectively letting them cheat.
The real test for Gemini 3 and the most advanced AI models – and of whether their performance will be enough to justify the trillions of dollars that companies like Google and OpenAI are spending on AI data centres – will be how people use the model and how reliable they find it, says Rocher.
Google says the model’s improved capabilities will make it better at producing software, organising email and analysing documents. The firm also says it will improve Google Search by supplementing AI-generated results with graphics and simulations.
Initial reactions online have included people praising Gemini’s coding capabilities and ability to reason, but as with all new model releases, there have also been posts highlighting failures at apparently simple tasks, such as tracing hand-drawn arrows pointing to different people, or at simple visual reasoning tests.
Google admits, in Gemini 3’s technical specifications, that the model will continue to hallucinate and produce factual inaccuracies some of the time, at a rate roughly comparable with that of other leading AI models. The lack of improvement in this area is a big concern, says Artur d’Avila Garcez at City St George’s, University of London. “The problem is that all AI companies have been trying to reduce hallucinations for more than two years, but you only need one very bad hallucination to destroy trust in the system for good,” he says.