AI Just Scored 136 on an IQ Test. The Offline Data Tells a Different Story.

GPT-5.5 hit 136 on an AI IQ leaderboard. On a private test it has never seen before, the score drops 20 to 40 points. Here is what that gap actually means.

IQ & Intelligence · June 3, 2026 · 7 min read

The Number That Spread Everywhere

In May 2026, a leaderboard at aiiq.org showed GPT-5.5 with an estimated IQ of 136. Within days the number was on Reddit, LinkedIn, and X. Headlines followed. "AI has surpassed human genius." "The age of cognitive supremacy is over."

We looked at the methodology. What we found is more interesting than any of those takes.

Where the 136 Actually Comes From

aiiq.org was built by Ryan Shea, an engineer and entrepreneur. The site does not administer a traditional IQ test to AI models. Instead, it aggregates performance across roughly 18 public benchmarks, including ARC Prize and FrontierMath, maps those results onto five reasoning dimensions, and converts the composite into a human IQ-scale score using calibrated difficulty curves and a normal distribution.
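
As a rough illustration of the last step only, the sketch below maps a composite percentile onto the IQ scale through a normal distribution. It is not aiiq.org's actual method: the dimension names, the placeholder values, the unweighted average, and the mean-100, SD-15 scale are all assumptions made for the example, and the calibrated difficulty curves are left out entirely.

```python
from statistics import NormalDist, mean

# Assumption: the conventional IQ scale (mean 100, standard deviation 15).
IQ_SCALE = NormalDist(mu=100, sigma=15)

# Hypothetical inputs: the share of humans a model is estimated to outperform
# on each of five reasoning dimensions. Names and values are placeholders,
# not data from aiiq.org.
ILLUSTRATIVE_PERCENTILES = {
    "dimension_1": 0.99,
    "dimension_2": 0.98,
    "dimension_3": 0.995,
    "dimension_4": 0.97,
    "dimension_5": 0.99,
}


def percentile_to_iq(share_below: float) -> float:
    """Return the IQ-scale score that the given share of people score below."""
    # inv_cdf requires 0 < share_below < 1.
    return IQ_SCALE.inv_cdf(share_below)


def composite_iq(percentiles: dict) -> float:
    """Average the per-dimension percentiles, then map the result to IQ."""
    # Unweighted mean for simplicity; a real calibration may weight dimensions.
    return percentile_to_iq(mean(list(percentiles.values())))


if __name__ == "__main__":
    print(f"Composite estimate: {composite_iq(ILLUSTRATIVE_PERCENTILES):.1f}")
```

The point of the sketch is that the final number is a statistical mapping from benchmark performance onto a human scale, which is why it counts as an estimate and not a test result.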

The result is an estimate. A carefully constructed one. But an estimate.

Shea's methodology is published on the site. The calibration is documented. The 136 is not invented. It is also not the same as sitting a model down with a WAIS-IV or a proctored Mensa assessment. Those are different instruments measuring different things, and conflating them is where most of the hot takes go wrong.

The More Direct Test

A second project, TrackingAI.org, run by journalist Maxim Lott, takes a more hands-on approach. Lott actually feeds AI models the public Mensa Norway test. 35 visual pattern recognition puzzles. 25-minute time limit. The same test any person can take right now at test.mensa.no.

GPT-5.5 scores in the high 130s to low 140s on that test, depending on which mode is used. The ceiling on the Mensa Norway test is 145. Roughly the top 2 percent of humans score 130 or above, and a score around 131 is where Mensa qualification territory begins. We break down what an IQ of 130 means for humans specifically if you want that context.

By those numbers, GPT-5.5 sits near the very top of the human distribution. The data looks unambiguous.

There is one more number. It changes the picture significantly.

The Part Getting Almost No Coverage

TrackingAI also runs an offline test. Private questions. Problems that have never appeared on the public internet, meaning they almost certainly do not exist in any model's training data.

On that test, AI scores drop by 20 to 40 points.

That is not a rounding error. On a scale with a mean of 100 and a standard deviation of 15, a drop of about 35 points is the difference between the 99th percentile and the 50th. Between Mensa-qualifying and population average.
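
To make the size of that drop concrete, here is a minimal sketch that converts scores to percentiles. It assumes the conventional IQ scale (mean 100, standard deviation 15) and, like the comparison in this article, subtracts the 20 to 40 point drop from the 136 leaderboard figure. The function names are ours, chosen for illustration.

```python
from statistics import NormalDist

# Assumption: scores follow the conventional IQ scale (mean 100, SD 15).
IQ_SCALE = NormalDist(mu=100, sigma=15)


def percentile(iq: float) -> float:
    """Percentage of the population expected to score below `iq`."""
    return 100 * IQ_SCALE.cdf(iq)


def after_drop(public_score: float, drops=(20, 40)) -> dict:
    """Map each drop size to (resulting score, percentile of that score)."""
    return {
        drop: (public_score - drop, percentile(public_score - drop))
        for drop in drops
    }


if __name__ == "__main__":
    print(f"136 -> {percentile(136):.1f}th percentile")
    for drop, (score, pct) in after_drop(136).items():
        print(f"-{drop} points -> {score:.0f} ({pct:.1f}th percentile)")
```

Under that assumption, 136 sits at roughly the 99th percentile, a 20-point drop lands near the 86th, and a 40-point drop lands around the 40th, slightly below the population average.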

GPT-5.5: Public vs. Offline IQ Score

Public Leaderboard (aiiq.org): 136 (top 1% of humans)
Private Offline Test (TrackingAI): 96–116 (average to above average)
Change: -20 to -40 points when tested on novel questions it has never encountered in training data

Source: aiiq.org (benchmark composite) · TrackingAI.org (private offline test)

The Mensa Norway test has been publicly available for years. Every frontier AI model was trained on billions of web pages. The probability that GPT-5.5 encountered the Mensa Norway test, its questions, and its answer patterns during training is high. When it scores 140 on that public test, the score reflects two things at once: genuine reasoning ability, and pattern matching against memorised material. The offline test is the only measurement that separates those two. And on the offline test, the lead shrinks dramatically.

We found this more significant than anything else in the current AI IQ conversation. It has received far less coverage than its importance warrants.

What AI Is Genuinely Good At

None of this makes the capabilities unimpressive. They are genuinely impressive.

On formal pattern recognition, mathematical reasoning, and symbolic logic, frontier models reach levels most humans do not. GPT-5.5 reportedly scored 51.7 percent on FrontierMath Tier 1-3, a benchmark explicitly designed to resist easy answers. These are real evaluations producing real numbers, not extrapolated marketing claims.

AI also does not experience test anxiety or cognitive fatigue. It does not rush, does not miscount rows under time pressure, does not have an off day. On the matrix reasoning and pattern recognition tasks that form the backbone of tests like the Mensa Norway assessment, it processes structures faster than any human and makes fewer careless errors. Those strengths are real.

Figure: Pattern matching against memorised training data (left) versus fluid reasoning on genuinely novel problems (right). The offline test measures the latter, and that is where the gap between AI and human performance narrows significantly.

Where the Advantage Disappears

Novelty is the variable. On genuinely new problems, ones that could not have appeared in training data, AI performance falls faster and further than it does for high-IQ humans.

The offline test data makes this concrete. A 20 to 40 point drop when the questions are novel is a significant finding. It points directly to the distinction between fluid and crystallised intelligence. Fluid intelligence is specifically the ability to reason through new problems without relying on prior knowledge. Crystallised intelligence is applying accumulated knowledge. The benchmarks that produced the 136 reward performance on problems that are, by definition, already in the public domain. The offline test targets something closer to fluid reasoning. That is where the gap opens up.

The strongest real-world correlates of IQ, things like performance under uncertainty and adaptation to unfamiliar problems, come from fluid reasoning. Not from pattern matching against material encountered in training. The offline test gap is not a technicality. It is the measurement that matters most for any honest comparison.

What Your Score Actually Measures

If you have taken an IQ test, the conditions were fundamentally different from the ones producing the AI leaderboard numbers.

A valid IQ assessment presents you with problems you have not seen before, under time pressure, without access to any reference material. Those are precisely the conditions under which AI performs worst. The offline test drop is not an anomaly. It is the closest approximation we have to a genuinely fair comparison between human and AI reasoning.

Your score on a properly designed test reflects your ability to reason through novel material. Not your ability to recognise patterns from problems you have almost certainly encountered before. That distinction is significant, and it is the one missing from most of the coverage we have seen.

GPT-5.5 may reach 136 on a benchmark leaderboard built on public data. On questions it has never seen, it likely performs much closer to average. That is worth knowing before drawing conclusions about where human and artificial intelligence actually stand relative to each other.

If you want to see where your own reasoning sits on a test built entirely on novel problems, our free IQ test uses 36 questions you will not have seen before. No training data advantage possible. Just you and the problems.

That is still the real test.

AJ Dorey


Founder & Researcher, IQScore

AJ Dorey is an English developer and cognitive science researcher. He built IQScore because most online IQ tests are broken — they either inflate scores to keep people happy or bury results behind a paywall after 20 minutes of questions.
