AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

For decades, the IQ test has been one of the most familiar, and most contested, yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world’s most powerful language models and plotting them on a standard bell curve.

The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible, and sharp criticism from researchers and commentators who warn the entire framework is misleading.

“This is super useful,” wrote Thibaut Mélen, a technology commentator, on X. “Much easier to understand model progress when it’s mapped like this instead of another giant leaderboard table.”

Brian Vellmure, a business strategist, offered a similar endorsement: “This is helpful. Anecdotally tracks with personal experience.”

But the backlash arrived just as quickly. “It’s nonsense. AI is far too jagged. The map is not the territory,” posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model’s sprawling, uneven capabilities to a single number creates a dangerous illusion of precision.

More than 50 AI language models, plotted on a standard IQ bell curve by the site AI IQ. The most capable models crowd the right tail of the distribution. (Credit: AI IQ)

Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works

AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University.

The site’s methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
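
As a minimal sketch of that formula (not the site’s own code), the composite can be computed as a plain mean of four dimension scores. The function name, dictionary keys, and example numbers below are illustrative assumptions.

```python
from statistics import mean

# The four reasoning dimensions named in the methodology.
# The short key names are an assumption made for this sketch.
DIMENSIONS = ("abstract", "math", "prog", "acad")


def composite_iq(dimension_iq: dict) -> float:
    """Return the straight average of the four dimension IQs."""
    missing = [d for d in DIMENSIONS if d not in dimension_iq]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(dimension_iq[d] for d in DIMENSIONS)


# Hypothetical dimension scores, for illustration only.
example = {"abstract": 120.0, "math": 130.0, "prog": 126.0, "acad": 124.0}
print(composite_iq(example))  # 125.0
```

Because every dimension carries the same weight of one quarter, a model that is strong in three dimensions and weak in one is pulled down by exactly a quarter of that gap.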

The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity’s Last Exam, CritPt, and GPQA Diamond.

Each raw benchmark score gets mapped to an implied IQ through what the site describes as “hand-calibrated difficulty curves.” Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that “every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission.”
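
The site does not publish its curves or its exact rule for missing data, so the following is only a sketch of how such a pipeline might behave. It assumes piecewise-linear curves, made-up anchor points and ceilings, and a hypothetical `missing_penalty` that fills an absent dimension with a value below the model’s weakest observed dimension.

```python
from bisect import bisect_right
from statistics import mean

DIMENSIONS = ("abstract", "math", "prog", "acad")


def implied_iq(raw_score, curve, ceiling):
    """Map a raw benchmark score to an implied IQ, capped at a ceiling.

    `curve` is a sorted list of (raw_score, iq) anchor points.
    The piecewise-linear shape is an assumption of this sketch.
    """
    xs = [x for x, _ in curve]
    if raw_score <= xs[0]:
        iq = curve[0][1]
    elif raw_score >= xs[-1]:
        iq = curve[-1][1]
    else:
        i = bisect_right(xs, raw_score)
        (x0, y0), (x1, y1) = curve[i - 1], curve[i]
        iq = y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
    return min(iq, ceiling)


def derived_iq(dimension_iq, missing_penalty=10.0):
    """Average all four dimensions, filling gaps conservatively.

    Returns None when fewer than two dimensions have scores. The fill rule
    (weakest observed dimension minus a penalty) is hypothetical.
    """
    available = [v for v in dimension_iq.values() if v is not None]
    if len(available) < 2:
        return None
    fill = min(available) - missing_penalty
    return mean(
        dimension_iq.get(d) if dimension_iq.get(d) is not None else fill
        for d in DIMENSIONS
    )


# Made-up curves: an "easy" benchmark capped at 100, a "hard" one with headroom.
EASY_CURVE = [(0, 70), (50, 90), (100, 110)]
HARD_CURVE = [(0, 90), (50, 130), (100, 160)]

print(implied_iq(90, EASY_CURVE, ceiling=100))   # 100
print(implied_iq(25, HARD_CURVE, ceiling=160))   # 110.0
print(derived_iq({"abstract": 110, "math": 120, "prog": None, "acad": None}))  # 107.5
```

In this sketch a near-perfect score on the easy benchmark cannot lift the implied IQ past 100, and the model with two missing dimensions ends at 107.5 rather than the 115 it would get from averaging only the scores it has.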

OpenAI leads the bell curve, but the gap between the top AI models has never been smaller

As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier, and widening diversity in the tiers below.

According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136, the highest of any model tracked. It is closely followed by GPT-5.4 (roughly 131), Opus 4.7 from Anthropic (roughly 132), and Opus 4.6 (roughly 129). Google’s Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight.

That compression is not unique to AI IQ’s framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that “the biggest takeaway is how compressed the top of the leaderboard has become.” On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141.

Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs (Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7) bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don’t need the best model for every task. One X user, ovsky, noted that the data “confirms experience with sonnet 4.6 being an absolute workhorse versus opus 4.5,” pointing to the way the charts can validate practitioner intuitions that headline rankings often miss.

The trajectory of frontier AI models from October 2023 to mid-2026, as tracked by AI IQ. Provider-colored step-lines connect each lab’s flagship releases, showing roughly 60 points of estimated IQ improvement in 30 months. (Credit: AI IQ)

Why emotional intelligence scores are becoming the new battleground in AI model rankings

What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an “EQ” (emotional intelligence) score. The site maps each model’s EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two.

The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic’s Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant, the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI’s GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google’s Gemini 3.1 Pro sits in a strong middle position on both axes.

One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges “creates potential scoring bias in favor of Anthropic models.” To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected because it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work.
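
The EQ calculation described above can be sketched as follows. Only the 200-point penalty, the piecewise-linear mapping, and the 50/50 weighting come from the site’s description; the anchor points, function names, and the provider label are assumptions made for illustration.

```python
# Elo points subtracted from the EQ-Bench component for Anthropic models,
# as described in the methodology.
ANTHROPIC_JUDGE_PENALTY = 200

# Hypothetical (elo, eq) anchor points; the real scales are not published.
EQ_BENCH_SCALE = [(800, 80), (1200, 100), (1600, 140)]
ARENA_SCALE = [(1000, 80), (1300, 100), (1500, 140)]


def piecewise_linear(x, points):
    """Interpolate x over sorted (x, y) anchors, clamping at both ends."""
    if x <= points[0][0]:
        return points[0][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return points[-1][1]


def estimated_eq(eq_bench_elo, arena_elo, provider):
    """Combine the two Elo-derived EQ estimates with equal weight."""
    if provider == "Anthropic":
        # Only the Claude-judged component is penalized; Arena uses human judges.
        eq_bench_elo -= ANTHROPIC_JUDGE_PENALTY
    eq_from_bench = piecewise_linear(eq_bench_elo, EQ_BENCH_SCALE)
    eq_from_arena = piecewise_linear(arena_elo, ARENA_SCALE)
    return 0.5 * eq_from_bench + 0.5 * eq_from_arena


# Same raw Elo scores, different providers (illustrative numbers only).
print(estimated_eq(1600, 1400, "OtherLab"))   # 130.0
print(estimated_eq(1600, 1400, "Anthropic"))  # 120.0
```

With these made-up scales, the penalty costs the Anthropic model 10 EQ points, half of the 20-point drop in its EQ-Bench component, because the Arena half of the composite is untouched.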

Plotting IQ against EQ reveals that the smartest models are not always the most emotionally intelligent. Anthropic’s Opus 4.7 dominates the upper-right quadrant. (Credit: AI IQ)

The AI cost-performance chart that enterprise buyers actually need to see

Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model’s estimated IQ against an “effective cost” metric, defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor.
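
That definition reduces to simple arithmetic. The sketch below assumes prices quoted in dollars per million tokens; the prices and the efficiency factor in the example are hypothetical, and the article does not say how the site derives the factor for each model.

```python
# Reference task size from the site's definition of effective cost.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 1_000_000


def effective_cost(input_price_per_mtok, output_price_per_mtok, usage_efficiency=1.0):
    """Token cost of the reference task, scaled by a usage efficiency factor.

    Prices are in USD per million tokens. `usage_efficiency` is a multiplier
    whose derivation is not specified in the article.
    """
    token_cost = (
        (INPUT_TOKENS / 1_000_000) * input_price_per_mtok
        + (OUTPUT_TOKENS / 1_000_000) * output_price_per_mtok
    )
    return token_cost * usage_efficiency


# Hypothetical pricing: $5 per million input tokens, $25 per million output
# tokens, and a factor of 1.5.
print(effective_cost(5.0, 25.0, usage_efficiency=1.5))  # 52.5
```

Here the raw token bill is 2 × $5 + 1 × $25 = $35, and the factor of 1.5 raises the effective cost to $52.50.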

The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner: high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107, potentially the most economical option for bulk classification or extraction workloads.

The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the “green end” of that axis are stronger all-around deals; those near the “red end” sacrifice capability, cost efficiency, or both. For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing (using expensive models for hard problems and cheap ones for everything else) is no longer optional. It is the dominant architecture for serious AI deployments.
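
The article describes routing only in general terms. One simple way to express the idea, offered as an assumption rather than any vendor’s method, is to pick the cheapest model whose estimated IQ clears a per-task threshold. The model names, scores, and costs below are placeholders loosely shaped like the tiers described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    iq: float
    effective_cost: float  # USD per reference task


# Placeholder catalog; none of these are real model names or quoted prices.
CATALOG = [
    ModelOption("budget-model", iq=107, effective_cost=0.20),
    ModelOption("mid-tier-model", iq=116, effective_cost=3.00),
    ModelOption("frontier-model", iq=136, effective_cost=30.00),
]


def route(required_iq, catalog):
    """Return the cheapest model that meets the required IQ.

    If no model qualifies, fall back to the most capable one available.
    """
    eligible = [m for m in catalog if m.iq >= required_iq]
    if not eligible:
        return max(catalog, key=lambda m: m.iq)
    return min(eligible, key=lambda m: m.effective_cost)


print(route(105, CATALOG).name)  # budget-model
print(route(115, CATALOG).name)  # mid-tier-model
print(route(130, CATALOG).name)  # frontier-model
```

A production router would need a way to estimate how hard each request is, which is the difficult part and is not addressed here.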

Critics say AI’s “jagged” capabilities make a single IQ score dangerously misleading

The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model’s uneven capabilities into a single score obscures more than it reveals.

“IQ as a proxy is fading - we’re seeing reasoning density spikes that don’t map to g-factor,” posted Zaya, a technology commentator, on X. “GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time.”

That observation touches on what AI researchers call the “jaggedness” problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps.

Pressureangle, another X user, posted a more granular critique, calling out “complete lack of transparency” and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets, a gap that matters to researchers accustomed to fully reproducible methods.

Others questioned the premise itself. “As useless as human IQ testing,” wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: “Why not having the Models take an official (MENSA-Grade) test? Wouldn’t this be the most accurate and most ‘human-comparable’ way to benchmark intelligence?” That approach already exists through TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, “an IQ-style benchmark captures only one slice of capability.” Each approach has tradeoffs, and neither has won the argument yet.

The real race isn’t for the highest score; it’s for the smartest model stack

For all the debate about methodology, the most important signal in AI IQ’s data may not be any single model’s score. It is the shape of the market the charts reveal.

There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that “most benchmarks introduce bias by focusing on a particular type of domain,” and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135, roughly 60 points of improvement in 30 months.

That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests (ARC-AGI-2, FrontierMath Tier 4, Humanity’s Last Exam), the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: “ARC AGI 3 disagrees,” he wrote, referencing a next-generation benchmark that may already be undermining current scores.

AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while likely missing others. But the alternative (wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions) is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to “which model is best?” is almost always “it depends on the task.”

As Debdoot Ghosh mused on X after viewing the charts: “Now a human’s role is just to orchestrate?”

Maybe. But if the AI IQ data shows anything clearly, it is that orchestration (knowing which model to deploy, when, and at what price) has become its own form of intelligence. And for that, there is no benchmark yet.
