Microsoft on Thursday launched three new foundational AI fashions it constructed completely in-house — a state-of-the-art speech transcription system, a voice technology engine, and an upgraded picture creator — marking essentially the most concrete proof but that the $3 trillion software program big intends to compete instantly with OpenAI, Google, and different frontier labs on mannequin growth, not simply distribution.
The trio of fashions — MAI-Transcribe-1, MAI-Voice-1, and MAI-Picture-2 — can be found instantly by way of Microsoft Foundry and a brand new MAI Playground. They span three of essentially the most commercially useful modalities in enterprise AI: changing speech to textual content, producing sensible human voice, and creating pictures. Collectively, they characterize the opening salvo from Microsoft’s superintelligence crew, which Suleyman fashioned simply six months in the past to pursue what he calls “AI self-sufficiency.”
“I am very excited that we have now bought the primary fashions out, that are the perfect on the planet for transcription,” Suleyman instructed VentureBeat in an interview forward of the general public announcement. “Not solely that, we’re in a position to ship the mannequin with half the GPUs of the state-of-the-art competitors.”
The announcement lands at a precarious second for Microsoft. The corporate’s inventory simply closed its worst quarter for the reason that 2008 monetary disaster, as buyers more and more demand proof that lots of of billions of {dollars} in AI infrastructure spending will translate into income. These fashions — priced aggressively and positioned to cut back Microsoft’s personal value of products offered — are Suleyman’s first reply to that strain.
Microsoft’s new transcription mannequin claims best-in-class accuracy throughout 25 languages
MAI-Transcribe-1 is the headline launch. The speech-to-text mannequin achieves the bottom common Phrase Error Charge on the FLEURS benchmark — the industry-standard multilingual check — throughout the highest 25 languages by Microsoft product utilization, averaging 3.8% WER. In line with Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 every.
The mannequin makes use of a transformer-based textual content decoder with a bi-directional audio encoder. It accepts MP3, WAV, and FLAC recordsdata as much as 200MB, and Microsoft says its batch transcription pace is 2.5 occasions sooner than the present Microsoft Azure Quick providing. Diarization, contextual biasing, and streaming are listed as “coming quickly.” Microsoft is already testing MAI-Transcribe-1 inside Copilot’s Voice mode and Microsoft Groups for dialog transcription — a element that underscores how rapidly the corporate intends to switch third-party or older inside fashions with its personal.
Alongside it, MAI-Voice-1 is Microsoft’s text-to-speech mannequin, able to producing 60 seconds of natural-sounding audio in a single second. The mannequin preserves speaker id throughout long-form content material and now helps customized voice creation from only a few seconds of audio by way of Microsoft Foundry. Microsoft is pricing it at $22 per 1 million characters. MAI-Picture-2, in the meantime, debuted as a top-three mannequin household on the Enviornment.ai leaderboard and now delivers no less than 2x sooner technology occasions on Foundry and Copilot in comparison with its predecessor. Microsoft is rolling it out throughout Bing and PowerPoint, pricing it at $5 per 1 million tokens for textual content enter and $33 per 1 million tokens for picture output. WPP, one of many world’s largest promoting holding firms, is among the many first enterprise companions constructing with MAI-Picture-2 at scale.
The contract renegotiation with OpenAI that made Microsoft’s mannequin ambitions attainable
To know why these fashions matter, you must perceive the contractual tectonic shift that made them attainable. Till October 2025, Microsoft was contractually prohibited from independently pursuing synthetic common intelligence. The unique take care of OpenAI, signed in 2019, gave Microsoft a license to OpenAI’s fashions in trade for constructing the cloud infrastructure OpenAI wanted. However when OpenAI sought to broaden its compute footprint past Microsoft — hanging offers with SoftBank and others — Microsoft renegotiated. As Suleyman defined in a December 2025 interview with Bloomberg, the revised settlement meant that “up till just a few weeks in the past, Microsoft was not allowed — by contract — to pursue synthetic common intelligence or superintelligence independently.” The brand new phrases freed Microsoft to construct its personal frontier fashions whereas retaining license rights to all the things OpenAI builds by way of 2032.
Suleyman described the dynamic to VentureBeat in characteristically blunt phrases. “Again in September of final 12 months, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our personal superintelligence,” he stated. “Since then, we have been convening the compute and the crew and shopping for up the information that we’d like.”
He was fast to emphasise that the OpenAI partnership stays intact. “Nothing’s altering with the OpenAI partnership. We can be in partnership with them no less than till 2032 and hopefully lots longer,” Suleyman stated. “They’ve been an exceptional accomplice to us.” He additionally highlighted that Microsoft gives entry to Anthropic’s Claude by way of its Foundry API, framing the corporate as “a platform of platforms.” However the subtext is unmistakable: Microsoft is constructing the aptitude to face by itself. In March, as Enterprise Insider first reported, Suleyman wrote in an inside memo that his aim is to “focus all my power on our Superintelligence efforts and have the ability to ship world class fashions for Microsoft over the subsequent 5 years.” CNBC reported that the structural shift freed Suleyman from day-to-day Copilot product obligations, with former Snap government Jacob Andreou taking on as EVP of the mixed shopper and industrial Copilot expertise.
How groups of fewer than 10 engineers constructed fashions that rival Large Tech’s greatest
Maybe essentially the most hanging element Suleyman shared with VentureBeat is how small the groups behind these fashions really are. “The audio mannequin was constructed by 10 folks, and the overwhelming majority of the pace, effectivity and accuracy positive factors come from the mannequin structure and the information that we’ve used,” Suleyman stated. “My philosophy has all the time been that we’d like fewer people who find themselves extra empowered. So we function a particularly flat construction.” He added: “Our picture crew, equally, is lower than 10 folks. So that is all about mannequin and information innovation, which has delivered state-of-the-art efficiency.”
This issues for 2 causes. First, it challenges the prevailing {industry} narrative that frontier AI growth requires hundreds of researchers and billions in headcount prices. Meta, in contrast, has pursued what Suleyman described in his Bloomberg interview as a method of “hiring a number of people, fairly than possibly making a crew” — together with reported compensation packages of $100 million to $200 million for high researchers. Second, small groups producing state-of-the-art outcomes dramatically enhance the economics. If Microsoft can construct best-in-class transcription with 10 engineers and half the GPUs of rivals, the margin construction of its AI enterprise appears essentially totally different from firms burning by way of money to realize related benchmarks.
The lean-team philosophy additionally echoes Suleyman’s broader views on how AI is already reshaping the work of constructing AI itself. When requested by VentureBeat how his personal crew works, Suleyman described an setting that resembles a startup buying and selling ground greater than a standard Microsoft engineering org. “There are teams of individuals round spherical tables, round tables, not conventional desks, on laptops as a substitute of massive screens,” he stated. “They’re mainly vibe coding, aspect by aspect all day, morning until evening, in rooms of fifty or 60 folks.”
Why Suleyman’s “humanist AI” pitch is aimed squarely at enterprise consumers
Suleyman has been steadily constructing a philosophical model round Microsoft’s AI efforts that he calls “humanist AI” — a time period that appeared prominently within the weblog put up he authored for the launch and that he elaborated on in our interview. “I feel that the motivation of a humanist tremendous intelligence is to create one thing that’s actually in service of humanity,” he instructed VentureBeat. “People will stay in management on the high of the meals chain, and they are going to be all the time aligned to human pursuits.”
The framing serves a number of functions. It differentiates Microsoft from the extra acceleration-oriented rhetoric coming from OpenAI and Meta. It resonates with enterprise consumers who want governance, compliance, and security assurances earlier than deploying AI in regulated industries. And it gives a story hedge: if one thing goes unsuitable within the broader AI ecosystem, Microsoft can level to its acknowledged dedication to human management. In his December Bloomberg interview, Suleyman went additional, describing containment and alignment as “purple strains” and arguing that nobody ought to launch a superintelligence device till they’re “assured it may be managed.”
Suleyman additionally harassed information provenance as a aggressive benefit, describing a dialog with CEO Satya Nadella about growing “a clear lineage of fashions the place the information is extraordinarily clear.” He drew an implicit distinction with open-source alternate options, noting that “lots of the open-source fashions have been skilled on information in, for instance, inappropriate methods. And there are doubtlessly safety points with that.” For enterprise prospects evaluating AI distributors amid a thicket of copyright lawsuits throughout the {industry}, that may be a significant industrial argument — if Microsoft can credibly declare that its coaching information was acquired by way of correctly licensed channels, it reduces the authorized and reputational threat of deploying these fashions in manufacturing.
Microsoft’s aggressive pricing places strain on Amazon, Google, and the AI startup ecosystem
Right now’s launch positions Microsoft on three aggressive fronts concurrently. MAI-Transcribe-1 instantly targets the transcription workloads that OpenAI’s Whisper fashions have dominated within the open-source group, with Microsoft claiming superior accuracy on all 25 benchmarked languages. The FLEURS outcomes additionally present it profitable in opposition to Google’s Gemini 3.1 Flash Lite on 22 of 25 languages — a direct problem as Google aggressively pushes Gemini throughout its personal product suite. And MAI-Voice-1’s capacity to clone voices from seconds of audio and generate speech at 60x real-time places it in competitors with ElevenLabs, Resemble AI, and the rising ecosystem of voice AI startups, with Microsoft’s distribution benefit — any Foundry developer can now entry these capabilities by way of the identical API they use for GPT-4 and Claude — appearing as a strong moat.
Suleyman framed the aggressive place confidently: “We’re now a high three lab just below OpenAI and Gemini,” he instructed VentureBeat. The pricing technique — MAI-Voice-1 at $22 per million characters, MAI-Picture-2 at $5 per million enter tokens — displays a deliberate resolution to compete on value. “We’re pricing them to be the perfect of any hyperscaler. So there would be the most cost-effective of any of the hyperscalers on the market, Amazon. And clearly Google,” Suleyman stated. “And that is a really aware resolution.”
This makes strategic sense for Microsoft, which might amortize mannequin growth prices throughout its huge put in base of enterprise prospects. However it additionally speaks to the query buyers have been asking with growing urgency: when does AI spending begin producing returns? Microsoft’s inventory has fallen roughly 17% year-to-date, in response to CNBC, a part of a broader selloff in software program shares. By constructing fashions that run on half the GPUs of rivals, Microsoft reduces its personal infrastructure prices for inside merchandise — Groups, Copilot, Bing, PowerPoint — whereas providing builders pricing designed to undercut the remainder of the market. In his March memo, Suleyman wrote that his fashions would “allow us to ship the COGS efficiencies mandatory to have the ability to serve AI workloads on the immense scale required within the coming years.” These three fashions are the primary tangible supply on that promise.
Suleyman says a frontier massive language mannequin is coming — and Microsoft plans to be “fully impartial”
Suleyman made clear that transcription, voice, and picture technology are only the start. When requested whether or not Microsoft would construct a big language mannequin to compete instantly with GPT on the frontier degree, he was unequivocal. “We completely are going to be delivering state-of-the-art fashions throughout all modalities,” he stated. “Our mission is to ensure that if Microsoft ever wants it, we can present state-of-the-art at one of the best effectivity, the most cost effective value, and be fully impartial.”
He described a multi-year roadmap to “arrange the GPU clusters on the applicable scale,” noting that the superintelligence crew was formally stood up solely in October 2025. Suleyman spoke to VentureBeat from Miami, the place the total crew was convening for one in every of its common week-long in-person periods. He described Nadella flying in for the gathering to put out “the roadmap of all the things that we have to obtain for our AI self-sufficiency mission over the subsequent 2, 3, 4 years, and all of the compute roadmap that that might contain.”
Constructing a aggressive frontier LLM, after all, is a unique order of magnitude in complexity, information necessities, and compute value from what Microsoft demonstrated Thursday. The fashions launched at present are specialised — they deal with audio and pictures, not the overall reasoning and textual content technology that underpin merchandise like ChatGPT or Copilot’s core intelligence. Suleyman has the organizational mandate, Nadella’s public backing, and the contractual freedom. What he would not but have is a observe file at Microsoft of delivering on the toughest downside in AI.
However think about what he does have: three fashions which might be best-in-class or close to it of their respective domains, constructed by groups smaller than most seed-stage startups, operating on half the industry-standard GPU footprint, and priced beneath each main cloud competitor. Two years in the past, Suleyman proposed in MIT Expertise Overview what he referred to as the “Trendy Turing Check” — not whether or not AI might idiot a human in dialog, however whether or not it might exit into the world and achieve actual financial duties with minimal oversight. On Thursday, his personal fashions took a step towards that imaginative and prescient. The query now’s whether or not Microsoft’s superintelligence crew can repeat the trick on the scale that really issues — and whether or not they can do it earlier than the market’s persistence runs out.

