Gemini Ultra vs. GPT-4: What Does Microsoft Say?

Microsoft puts GPT-4 ahead of Gemini Ultra again, using Google’s own tricks

Microsoft claims that GPT-4, when combined with a special prompting strategy, outperforms Google Gemini Ultra in the MMLU (Measuring Massive Multitask Language Understanding) benchmark.

Microsoft recently introduced Medprompt, a prompting strategy originally developed for medical challenges. Microsoft researchers discovered, however, that it is also suitable for more general applications.

Microsoft has achieved a new State-of-the-Art (SoTA) score on the MMLU benchmark by running GPT-4 with a modified version of Medprompt.

Microsoft’s announcement is interesting because Google highlighted Ultra’s new top MMLU score during the launch of its new Gemini AI system last week.

Microsoft tricks back: Complex prompts improve benchmark performance

Google’s messaging at the time of Gemini’s launch was a little misleading: the Ultra model achieved the best MMLU benchmark result to date, but only with a more complex prompting strategy than is typical for this benchmark. With the standard prompting strategy (5-shot), Gemini Ultra scores lower than GPT-4.
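For readers unfamiliar with the term, a 5-shot prompt simply shows the model five solved example questions before the question it has to answer. The following is a minimal sketch of how such a prompt might be assembled for a multiple-choice benchmark. The function names and the exact text layout are assumptions made for illustration; they are not the official MMLU evaluation format or anything published by Google or Microsoft.

```python
LETTERS = "ABCD"


def format_item(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank if it is None."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    # Solved examples end in "Answer: X"; the final question ends in "Answer:".
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_few_shot_prompt(examples, question, choices):
    """Hypothetical helper: solved examples first, then the open question.

    `examples` is a list of (question, choices, answer_letter) tuples.
    Passing five of them yields a 5-shot prompt.
    """
    blocks = [format_item(q, c, a) for q, c, a in examples]
    blocks.append(format_item(question, choices))
    return "\n\n".join(blocks)
```

The more complex strategies discussed in this article add further steps on top of a prompt like this one, such as asking the model for several answers and combining them.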

Microsoft now reports that GPT-4 with Medprompt+ reaches a record 90.10 percent on MMLU, surpassing Gemini Ultra’s 90.04 percent by a margin of 0.06 percentage points.


To achieve this result, Microsoft researchers extended Medprompt to Medprompt+ in two steps: they added a simpler prompting method alongside the basic Medprompt strategy, and they developed a procedure for combining the answers that the two methods produce.
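The article does not spell out how Microsoft combines the answers, so the following is only a sketch of one generic way to merge the outputs of two prompting strategies: weighted majority voting. The function name, the `simple_weight` parameter, and the tie-breaking rule are all assumptions made for illustration and should not be read as Microsoft's actual procedure.

```python
from collections import Counter


def combine_answers(medprompt_votes, simple_votes, simple_weight=1.0):
    """Hypothetical combiner: weighted majority vote over answer letters.

    Each argument is a list of answer letters (for example "A" to "D")
    returned by repeated runs of one prompting strategy. Votes from the
    simpler strategy are scaled by `simple_weight` (an assumed knob).
    """
    tally = Counter()
    for vote in medprompt_votes:
        tally[vote] += 1.0
    for vote in simple_votes:
        tally[vote] += simple_weight
    if not tally:
        raise ValueError("no votes to combine")
    # Iterate in sorted order so ties resolve to the alphabetically first letter.
    return max(sorted(tally), key=lambda letter: tally[letter])
```

With equal weights, `combine_answers(["A", "B", "B"], ["A", "A"])` returns "A", because "A" collects three votes against two for "B". Lowering `simple_weight` shifts the decision toward the first strategy.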

The MMLU benchmark is an extensive test of general knowledge and reasoning. It includes about 16,000 multiple-choice questions from 57 different subject areas, such as mathematics, history, law, computer science, engineering, and medicine. Many consider it the most important benchmark for language models.

When Microsoft measures performance, GPT-4 outperforms Gemini Ultra on even more benchmarks

In addition to the MMLU benchmark, Microsoft has published results for other benchmarks that compare GPT-4 to Gemini Ultra using the simple prompts that are common for these benchmarks. According to Microsoft, GPT-4 outperforms Gemini Ultra under these conditions on several benchmarks, including GSM8K, MATH, HumanEval, BIG-Bench-Hard, DROP, and HellaSwag.


Medprompt and other similar prompting strategies are available on GitHub under the name Promptbase. The repository contains scripts, general tools, and information to aid in reproducing the results and improving the base models’ performance.

The differences between these benchmark scores are mostly minor. Microsoft and Google use them primarily for public relations purposes, and they are unlikely to matter much in practice. However, Microsoft is emphasizing here what was already obvious when Ultra was announced: the two models are on par.

This could indicate either that OpenAI is ahead of Google or that developing a much more capable LLM than GPT-4 is extremely difficult. As Bill Gates recently suggested, LLM technology in its current form may have reached its limits. A GPT-4.5 or GPT-5 release from OpenAI may help settle that question.
