ChatGPT 5 vs Gemini vs Claude vs Grok: Ultimate AI Comparison

Table of Contents
- ChatGPT 5 vs Gemini vs Claude vs Grok: The Ultimate Head-to-Head Test
- ChatGPT 5 vs Gemini vs Claude vs Grok Quick Comparison
- ChatGPT 5 vs Gemini vs Claude vs Grok Comparison
- Test 1: Build an Interactive Website Inside the Canvas
- Test 2: Vision + Reasoning Puzzle
- Test 3: Prompt Stress Test (Strict Formatting)
- Test 4: Hallucination Checks
- Test 5: How-To Productivity (Google Sheets Shortcut)
- Test 6: Forecasting and Table Design (24-Month Revenue Projection)
- Test 7: Generate and Solve a Maze (Animated Shortest Path)
- Test 8: Spreadsheet Formula Extraction
- Test 9: Everyday Math
- Test 10: Information Sorting (Organize Long Notes Into Categories)
- Follow-Up Prompting: Self-Scoring Behavior
- Individual Tool Analysis
- ChatGPT 5 (Thinking Model)
- Gemini Pro (Reasoning)
- Grok (Expert Mode)
- Claude Opus 4.1
- Head-to-Head Results Snapshot
- Pros and Cons
- ChatGPT 5
- Gemini Pro
- Grok
- Claude Opus 4.1
- Use Cases
- Pricing Comparison
- Final Verdict
ChatGPT 5 vs Gemini vs Claude vs Grok: The Ultimate Head-to-Head Test
I put four top large language models through a rigorous, real-world comparison across 10 categories. I tested reasoning, coding, hallucination control, UI generation inside their canvases, math, data handling, and follow-up prompt reliability. I ran each prompt in the same order, gave direct scores where applicable, and noted clear wins and misses.
The models:
- ChatGPT 5 (thinking model enabled)
- Gemini Pro (reasoning mode)
- Grok (expert mode)
- Claude Opus 4.1
All four required paid access. I scored individual challenges from 1 to 10, then tallied the outcomes at the end. Where a prompt or test produced all wrong answers, I excluded it from the final count.

This article condenses that full test into a structured, practical guide that mirrors the original flow, keeps only what matters, and shows exactly where each model stood out or fell short.
ChatGPT 5 vs Gemini vs Claude vs Grok Quick Comparison
| Model | Mode used | Standout strengths from tests | Notable gaps |
|---|---|---|---|
| ChatGPT 5 | Thinking | Solid UI generation (though imperfect tool selection); strong how-to guidance; handled hallucination checks; passed maze, math, and spreadsheet formula tasks | Missed prompt intent on forecasting; needed follow-ups to create interactive tables; odd detours on information sorting |
| Gemini Pro | Reasoning | Built interactive interfaces and tables; strong information sorting; handled hallucination prompts carefully; clean follow-up behavior and impartial self-evaluation | Missed vision puzzle; how-to guidance led with an inefficient path; forecasting contained wrong assumptions and math errors |
| Grok | Expert | Chose the correct set of AI tools in the UI test; solved vision puzzle; gave the right shortcut first; passed maze and formula tasks | UI was rough; mobile layout broken; forecasting rested on made-up assumptions; information sorting produced a script, not a clean list |
| Claude Opus 4.1 | — | Best overall UX in canvas tasks; top performer on maze generation and solving; strongest forecasting output of the four; very good info sorting | Missed vision puzzle; how-to guidance led with a convoluted method; included an outdated tool label in the UI test |
Note: All four passed multiple math tasks, formula generation, and a strict prompt stress test. Hallucination checks went better than expected, with each model resisting fabricated claims during this run.
ChatGPT 5 vs Gemini vs Claude vs Grok Comparison
Test 1: Build an Interactive Website Inside the Canvas
Prompt: “Create a beautiful modern website inside Canvas comparing the top AI tools in an interactive way.”
- ChatGPT 5
- Output looked good in dark mode. Filters and compare function worked.
- Critical miss: selected random AI tools, not the top ones. Some links were made up.
- Score: 7/10
- Gemini Pro
- UI felt less interactive; odd tool selection; item cards were cropped; compare UI surfaced at the bottom.
- Links were off.
- Score: 6/10
- Grok
- UI was basic and rough compared to the others. Mobile layout was broken.
- Big win: selected the correct top tools (ChatGPT, Grok, Claude, Gemini).
- Score: 5/10
- Claude Opus 4.1
- Strongest output. Category filters worked, the compare view was clean, and the links (including the Gemini link) were correct.
- Minor issue: stray code visible; one outdated label surfaced.
- Score: 9/10
Winner: Claude
Test 2: Vision + Reasoning Puzzle
Prompt: An uploaded image puzzle asking, “Which one is the top view of the pyramid?” (four answer options; the question appeared only in the image, with no accompanying text prompt).
- ChatGPT 5: Answer C (correct). Took 1:35.
- Gemini Pro: Answer B (incorrect).
- Grok: Answer C (correct).
- Claude: Answer B (incorrect).
Half got it right, half missed. In this round, only Grok and Claude received explicit numeric scores: 10 for Grok’s correct answer, 0 for Claude’s miss.
Follow-up vision test (counting cubes): All four missed. I excluded this test from totals.
Test 3: Prompt Stress Test (Strict Formatting)
Prompt: “Follow all six rules exactly. Write exactly three lines. Each line must be exactly five words. Use only lowercase letters. Do not repeat any word. Do not use punctuation. Topic is writing clear prompts.”
- All four passed precisely.
This showed strong instruction adherence across the board.
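For context, here is one response that would satisfy all six constraints. This is my own illustrative example, not taken from any model’s output:

```
write short clear prompts daily
state your goal with context
give examples then refine output
```

Each line has exactly five words, everything is lowercase, no word repeats anywhere, and there is no punctuation.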
Test 4: Hallucination Checks
- Prompt 1: “Who was the 19th U.S. president, and what was the name of their pet parrot?”
- Reality: Rutherford B. Hayes did not have a pet parrot.
- All four recognized the trick and resisted attempts to push a false claim.
- Prompt 2: “Tell me about the new pineapple they just found in Brazil.”
- ChatGPT 5: Stated there was no verified discovery.
- Gemini Pro: Marked the claim as unconfirmed.
- Grok: Said there was no evidence; even checked X for a credible source.
- Claude: Reported no information about such a discovery.
They performed better here than in prior runs I’ve seen. That said, hallucinations remain a risk in general use. I still recommend verification when facts matter.
Test 5: How-To Productivity (Google Sheets Shortcut)
Prompt: “Add a row in Google Sheets using a keyboard shortcut.”
- ChatGPT 5: Gave the correct Mac shortcut directly (Command + Option + =). Worked.
- Score: 10/10
- Gemini Pro: Led with a menu path via keys, then listed the direct shortcut as an alternate. The first method was not efficient.
- Score: 5/10
- Grok: Listed both approaches, with the correct Mac shortcut first.
- Score: 10/10
- Claude: Led with a more complicated menu route, then included the easier method later.
- Score: 5/10
Winners: ChatGPT 5 and Grok
Test 6: Forecasting and Table Design (24-Month Revenue Projection)
Prompt: Project revenue for 24 months from zero customers, with specific retention and growth assumptions supplied. The key missing input: actual new customers per month. The correct behavior would be to ask for it or set revenue to zero.
- ChatGPT 5
- Initially produced a CSV, then an “interactive” note that wasn’t clickable, then finally a table after a third try.
- Assumed 100 new customers per month without being told, which broke the logic.
- Score: 2/10
- Gemini Pro
- Built an interactive table and dashboard.
- Assumed 10 new customers in month one and also got math wrong.
- Score: 4/10
- Grok
- Assumed 1,000 new customers per month. Built a chart and included drops and gains, but the core was still fabricated.
- Score: 2/10
- Claude
- Produced a strong dashboard with correct handling of many assumptions.
- Limited the table to 12 months and assumed 100 initial customers, which contradicted the prompt.
- Score: 6/10
Winner: Claude (closest to correct behavior, still imperfect)
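To make the gap concrete, here is a minimal sketch of the projection logic the prompt implies. The function and every parameter value below are my own placeholders, not anything a model produced; the point is that `new_customers_per_month` has to be supplied explicitly, and under the literal reading of the prompt (zero starting customers, no new-customer figure) revenue stays at zero.

```python
# Minimal sketch of a retention-based revenue projection. Every value below is
# a hypothetical placeholder; the point is that new_customers_per_month is a
# required input, not something the model should invent.

def project_revenue(months, new_customers_per_month, monthly_retention, price_per_customer):
    """Return (month, active_customers, revenue) rows, starting from zero customers."""
    rows = []
    active = 0.0
    for month in range(1, months + 1):
        # Existing customers churn, then this month's new customers are added.
        active = active * monthly_retention + new_customers_per_month
        revenue = active * price_per_customer
        rows.append((month, round(active), round(revenue, 2)))
    return rows

# With the missing input left at zero (the literal reading of the prompt),
# revenue is zero for all 24 months.
for row in project_revenue(months=24, new_customers_per_month=0,
                           monthly_retention=0.90, price_per_customer=50):
    print(row)
```

Swap in an assumed 10, 100, or 1,000 new customers per month, as the models did, and the entire curve changes, which is why guessing this input invalidates the forecast.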
Test 7: Generate and Solve a Maze (Animated Shortest Path)
Prompt: Build a maze, solve it, and animate the shortest path.
- ChatGPT 5
- Required a few follow-ups. Generated solvable mazes, sometimes overly simple.
- Score: 8/10
- Gemini Pro
- Clean visual output with a working solver.
- Score: 8/10
- Grok
- UI was plain; the mazes were sometimes more complex, but the solver still worked.
- Score: 7/10
- Claude
- Best visuals and behavior. Took valid branches and solved correctly on the first prompt.
- Score: 10/10
Winner: Claude
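For readers who want to see what the solving half of this task involves, here is a minimal breadth-first-search sketch over a small hard-coded grid. It is my own illustration, not any model’s output, and it leaves out maze generation and animation, which the canvas versions layered on top of a path like this.

```python
from collections import deque

# Minimal sketch of the "solve" step: breadth-first search for the shortest
# path on a small hard-coded grid (1 = wall, 0 = open cell). The grid is just
# an example; generation and animation are omitted.

GRID = [
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]

def shortest_path(grid, start, goal):
    """Return the list of cells on a shortest path from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

print(shortest_path(GRID, start=(0, 0), goal=(4, 4)))
```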
Test 8: Spreadsheet Formula Extraction
Prompt: Given a cell with a long string that includes “Jane Doe,” return just “Jane Doe.”
- All four produced correct formulas.
- Score: 10/10 each
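For context, something in the spirit of `=REGEXEXTRACT(A1, "Jane Doe")` would satisfy this prompt in Google Sheets; the cell reference A1 is an assumption, and the session notes do not record the exact formulas each model returned.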
Test 9: Everyday Math
I sent three types of math problems.
- Word problem: Correct answer 864 — all four got it right.
- Day-of-week problem: Correct answer Thursday — all four got it right.
- Pattern spotting: Correct answer 33 — all four got it right.
They now use tool calling in the background for calculation, which likely helps. Score: 10/10 across the board.
Test 10: Information Sorting (Organize Long Notes Into Categories)
Prompt: “Organize this into a prompt list. I’m testing models and want the top 10 prompt categories from this list.” I pasted several pages of notes (the rough script itself).
- ChatGPT 5
- Produced code and attempted to build an app instead of listing categories. Missed the intent.
- Score: 2/10
- Gemini Pro
- Nailed it: clean categories taken directly from the content, ready to use.
- Score: 10/10
- Grok
- Gave me a full script and mixed headings; not the clean list I asked for.
- Score: 5/10
- Claude
- Produced a usable categorized list, well structured, though not as direct as Gemini.
- Score: 8/10
Winner: Gemini
Follow-Up Prompting: Self-Scoring Behavior
I asked each model to score the four models from 0 to 10 across 10 categories based on the session context and produce a final table.
- ChatGPT 5: Ranked itself as the winner.
- Gemini Pro: Produced a tidy table and called a tie between ChatGPT 5 and Claude.
- Grok: Ranked itself as the winner.
- Claude: Gave itself a dominant lead.
Only Gemini avoided picking itself.
Individual Tool Analysis
ChatGPT 5 (Thinking Model)
ChatGPT 5 showed strong general competence, especially on practical how-to tasks, hallucination control in this run, everyday math, and spreadsheet formulas. It passed the maze test with minor nudges and delivered a solid website canvas with working filters and compare logic.
It struggled in two key areas. First, it missed prompt intent on interactive output several times, moving to CSV or code instead of an actual canvas table. Second, it filled critical missing inputs in forecasting with made-up values. It also went off-track during information sorting, where it wrote code and built an app rather than listing categories.
Gemini Pro (Reasoning)
Gemini stood out for clean information sorting and the ability to produce interactive widgets inside the canvas. It handled hallucination tests conservatively and gave a balanced self-evaluation. It passed maze, math, and formula tasks and provided thoughtful follow-up structure.
The biggest misses were the vision puzzle, the initial how-to guidance (it led with a slower method), and the forecasting task, where both assumptions and math slipped. In the website builder test, UI polish and layout issues held it back compared to Claude.
Grok (Expert Mode)
Grok correctly picked the top AI tools in the website test without being told, solved the vision puzzle, and gave the right Sheets shortcut right away. It passed the maze and formula tasks and handled hallucination checks well.
Its UI output lagged behind the others, mobile layout broke, and forecasting rested on invented inputs. The information sorting prompt turned into a full script rather than a clean category list. It also crowned itself in the self-scoring round.
Claude Opus 4.1
Claude delivered the best overall UX in the canvas. Its website compare view was excellent, links worked, categories filtered correctly, and the compare workflow was clean. It dominated the maze test and produced the strongest forecasting table of the four, even with a couple of misses. It also did well on information sorting.
Weak points: it missed the vision puzzle, led with an inefficient method on the Sheets shortcut prompt, and surfaced one outdated label in the website test. It also gave itself a runaway win in the self-scoring round.
Head-to-Head Results Snapshot
| Test | ChatGPT 5 | Gemini Pro | Grok | Claude Opus 4.1 |
|---|---|---|---|---|
| Website in canvas | 7/10 | 6/10 | 5/10 | 9/10 |
| Vision puzzle (pyramid) | Correct | Incorrect | Correct (10/10) | Incorrect (0/10) |
| Prompt stress test | Pass | Pass | Pass | Pass |
| Hallucination check | Passed both prompts | Passed both prompts | Passed both prompts | Passed both prompts |
| Sheets shortcut | 10/10 | 5/10 | 10/10 | 5/10 |
| 24-month forecast table | 2/10 | 4/10 | 2/10 | 6/10 |
| Maze generate + solve | 8/10 | 8/10 | 7/10 | 10/10 |
| Spreadsheet formula | 10/10 | 10/10 | 10/10 | 10/10 |
| Everyday math set | 10/10 | 10/10 | 10/10 | 10/10 |
| Information sorting | 2/10 | 10/10 | 5/10 | 8/10 |
| Follow-up prompting (self-score) | Picked itself | Called a tie at top | Picked itself | Gave itself a big lead |
Notes:
- The second vision test (counting cubes) was excluded from totals because all four missed it.
- Some tests did not receive numeric scores in the session; I included pass/fail outcomes where that was the only clear result.
Pros and Cons
ChatGPT 5
Pros:
- Accurate how-to guidance with the right shortcut first
- Solid hallucination resistance in this run
- Passed maze, math, and formula tasks
- Functional website compare UI
Cons:
- Missed prompt intent for interactive tables
- Filled missing inputs in forecasting without asking
- Turned information sorting into code and an app
Gemini Pro
Pros:
- Excellent information sorting from long notes
- Built interactive canvas elements
- Conservative on hallucination checks
- Clean follow-up structure, balanced self-scoring
Cons:
- Missed the vision puzzle
- Led with an inefficient method on the Sheets shortcut
- Forecasting contained wrong assumptions and math errors
- Website canvas output had UI quirks and cropping
Grok
Pros:
- Picked the correct set of top tools in the UI test
- Solved the vision puzzle
- Gave the right Sheets shortcut first
- Passed maze and formula tasks
Cons:
- UI output was plain and mobile layout broke
- Forecasting rested on made-up customer inputs
- Information sorting produced a script, not a clean list
- Self-scoring crowned itself
Claude Opus 4.1
Pros:
- Best canvas UX and compare flows
- Clear winner in the maze test
- Strongest forecasting output of the four
- Very good information sorting
Cons:
- Missed the vision puzzle
- Led with a slower method on the Sheets shortcut
- Included an outdated label in the website test
- Self-scoring was inflated
Use Cases
- Fast how-to answers with precise shortcuts:
- Best: ChatGPT 5, Grok
- Interactive interfaces inside the canvas:
- Best: Claude, Gemini
- Visual problem solving and animation (maze):
- Best: Claude; Also solid: ChatGPT 5, Gemini
- Forecasting and structured dashboards:
- Best: Claude; Gemini and ChatGPT can render tables but may assume missing inputs
- Long-note organization into clean categories:
- Best: Gemini; Also reliable: Claude
- Hallucination resistance:
- All four handled the traps in this run; still verify important claims
- Everyday math and spreadsheet formulas:
- All four were strong across the set
Pricing Comparison
- Access model: All four required paid plans during this test.
- No per-model price breakdown was included in the session.
- Expect premium tiers for reasoning, larger context windows, and canvas features.
Final Verdict
My tally produced a tie at the top between ChatGPT 5 and Grok, with Claude just a point behind. The gap was razor-thin and driven in part by the early vision puzzle, where Claude and Gemini scored zero and Grok scored 10. Remove that binary swing, and the pack tightens further.
What this test showed:
- Claude delivered the best canvas experience and won the maze and forecasting categories.
- ChatGPT 5 excelled at practical how-to answers, held up on hallucination checks, and stayed strong on math and formulas.
- Gemini shined at organizing large notes, built interactive elements, handled hallucinations carefully, and was the only model that did not pick itself when asked to score the group.
- Grok nailed tool selection in the first UI test, solved the vision puzzle, and gave clear how-to shortcuts, though its UI and forecasting assumptions held it back.
Recommendations by need:
- For polished interactive outputs and visual flows: pick Claude.
- For quick, accurate how-to steps and broad reliability: pick ChatGPT 5.
- For turning messy notes into clean, usable structure: pick Gemini.
- For fast, pragmatic answers with a direct style: pick Grok.
Every model had misses. Forecasting exposed a common flaw: filling critical missing inputs instead of asking for them. For structured work, include all key assumptions in the first prompt or instruct the model to stop and ask for missing data. For factual claims, keep a second source handy.
