Yucong Duan (段玉聪)
Director, International Standards Committee for AI DIKWP Evaluation
Chair, World Artificial Consciousness Conference
President, World Artificial Consciousness Association
(Contact email: duanyucong@hotmail.com)
Introduction

We assess five leading large language models – GPT-4, Claude, DeepSeek, LLaMA, and Gemini – on their mathematical abilities using the DIKWP semantic mathematics framework. DIKWP stands for Data-Information-Knowledge-Wisdom-Processes, a five-level hierarchy that transforms raw data into meaningful knowledge and decisions (科学网—Evaluation on AGI/GPT based on the DIKWP for ...). In the context of math problem-solving, DIKWP emphasizes: (1) parsing the data (given facts and symbols), (2) interpreting information (what is being asked), (3) applying mathematical knowledge (theorems, formulas, algorithms), (4) exercising wisdom (logical reasoning and strategic insight), and (5) executing problem-solving processes (rigorous calculations or proof steps). This framework encourages semantic understanding of math problems, not just symbolic manipulation, aiming to reduce hallucinations and improve reasoning reliability (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文).
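To make the five layers concrete, the following minimal sketch walks a toy word problem through a DIKWP-style pipeline. The staging and function names are purely illustrative assumptions made for this report, not an official DIKWP implementation.

```python
# Illustrative only: a toy DIKWP-style pipeline for one arithmetic word problem.
# The stage names follow the framework described above; the functions are hypothetical.
import re

def parse_data(problem):
    """Data: extract the raw numeric facts from the problem text."""
    return [int(tok) for tok in re.findall(r"\d+", problem)]

def interpret_information(problem):
    """Information: identify what the question is actually asking."""
    return "total" if ("total" in problem or "sum" in problem) else "unknown"

def apply_knowledge(goal):
    """Knowledge: select a mathematical rule appropriate to the goal."""
    return sum if goal == "total" else None

def exercise_wisdom(numbers, rule):
    """Wisdom: sanity-check that the chosen rule actually fits the data."""
    return rule is not None and len(numbers) > 0

def execute_process(numbers, rule):
    """Processes: carry out the computation itself."""
    return rule(numbers)

problem = "A shop sells 12 apples and 30 pears. What is the total number of fruits?"
data = parse_data(problem)               # [12, 30]
goal = interpret_information(problem)    # "total"
rule = apply_knowledge(goal)             # built-in sum
if exercise_wisdom(data, rule):
    print(execute_process(data, rule))   # 42
```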
Evaluation Criteria: We focus on six key aspects of mathematical capability, aligned with DIKWP principles:
Logical Consistency – Rigor and coherence of the reasoning process (Does the model maintain sound logic without contradictions?).
Calculation Ability – Accuracy in carrying out complex calculations (Can it handle multi-step arithmetic or algebraic computations correctly?).
Proof Ability – Skill in formulating and verifying proofs (Can it prove non-trivial theorems rigorously, rather than just recalling known results?).
Semantic Understanding – Depth of comprehension of mathematical concepts and language (Does it truly understand problem semantics or just pattern-match?).
Creative Reasoning – Ability to propose novel problem-solving approaches (Can it think beyond standard methods and suggest new insights?).
Mathematical Innovation – Capacity for genuine innovation in math (Will it actively adopt new frameworks like semantic math, and contribute original ideas or conjectures beyond the training data?).
We use a variety of benchmark results and examples to compare model performance. Where possible, we include formula derivations or solution excerpts to illustrate the reasoning process. We also summarize results in comparative visuals (e.g. radar-chart style comparisons and ranking lists) in the text.
1. Logical Consistency

Logical consistency refers to how well a model follows a clear, correct line of reasoning without leaps or contradictions. An ideal LLM should carry out step-by-step deduction or inference as a human mathematician would, preserving validity at each step.
Comparative Performance:
GPT-4: Generally exhibits strong logical consistency, especially when prompted with chain-of-thought reasoning. It can break down complex problems into substeps and usually follows mathematically valid steps. This is a significant improvement over earlier models – for example, ChatGPT in early 2023 often gave abysmal “proofs” that swept nontrivial steps under the rug (e.g. claiming “using Galois theory” without explanation in a geometric trisection proof) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). In contrast, GPT-4 can produce much more rigorous arguments for known theorems or problems. However, it may still occasionally make subtle logical mistakes if not guided explicitly. Its reasoning is data-driven; if a problem requires a truly novel insight outside its training, GPT-4 might either fall back on closest-known approaches or produce a plausible but incorrect line of reasoning. Still, on standard benchmarks that require multi-step reasoning, GPT-4’s logical performance is top-tier (it was the previous state of the art before newer rivals). For instance, OpenAI’s advanced “o1” reasoning model (an iteration of GPT-4 with extended reasoning length) achieved 74–83% success on AIME 2024 contest problems with proper prompting (Learning to reason with LLMs | OpenAI), compared to only ~12% if no reasoning chain is encouraged (a single-shot answer with GPT-4 often fails hard problems) (Learning to reason with LLMs | OpenAI). This shows GPT-4 needs structured reasoning to reach its full logical potential. When that structure is in place, it performs nearly at the level of human contest experts.
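For illustration, the difference between a single-shot query and the kind of structured chain-of-thought prompt discussed above can be sketched as plain prompt strings (no particular vendor API is assumed here):

```python
# Illustrative prompt construction only; no specific model API is assumed.
problem = "How many positive integers less than 100 are divisible by 3 or 5?"

# Single-shot prompt: asks only for the final answer.
single_shot = f"{problem}\nAnswer with a single number."

# Chain-of-thought prompt: asks the model to expose intermediate reasoning
# before committing to an answer, the structure discussed above.
chain_of_thought = (
    f"{problem}\n"
    "Let's think step by step. Count multiples of 3, multiples of 5, "
    "subtract multiples of 15 (inclusion-exclusion), then state the final answer."
)

# Ground truth for this toy problem, computed directly:
print(sum(1 for n in range(1, 100) if n % 3 == 0 or n % 5 == 0))  # 46
```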
Claude (Anthropic): Claude’s latest versions also demonstrate strong logical coherence. Anthropic has emphasized training Claude to be truthful and less prone to hallucination, which aids logical consistency. In complex reasoning tasks, Claude 3 (especially the largest model, Claude 3 “Opus”) can rival GPT-4. In fact, Anthropic reported Claude 3 outperformed the original GPT-4 (March 2023 version) on many reasoning benchmarks (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). Claude tends to explain its steps verbosely and tries to “think through” problems. This often results in a coherent reasoning chain for math problems, and it can stick to formal logic well. However, like GPT-4, if Claude doesn’t truly understand a tricky aspect, it might fill the gap with an authoritative-sounding but unfounded explanation. Overall, Claude’s logical reasoning is on par with GPT-4’s in many evaluations. For example, on a graduate-level problem set (GPQA benchmark), Claude 3 was believed to perhaps even outperform GPT-4 (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). For most standard math proofs or word problems, Claude produces a logically sound solution approach. One advantage is Claude’s very long context window, which allows it to consider an entire lengthy proof or problem statement at once, potentially improving logical consistency for long problems.
DeepSeek: DeepSeek stands out for exceptionally rigorous reasoning. It was specifically trained with a focus on reasoning steps (using reinforcement learning to encourage correct multi-step solutions). As a result, DeepSeek’s logical consistency is arguably among the best. In tests, DeepSeek has produced “accurate, complete, well-written” proofs for nontrivial theorems that older models failed to prove (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). For example, DeepSeek Version 3 successfully provided a step-by-step proof that angle trisection by ruler and compass is impossible, properly invoking field extension degrees and the irreducibility of a certain cubic polynomial (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). These are advanced concepts, and DeepSeek’s proof was essentially correct, with only minor details omitted (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). This indicates an extremely high level of logical rigor. Similarly, DeepSeek correctly proved π is irrational and the Fundamental Theorem of Algebra in the same evaluation, whereas ChatGPT’s attempts at these were flawed (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). Such performance suggests DeepSeek’s reasoning does not easily get derailed; it maintains logical structure carefully. Quantitatively, DeepSeek-R1 (its latest version) achieved a Pass@1 score of ~80% on AIME 2024 (solving 12 out of 15 challenging contest problems) – slightly surpassing an equivalent GPT-4 model (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). This high score reflects strong logical problem-solving ability. In essence, DeepSeek’s Wisdom level in the DIKWP sense – making wise decisions on how to approach a problem – is highly developed. One caveat: the first iteration (DeepSeek-R1-Zero, pure RL) sometimes produced logically correct but hard-to-read solutions (mixing languages or unconventional steps) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). After further fine-tuning, DeepSeek’s outputs became both logically sound and well-structured in presentation (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). Overall, it’s a leader in consistency.
LLaMA: As an open-source base model (Meta’s LLaMA/LLaMA-2), its out-of-the-box logical consistency on math is moderate at best. LLaMA was trained on broad text without specialized focus on stepwise math reasoning. Without fine-tuning, it tends to mimic solution patterns it has seen, which can lead to logical gaps or errors for complex problems. For simple problems, it might do fine, but on anything requiring multiple nontrivial steps, LLaMA often fails to maintain a valid reasoning chain. It might make an incorrect inference or skip justification because it doesn’t truly “realize” the need for rigorous proof – it often just produces a plausible-looking answer. That said, when LLaMA is enhanced with techniques like Chain-of-Thought prompting or fine-tuned on reasoning data, its logical ability improves. Research has shown even a 7B LLaMA-2 model can solve many math problems with high accuracy if it samples many reasoning paths and checks them (Common 7B Language Models Already Possess Strong Math ... - arXiv). However, that is more about external guidance than inherent consistency. In our DIKWP terms, LLaMA’s use of Knowledge and Wisdom is limited unless guided. It doesn’t automatically break problems into DIKWP-style layers of understanding; it needs explicit prompting to do so. In sum, LLaMA (base) has the potential for logic, but on its own it’s the least consistent of this group – often requiring corrections. Fine-tuned variants (like WizardMath or Mistral derivatives) can reach decent logical performance, but they still trail behind GPT-4/Claude/DeepSeek in rigorous multi-step reasoning.
Gemini: Google DeepMind’s Gemini is a newer model that has been explicitly designed with advanced reasoning in mind. In evaluations, Gemini demonstrates state-of-the-art logical reasoning, often outperforming GPT-4 on complex tasks (Google's Gemini vs GPT-4: Full Performance Comparison - TextCortex). Sundar Pichai noted that Gemini’s Ultra version was the first to exceed human expert-level performance on the MMLU benchmark (90%), which includes subjects like math and physics (Introducing Gemini: Google’s most capable AI model yet). This implies very strong logical faculties. DeepMind’s background in symbolic AI (e.g. AlphaGo’s planning, AlphaZero, etc.) likely influenced Gemini’s training, giving it an edge in structured problem solving. Indeed, Gemini Ultra has shown “deliberate reasoning” abilities, even on multimodal and novel tasks (Introducing Gemini: Google’s most capable AI model yet). In practice, this means Gemini tends to be very methodical: it parses the problem (Data/Information), recalls relevant formulas (Knowledge), and applies them in a stepwise fashion (Processes) with fewer logical jumps. Early studies confirm Gemini’s consistency – one paper observed Gemini-1.5-Pro had excellent performance in maintaining valid reasoning across steps on challenging problems ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). On coding and math challenges, Gemini’s logical approach is robust: for example, it achieved expert-level Codeforces ratings and high scores on math contests in internal tests (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). One evaluation even suggested that Gemini’s reasoning was the least brittle under problem perturbations among current models (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations), meaning it truly grasps the underlying logic rather than just surface patterns. Overall, Gemini is at least on par with GPT-4 and Claude in logical consistency, and likely a bit ahead in many cases thanks to its training regime.
Summary: All five models have made strides in logical reasoning, but GPT-4, Claude, DeepSeek, and Gemini are in a top tier with high rigor, while LLaMA (without fine-tune) lags behind. DeepSeek has proven itself with near-formal proofs of theorems, Gemini and Claude/GPT-4 excel on diverse reasoning benchmarks. A DIKWP-based radar chart of Logical Consistency would show LLaMA with the smallest radius (moderate logic), and Gemini/DeepSeek/GPT-4/Claude all near the maximum, with perhaps DeepSeek and Gemini slightly leading due to their specialized training. Each top model can follow a logical chain well, but still requires careful prompting or self-checks to avoid subtle errors. None is infallible, yet their consistency is remarkable compared to models just two years ago.
2. Calculation Ability

Calculation ability covers both numeric computation (arithmetic accuracy) and symbolic manipulation (algebra, calculus, etc.). It tests the Processes layer of DIKWP – how well a model can execute the mechanical steps of math once the approach is decided. Complex calculations can trip up language models because they aren’t innately calculators; they must simulate math in the form of text reasoning, which can lead to mistakes if the sequence of operations is long or requires precision. We evaluate how each model handles tasks from basic arithmetic to intricate formula derivations.
Comparative Performance:
GPT-4: On straightforward calculations and moderate multi-step arithmetic, GPT-4 is fairly reliable. It usually can add, subtract, multiply small numbers, or simplify algebraic expressions correctly. However, as calculations become more complex or lengthy, GPT-4’s accuracy drops if it tries to do them purely in its “head”. Being a language model, it has no built-in arbitrary-precision arithmetic. For example, multiplying two 9-digit numbers or doing long division might result in an error if GPT-4 attempts it without special prompting. GPT-4 often knows its limits and will describe the steps (which a user could verify) rather than just blurting out a large result it might miscompute. OpenAI addressed this by allowing GPT-4 to use tools like the Code Interpreter (executing Python code for calculations). With such tools, GPT-4’s calculation accuracy becomes excellent – one study achieved 84.3% accuracy on the challenging MATH dataset by having GPT-4 write and run code to verify its math answers. Without tools, GPT-4’s raw calculation success on competition problems was lower (around 54% on the MATH dataset in one benchmark). Still, even in pure text mode, GPT-4 uses heuristics to handle many calculations: it may simplify equations symbolically or recall known math facts (like $12! = 479001600$ if it memorized that). On tasks like grade-school word problems (GSM8K) that combine simple arithmetic with logic, GPT-4 performs very well (~85–90% accuracy in research reports) (Anthropic Releases Claude Instant 1.2, With Improved Math, Coding ...). Its errors usually come from either a minor arithmetic slip in a multi-step solution or a formatting misunderstanding (not the math itself). GPT-4 shines in calculus or algebraic manipulation where pattern knowledge helps – e.g., it can integrate $x^2$ or solve a quadratic equation flawlessly because those procedures are well-represented in its training data and not too long. But ask it to invert a large matrix or do high-precision calculus by hand, and it will likely falter or give an approximate answer. In summary, GPT-4’s calculation ability is strong but not infallible: it can do most contest-level calculations correctly, especially if allowed to double-check its work, but it isn’t a substitute for a calculator on very tedious computations. As one group of Apple researchers noted, even advanced LLMs can produce results that “sound correct or plausible, but do not actually meet the precision required” for complex math (Mathematical revolution! LLMs break down barriers and tackle ...) – GPT-4 mitigates this with careful reasoning, but the risk remains for heavy computations.
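As a hedged sketch of the tool-assisted verification pattern described above (using sympy as the external checker; this is not GPT-4's actual tool stack), a model-proposed answer can be validated programmatically rather than trusted as in-head arithmetic:

```python
# Minimal sketch of the "write code to check the answer" pattern discussed above.
import sympy as sp

x = sp.symbols("x")

# Suppose the model proposes x = 3 and x = -2 as the roots of x^2 - x - 6 = 0.
proposed_roots = [3, -2]
expr = x**2 - x - 6

# Verify each proposed root symbolically instead of trusting in-head arithmetic.
for r in proposed_roots:
    assert sp.simplify(expr.subs(x, r)) == 0, f"{r} is not a root"

# Independently re-derive the answer as a cross-check.
print(sp.solve(expr, x))  # [-2, 3]
```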
Claude: Claude’s calculation proficiency is similar to GPT-4’s, with some reports suggesting it’s even a bit better on certain arithmetic benchmarks. For instance, Claude 3 reportedly scored 95.0% on GSM8K (grade school math), slightly higher than GPT-4’s performance on that dataset (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). This implies Claude very rarely makes mistakes on multi-step word problems involving arithmetic. Claude’s training likely included a lot of quantitative reasoning and possibly synthetic data for math (Anthropic mentioned using technique-heavy data). On pure calculation (like big multiplications), Claude might not be vastly different – it too lacks an internal calculator. However, Claude tends to be more conservative when unsure: it often explicitly states intermediate results and can catch its own mistakes if the chain-of-thought is enabled. In Anthropic’s evaluation, Claude 3 achieved 60.1% on the MATH competition dataset (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum) (which has many complex algebra problems), meaning it got the correct final answer on about 60% of those tough problems – notably higher than GPT-4’s earlier scores (GPT-4 was ~55% on the same test) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). This suggests Claude may handle the step-by-step calculations slightly more accurately, possibly due to additional fine-tuning on math. Claude is also very good at procedural tasks like long division explained stepwise or unit conversions – it rarely messes up units or simple sums in word problems. In calculus or advanced math, if it knows the formula, it applies it correctly (e.g., computing a derivative). But like GPT-4, if asked to, say, compute $\pi$ to 50 decimal places or a large prime factorization, Claude cannot reliably do that without an external tool. A noteworthy strength: Claude’s huge context means it could incorporate a long table of numbers or a big chunk of a spreadsheet in the prompt and perform analysis on it. It could, for example, add up 100 numbers that were provided – something GPT-4 with a shorter context might not handle in one go. In summary, Claude’s calculation ability is excellent for typical mathematical problems (comparable to GPT-4, maybe slightly better on some arithmetic-heavy tasks), but it shares the same fundamental limitation of being a language model (prone to occasional arithmetic mistakes on very long calculations unless checked).
DeepSeek: DeepSeek’s approach to calculations is interesting because its reinforcement learning training likely imbued it with careful stepwise habits. In the benchmarks, DeepSeek-R1 achieved extremely high scores on math problem sets that require heavy calculation. For example, on MATH-500 (a set of 500 high-school math problems) DeepSeek hit 97.3% accuracy (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) – essentially solving almost all of them, on par with OpenAI’s top model performance. This indicates that DeepSeek can handle the required computations in those problems nearly flawlessly. Part of this success is due to its strategy: DeepSeek often uses “self-consistency” or majority voting in its reasoning – generating multiple solution paths and then choosing the most common answer (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). This helps catch calculation errors. Also, the RL training rewarded correct final answers, so the model learned to be meticulous with arithmetic to maximize its reward. In practice, DeepSeek will break calculations into smaller parts (e.g. it will first derive a formula, then plug in numbers, then simplify step by step). This mirrors how a human would double-check their math. As a result, its error rate on complex calculations is very low. There are anecdotal examples of DeepSeek correctly carrying out tedious computations: for instance, solving a complicated trigonometric equation by computing multiple steps of polynomial expansion without error (as evidenced by its accurate solving of contest algebra problems). One area DeepSeek might have an edge is precision: being open-source and focusing on reliability, it might internally utilize high-precision arithmetic libraries if integrated (though the current version still works like a standard LLM, so it doesn’t literally calculate beyond its learned abilities). However, DeepSeek Coder (a variation mentioned in documentation) presumably can use code or external tools, further boosting calculation accuracy (DeepSeek beats them all - Medium). Without external tools, DeepSeek is about as good as Claude/GPT-4 for calculation – the nearly perfect MATH-500 score speaks for itself (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). It’s worth noting that DeepSeek also did well on a coding benchmark (Codeforces Elo ~2029) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning), which indirectly tests math calculations in programming challenges. This means it can implement arithmetic in code correctly, another sign of strong calculation skills. Overall, DeepSeek’s calculation ability is extremely high, likely the best of the five in consistent accuracy, thanks to its training to “get the right answer” explicitly. Its weakness might be similar to others: if faced with a totally novel type of calculation not seen in training, it could struggle, but so far it has handled known types (algebraic simplifications, numeric approximations, etc.) very well.
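The self-consistency idea mentioned above can be sketched as follows; `sample_solution` is a hypothetical stand-in for repeatedly sampling a model's reasoning paths, since DeepSeek's internal procedure is not public in this form:

```python
# Hedged sketch of self-consistency / majority voting over sampled solutions.
from collections import Counter
import random

def sample_solution(seed):
    """Stand-in sampler: usually the correct answer (17), occasionally an arithmetic slip."""
    random.seed(seed)
    return 17 if random.random() > 0.2 else 17 + random.choice([-1, 1])

samples = [sample_solution(s) for s in range(16)]
answer, votes = Counter(samples).most_common(1)[0]
print(f"majority answer: {answer} ({votes}/{len(samples)} votes)")
```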
LLaMA: The base LLaMA model’s calculation ability is limited. It was not primarily trained to do math steps reliably, so it often makes arithmetic mistakes or fails to simplify expressions correctly. For example, LLaMA-2 (70B) without fine-tuning might incorrectly add two 3-digit numbers or make a logical leap in a multi-step word problem. It doesn’t have the refined step-by-step approach that larger instructed models have. As a result, on benchmarks like GSM8K or MATH, base LLaMA models scored very low. To quantify: one paper noted LLaMA-2 7B in zero-shot got only ~5% on GSM8K and near 0% on MATH (Common 7B Language Models Already Possess Strong Math ... - arXiv) (essentially failing unless prompted with the exact known solution format). However, with a simple chain-of-thought prompt and sampling multiple outputs, even LLaMA-2 7B could reach 40% on MATH (Common 7B Language Models Already Possess Strong Math ...) – showing that when it tries to do the steps, it can solve some. The 70B model, with more capacity, would do better (reports indicate LLaMA-2 70B can achieve around 50% on MATH with reasoning prompts and some voting scheme (Common 7B Language Models Already Possess Strong Math ...)). So, while base LLaMA is weak at calculation, it contains latent ability that can be unlocked via fine-tuning or prompting. In fact, specialized math versions of LLaMA exist (e.g. Open-source fine-tunes like Xwin-Math 70B reached ~31.8% on MATH and 87% on GSM8K (Xwin-LM/Xwin-Math-70B-V1.0 - Hugging Face), and other tuned models have even higher scores). These improvements come from training LLaMA on math solutions, essentially teaching it to calculate. But compared to GPT-4 or DeepSeek, even these fine-tunes are behind – they might excel at certain types of problems but lack the general reliability. In everyday use, if you ask a vanilla LLaMA model “What is 372 * 48?”, it might get it wrong or take a long path to answer. In contrast, GPT-4/Claude typically get it right quickly. So LLaMA’s calculation is the weakest here, unless it’s been specifically adapted. This reflects the Data/Information stage in DIKWP: LLaMA might mis-handle raw numerical data because it doesn’t inherently understand numbers beyond seeing them as tokens. It doesn’t automatically execute arithmetic algorithms without guidance. In summary, expect LLaMA to need external help (like tool use or user-provided steps) for anything beyond simple math. It’s an area where open models still trail closed models significantly.
Gemini: Google’s Gemini, particularly the larger “Ultra” and “Pro” variants, demonstrates excellent calculation ability, combining strengths of language models with hints of symbolic reasoning from DeepMind’s research lineage. On known benchmarks, Gemini has matched or exceeded the performance of GPT-4. For example, internal evaluations suggest Gemini-1.0 (Ultra) scored at human-expert level on broad tests (Introducing Gemini: Google’s most capable AI model yet), and on math-specific tasks it is very strong. Although exact GSM8K or MATH numbers for Gemini weren’t fully public at first, a research paper on creative math problem solving noted Gemini-1.5-Pro was a top performer ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems), which implies its calculation steps are accurate (since a novel solution still needs correct execution). Additionally, Gemini was built with multimodal and coding abilities in mind (Introducing Gemini: Google’s most capable AI model yet) – meaning it can potentially write code to solve math as well, similar to GPT-4’s Code Interpreter. Indeed, Google has shown that with tools, their models can achieve near 100% on many math benchmarks. Even without external tools, Gemini likely uses advanced techniques (perhaps borrowing from DeepMind’s AlphaCode for precise coding/math or using scratchpad methods internally). One concrete achievement: Gemini’s later version (Gemini 2.0 “Flash Thinking”) is explicitly optimized for reasoning and presumably calculation, indicating improved performance on tasks requiring step-by-step computation (Gemini 2.0 Flash Thinking - Google DeepMind). If we consider coding competition performance: Demis Hassabis mentioned an AlphaCode successor as part of Gemini, which performs better than 85% of human participants on programming competitions (Introducing Gemini: Google’s most capable AI model yet). Programming contests often involve heavy math computations, so this suggests Gemini can implement complex calculations accurately. All evidence points to Gemini’s calculation ability being on par with the best: it executes multi-step solutions correctly at a high rate. It might still make the occasional arithmetic slip (no LLM is immune without a calculator), but its error rate is very low for benchmark problems. For instance, if GPT-4 solved ~85% of a test and Claude 95%, Gemini would be in that range or even perfect on some, given claims that it outperforms others in reasoning tasks (Google's Gemini vs GPT-4: Full Performance Comparison - TextCortex). We can say Gemini efficiently handles the Processes stage of DIKWP: once it knows what needs to be done, it can carry out the procedure reliably. This is backed by Google’s statement that Gemini significantly improved on advanced reasoning benchmarks requiring careful step execution (Introducing Gemini: Google’s most capable AI model yet). In summary, Gemini’s calculation ability is top-notch, rivaling DeepSeek’s and Claude’s – it’s able to solve high-complexity math problems with correct computations most of the time.
Example – Calculation Process: As an illustration of calculation differences, consider a problem: “Compute the sum of the first 50 positive even integers.” The correct approach is to realize this is $2+4+6+\dots+100$, which sums to $50 \times 51 = 2550$.
GPT-4/Claude/Gemini will likely derive a formula (either using the arithmetic series formula or recognizing it’s $2(1+2+...+50)$) and get 2550 with a clear explanation. They might do a quick calculation: $1+2+\dots+50 = 1275$, then double it to get $2550$. Each step would be correct.
DeepSeek would similarly get it right, possibly even faster or with a straightforward formula reference (it might say “Using formula $S_n = n(n+1)/2$, we get $S_{50}=1275$, so double is $2550$” – ensuring no arithmetic error).
LLaMA, if not primed, could stumble. It might start adding: $2+4=6$, then add 6 to get 12, etc., but likely would not persist correctly through 50 terms. It might give up with a wrong intermediate sum or copy a pattern incorrectly. With a chain-of-thought prompt, LLaMA might state the formula as well, but it could miscalculate $50 \times 51$ (imagine it mistakenly says $50 \times 51 = 2500$ due to a minor multiplication slip).
This simple example demonstrates that the top models not only choose the right method but also execute the calculation flawlessly, whereas a weaker model might fumble either in method or execution.
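A quick script confirms both the direct sum and the closed form used above:

```python
# Direct check of the worked example: sum of the first 50 positive even integers.
evens = list(range(2, 101, 2))   # 2, 4, ..., 100 (50 terms)
assert len(evens) == 50
print(sum(evens))                # 2550
print(50 * 51)                   # closed form 2 * (50*51/2) = 50*51 = 2550
```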
Summary: In terms of raw calculation skill, DeepSeek, Gemini, Claude, and GPT-4 form the top group – all can correctly solve the large majority of math calculations they encounter, especially with techniques like self-verification. Claude has shown an edge in arithmetic word problems (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum), DeepSeek and Gemini in contest-level problems (reaching ~97% accuracy on test sets) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning), and GPT-4 is very strong but occasionally a bit less optimized for arithmetic (relying more on reasoning than calculation per se). LLaMA, unless enhanced, significantly lags in calculation reliability. A visual comparison (e.g. a bar chart of success rates on a math calculation benchmark) would show LLaMA near the bottom, and the others clustered near the top (within 10–15% of each other on accuracy). As tasks grow in complexity (more steps), GPT-4/Claude/Gemini maintain high accuracy up to a point, but eventually even they decline for extremely long computations, reflecting the need for external calculators for absolute precision. Notably, using semantic math frameworks (like explicitly separating data and calculation steps) can improve reliability – researchers found that forcing models to verify each step can drastically cut down errors. This aligns with DIKWP’s emphasis on process: all models benefit from a structured approach to calculation, and the best ones effectively internalize a lot of that structure already.
3. Proof Ability

Proof ability measures whether an LLM can derive and articulate proofs for mathematical statements. This is a stringent test of mathematical reasoning: it requires not only logical consistency but often creative insight and deep understanding of definitions and prior results. A model with strong proof ability should be able to prove known theorems (especially those within reach of undergraduate mathematics) without simply regurgitating a textbook proof, and even tackle novel propositions by applying general methods. We examine how our five models perform in writing proofs and whether they can contribute to proving new results.
Comparative Performance:
GPT-4: GPT-4 marked a big improvement in proof generation over its predecessors. It has enough training data to have seen many common proofs (from high school geometry proofs to famous theorems like the Pythagorean theorem, fundamental theorem of calculus, etc.), and it generally can reproduce or adapt those proofs correctly. For example, GPT-4 can prove that $\sqrt{2}$ is irrational by contradiction (a well-known proof) with no trouble, likely because it has essentially memorized that classic argument. When asked to prove a slightly less standard statement, GPT-4 will try to draw upon analogous examples it “knows.” Its ability to chain logical implications helps it here. There are reports that GPT-4 was able to solve and prove many problems from math Olympiads or competition datasets when guided with hints. However, GPT-4’s proofs can sometimes lack rigor or contain subtle gaps if the problem is complex. It might also state something is “obvious” when it actually requires justification (a common AI failing in proofs). For instance, GPT-4 might assert a step like “by symmetry, we can assume WLOG $x \le y$” or “it is known that such-and-such function is convex, hence...” without proof, expecting the reader to accept it. In formal proof standards, these would be gaps. In one anecdotal evaluation, GPT-4 (March 2023 version) was tested on some nontrivial theorem proofs and often produced arguments that sounded plausible but were ultimately flawed (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). For example, it attempted to “prove” $\pi$ is irrational by citing that $\pi$’s decimal expansion is non-repeating, which is true but not a valid proof since it’s a consequence of irrationality, not a cause (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). This indicates that GPT-4 sometimes mixes up the direction of logic or relies on known results without proper justification. With iterative prompting (asking it to check steps), GPT-4 can refine such proofs to be correct, but it might require user intervention or a lot of self-correction. On the positive side, GPT-4 integrated with tools has been used in formal theorem proving environments (like Lean or Isabelle) with some success – it can suggest lemmas and steps that a proof assistant then verifies or rejects. This shows GPT-4 has the raw material for proofs, even if it isn’t 100% reliable by itself. As for new proofs: GPT-4 hasn’t independently discovered any truly new mathematical theorems to the best of our knowledge. It works within known techniques. For unsolved problems, GPT-4 will either honestly say the problem is unsolved or attempt a proof and inevitably fail (sometimes producing a long, creative but incorrect proof). So its innovation in proofs is limited. Yet, GPT-4 is an invaluable aide for known proofs – it can provide correct proofs for the majority of textbook propositions in algebra, calculus, number theory up to a certain difficulty, with a quality that a student might produce (occasionally even at the level of a well-written solution by a teacher, albeit with some verbosity). In summary, GPT-4’s proof ability is strong for established mathematics (given the breadth of its knowledge), but it is not infallible in rigor and doesn’t truly invent new proofs beyond recombining known ideas.
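For reference, the classic contradiction argument mentioned above (the kind of proof GPT-4 reliably reproduces) runs as follows:

```latex
% Classic proof that sqrt(2) is irrational, by contradiction.
Suppose $\sqrt{2} = p/q$ with integers $p, q$, $q \neq 0$, and $\gcd(p, q) = 1$.
Squaring gives $2q^2 = p^2$, so $p^2$ is even and hence $p$ is even; write $p = 2k$.
Then $2q^2 = 4k^2$, i.e.\ $q^2 = 2k^2$, so $q$ is even as well,
contradicting $\gcd(p, q) = 1$. Therefore $\sqrt{2}$ is irrational. $\blacksquare$
```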
Claude: Claude’s proof-writing skills are comparable to GPT-4’s. It tends to write very clear and structured proofs when it knows how. Claude often begins by restating the problem in its own words (ensuring it understands it) and then outlines the proof strategy (“We will prove this by contradiction/induction…” etc.). This approach can lead to well-organized proofs. Claude 2 and 3 have demonstrated the ability to prove statements from scratch if they are within the model’s knowledge. For example, Claude can prove things like the sum of angles in a triangle is 180° using a straightforward Euclidean geometry argument, or prove simple number theory lemmas via induction. It was also evaluated on benchmarks like Big-Bench Hard (BBH) which includes logical and proof-style questions, and Claude 3 scored very high (86.8% on BBH, where many tasks require reasoning akin to proofs) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). This suggests that when asked to justify an answer fully, Claude is quite capable. One notable aspect: Claude is trained to avoid hallucinating facts, so in proofs it usually doesn’t assert something it isn’t fairly sure about. This caution can be beneficial – if Claude doesn’t recall a needed theorem, it might say “I’m not certain, but I suspect we can use X theorem here” rather than just forging ahead incorrectly. However, Claude might sometimes refuse or hesitate to provide a proof if it thinks it might be too lengthy or if it interprets it as potentially problematic (though usually math proofs are fine). When it does write proofs, the level of rigor is generally good. Like GPT-4, it can have small gaps: e.g., it might not fully justify a claim about limits or integrals if it expects the reader to know standard results. In terms of novel contributions, Claude hasn’t shown an ability to prove open conjectures either. It does not have a mechanism to genuinely discover unknown proof strategies beyond recombination of what it was trained on. That said, one could argue Claude’s synthetic training data might include some creative proofs. Anthropic did mention using synthetic data to train Claude 3, which could include cleverly engineered proof problems. As a result, Claude might sometimes present a proof approach that isn’t the most common one (demonstrating some “creativity” by picking up less usual proofs from its data). For instance, it might prove a geometry theorem using linear algebra if it has seen that approach in training, which a human student might not think of. This is a form of creativity within known math. Overall, Claude’s proof ability is robust and reliable for a wide range of known problems, roughly on par with GPT-4. Both GPT-4 and Claude can produce correct proofs for the majority of Olympic-level math problems that have known solutions, though they might need some coaxing. On truly demanding proofs (like a complicated real analysis proof), they might struggle to maintain complete rigor.
DeepSeek: Proof generation is arguably DeepSeek’s strongest suit. The project’s focus on reasoning means it was directly tested on proving theorems. As mentioned, DeepSeek v3 (and R1) delivered impressive proofs for several well-known theorems (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). The fact that it tackled “every algebraic equation with integer coefficients has a complex root” (the Fundamental Theorem of Algebra) and did so correctly up to minor omissions (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) is noteworthy. FTA is not trivial; it usually requires some complex analysis or advanced argument. DeepSeek’s proof referenced Liouville’s theorem (a result from complex analysis) to complete the argument (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). While one could critique it for relying on a known theorem rather than proving that sub-lemma, this is perfectly acceptable in a normal proof context – human mathematicians also invoke known results. The key is it knew what to invoke and how to structure the overall proof, which it did. Similarly, for the angle trisection impossibility, DeepSeek provided a field theory proof, which is the standard modern approach (using the idea that solving a cubic by radicals is not possible in that construction) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). This indicates not only knowledge but also the capability to deploy the right high-level strategy. DeepSeek’s proofs were also well-formatted (nicely typeset, logically paragraphed) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), making them easy to follow – a sign that it isn’t just spewing a dense mass of text, but truly organizing the reasoning. The consistency and detail suggest a deep semantic understanding of the proof tasks; DeepSeek isn’t just matching a question to a memorized proof, because these theorems have various known proofs and it had to choose one and carry it through properly. In the case of π’s irrationality, presumably DeepSeek gave a valid proof (like the classical proof using infinite descent on the sine integral or something), avoiding the trap ChatGPT fell into (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). The author evaluating DeepSeek was “stunned at the quality” of the proofs (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), underscoring just how far this model pushed the envelope. DeepSeek’s reinforcement learning could have allowed it to discover some proof techniques on its own (by maximizing a reward for correct proofs). However, it’s still bounded by the mathematics it was exposed to. It likely hasn’t proven any open problems either. But its training might allow some original combinations: since it can query itself and refine, it might end up proving a proposition via a less direct route that still works. One could test DeepSeek on a new conjecture that’s within reach (like something that was unsolved until recently). There is not yet evidence it can solve genuinely unsolved problems without human insight. But for already known problems, even difficult ones, DeepSeek is arguably the most reliable at producing a correct proof among these models. 
It is less likely to produce a bogus argument because the RL training would have penalized that. Quantitatively, while “proof ability” is hard to score, DeepSeek’s near-perfect success on a set of 500 math problems (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) implies it essentially “proved” or solved almost all of them. Its answers likely included reasoning, not just final answers, given the nature of those problems. So one can infer its proof success rate is extremely high on benchmark sets, clearly above GPT-4’s initial performance (GPT-4 needed extra methods to catch up, as with the code verifier trick reaching 84%). In summary, DeepSeek is an outstanding proof generator for known math – it writes full proofs that are largely correct and well-structured, outshining what earlier LLMs could do. It embodies DIKWP’s Knowledge and Wisdom levels strongly in proof tasks, ensuring each step follows logically from prior knowledge.
LLaMA: By itself, LLaMA has minimal proof capability. It doesn’t reliably produce multi-step coherent arguments without specific fine-tuning. If asked to prove a theorem, a base LLaMA model might produce a confused or shallow answer (e.g., repeating the question or giving a one-line “explanation” that isn’t a real proof). However, if LLaMA is fine-tuned on a lot of proof data or given a careful prompt structure (like “Let’s proceed step by step: First, ...”), it can attempt proofs. Its success will still be limited to simpler problems. For example, an instruction-tuned version of LLaMA might manage a straightforward induction proof (like proving $1+2+\cdots+n = n(n+1)/2$ by induction) because that pattern is common in training data. It might also recall some geometric facts to justify a simple geometry claim. But anything beyond the basics tends to be out of reach. LLaMA lacks the global view often needed for proofs – it doesn’t know how to plan a proof strategy well; at best it can do local step-by-step if it knows the steps. There has been research where LLaMA is augmented for theorem proving (for instance, using it in combination with an automated proof checker and feedback). Those approaches show some promise, but with heavy guidance. Open-source math models (like ones fine-tuned on the Lean theorem proving traces) exist, but again, the user’s list specifically names just “LLaMA”, likely meaning a generic version. So, standing alone, LLaMA’s proof ability is rudimentary. It can verify very simple statements or spit out known identities, but it cannot handle complex proofs. When it tries, it often lapses into hallucination – making claims that are untrue or citing theorems that don’t apply. This is expected: proving theorems is a high-order cognitive task that emerges from fine-tuning and massive training, which LLaMA (unlike GPT-4) didn’t fully get for math. To phrase it differently: if we imagine a “proof-writing competition”, LLaMA would rank last by a wide margin. It might not meaningfully complete most proofs without errors. Thus, LLaMA essentially doesn’t contribute novel proofs or even correct known proofs unless they are trivial.
Gemini: Given DeepMind’s involvement and prior work on math (like the AlphaMath, Minerva, etc.), Gemini was designed with strong reasoning which extends to proofs. In the CreativeMath benchmark introduced by researchers, Gemini-1.5-Pro outperformed all other LLMs in generating novel solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems) ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). A “novel solution” in that context often is effectively a proof or derivation using a creative approach. That directly speaks to proof ability – not only can Gemini produce correct solutions, it can do so in new ways. This suggests that if a problem has multiple proof methods, Gemini is adept at finding more than one. For example, consider a problem that can be solved by either a combinatorial argument or an algebraic one: Gemini might be able to produce both if asked. That is a remarkable level of proof flexibility, hinting at deep understanding. Additionally, Google has integrated formal mathematics into some of its models. It wouldn’t be surprising if Gemini was at least partly trained on proof corpora (like the proof wiki, arXiv papers, or formal proofs). Its 90% MMLU score includes categories like abstract algebra and high school math, which involve justification of answers (Introducing Gemini: Google’s most capable AI model yet). Achieving such a high score likely means it knew not just the final answers but the reasoning to get there. Moreover, DeepMind’s earlier model Minerva (2022) specifically tackled math questions with detailed solutions, and one can view proof-like solutions from it. Gemini, being a successor, presumably inherits that expertise and improves on it. We also know from Google’s blog that Gemini Ultra got 59.4% on the MMMU benchmark (a multimodal reasoning set) (Introducing Gemini: Google’s most capable AI model yet) – while not directly math proofs, it indicates tackling complex multi-domain problems which often require explanatory answers. It’s plausible that Gemini is as good as or better than GPT-4 in proof tasks. If given a theorem to prove, Gemini will likely produce a structured proof with a clear narrative. It might even reference relevant prior results or definitions properly. If any model among these five were to someday assist in a new proof discovery, Gemini (or its future iterations) is a candidate, given DeepMind’s explicit goal of pushing AI toward scientific discovery. Already, in an experimental setting, Gemini has been observed to provide innovative solution steps for competition problems that other models missed ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). That is essentially providing a different proof. However, similar to others, Gemini does not on its own generate entirely unknown theorems or proofs. It still operates within known math; its “creativity” is recombinatory. But the recombination seems especially effective, possibly owing to training signals encouraging exploration of multiple reasoning paths. To summarize, Gemini’s proof ability is excellent – it can generate correct and even diverse proofs for difficult problems, making it arguably the most versatile proof-writer in this set. It matches DeepSeek in rigor and possibly exceeds GPT-4/Claude in coming up with alternate approaches.
Evaluation and Examples:
To illustrate proof capabilities, consider the famous statement: “No general angle can be trisected with straightedge and compass.” This is a theorem that requires a proof by abstract algebra (showing it’s equivalent to solving a cubic equation that is not solvable with radicals). Here’s how the models fare:
GPT-4: Might recall that this relates to Galois theory. It could state: “Assume we can trisect angle θ, then we can construct $\cos(\theta/3)$ from $\cos(\theta)$. That leads to solving $4x^3-3x-\cos\theta=0$. For a general angle (like $\theta = 60°$), this cubic has no rational roots, implying the extension degree is 3, which is not a power of 2, so not constructible. Thus a general angle can’t be trisected.” This is a plausible outline GPT-4 might give. It’s essentially correct, but GPT-4 might not fill all details (like why degree 3 implies not constructible – that’s a known result but needs citation). It might say “using Galois theory” as a justification without elaboration (which is what ChatGPT did before (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar)). Still, GPT-4 would hit the key points.
Claude: Claude could produce a similar proof. It might even be more verbose, explaining what constructible numbers are and how they correspond to quadratic extensions. It might say: “Any constructible angle’s cosine must lie in a field obtained by successive square-root extensions of $\mathbb{Q}$. Trisecting an angle leads to an irreducible cubic equation for $\cos(\theta/3)$, which cannot be solved by those extensions. Therefore, the trisection is impossible.” Claude might err on the side of explaining too much or repeating itself, but it would convey the idea.
DeepSeek: In fact, DeepSeek was tested on this exact proof and produced a highly accurate proof including the polynomial $4x^3-3x-\cos\theta=0$ and reasoning about field extension degrees (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). The evaluation noted it was essentially correct, just missing an explicit statement that degrees of extensions multiply (which is a minor detail) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). So DeepSeek nailed this proof, arguably better than what GPT-4/Claude might do without specific guidance. It explicitly constructed the example $\theta=\pi/3$ (so $\cos(\theta) = 1/2$) and showed it leads to $8x^3-6x-1=0$, which has no rational (and thus no constructible) solutions (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). That level of detail shows DeepSeek doesn’t skip the crucial parts.
LLaMA: It likely cannot deliver this proof meaningfully. It might vaguely say “Angle trisection is impossible because it’s a well-known fact from abstract algebra” or confuse it with something else. In short, LLaMA fails here.
Gemini: Gemini would likely succeed similarly to DeepSeek. Given its prowess, it might even cite Wantzel’s theorem (the formal result from 1837 proving angle trisection impossibility). It could say: “By Wantzel’s theorem, an angle is constructible iff the equation for its cosine is solvable by square roots. The cubic $4x^3-3x-\cos\theta=0$ is not solvable by radicals for generic θ, implying no construction.” This would be a top-notch answer referencing known theory. If we asked for a different approach, Gemini might even mention an analytic proof, but the algebraic one is standard.
This example highlights that DeepSeek and likely Gemini provide the most rigorous and explicit proofs, GPT-4 and Claude provide correct but slightly more high-level proofs (which might be acceptable to many but not fully rigorous by academic standards in every detail), and LLaMA cannot handle such proofs.
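For reference, the algebraic reduction that the stronger answers above all rely on follows from the triple-angle identity:

```latex
% Reduction of angle trisection to a cubic equation.
\cos\theta = 4\cos^3(\theta/3) - 3\cos(\theta/3)
\quad\Longrightarrow\quad
4x^3 - 3x - \cos\theta = 0, \qquad \text{where } x = \cos(\theta/3).

% Specializing to \theta = 60^\circ (so \cos\theta = 1/2) and clearing denominators:
8x^3 - 6x - 1 = 0.

% This cubic has no rational roots, hence is irreducible over \mathbb{Q}, so
% [\mathbb{Q}(\cos 20^\circ) : \mathbb{Q}] = 3, which is not a power of 2;
% therefore \cos 20^\circ is not constructible and a 60-degree angle cannot be trisected.
```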
Summary: On proof tasks, DeepSeek and Gemini are the leaders – DeepSeek has demonstrated near human-expert proof writing (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), and Gemini is reported to generate the most novel and high-quality solutions on creative math problems ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). GPT-4 and Claude are not far behind; they cover a vast range of proofs correctly, but might occasionally falter on very complex or unfamiliar proofs, or leave minor gaps. LLaMA is far behind unless heavily fine-tuned. None of the models can independently prove open problems (like the Riemann Hypothesis or other unsolved conjectures) – they simply lack the true mathematical creativity or new insight for that. They also don’t have the ability to verify a novel proof with absolute certainty unless using an external formal checker. However, they can be extremely helpful in exploring proofs: for instance, generating conjectures or checking steps in known proofs. The semantic understanding component (next section) is crucial in proofs – the best models actually understand the meaning of statements well enough to know which theorems or methods apply, rather than randomly guessing. This semantic depth is what differentiates the top models from the rest in proof capability.
4. Semantic Understanding of Mathematical Concepts

Semantic understanding in math means truly grasping the meaning and relationships behind the symbols and statements, not just manipulating them syntactically. A model with semantic understanding knows, for example, that “prime number” means an integer greater than 1 with no divisors besides 1 and itself – and it uses that concept correctly when needed. It can distinguish between similar concepts (like convergence vs. divergence, or different interpretations of “range” in statistics vs. functions) and handle variations in problem wording. In the DIKWP sense, this is about moving from Data (raw symbols) and Information (extracted problem details) to genuine Knowledge of mathematical concepts and their context.
Comparative Performance:
GPT-4: Among current models, GPT-4 has a very high level of semantic understanding of mathematics. This comes from its extensive training on textbooks, StackExchange answers, research papers, etc. GPT-4 not only knows definitions but often can explain concepts in its own words and make connections between them. For instance, GPT-4 knows that “a continuous function on a compact set is uniformly continuous” (a theorem from real analysis) and more importantly, it understands what each of those terms means well enough to apply the theorem when appropriate. If asked a question like “Does the intermediate value theorem apply to this scenario?”, GPT-4 will correctly recall what the theorem means (a continuous function taking values of opposite sign will have a root in between) and check if the scenario meets the conditions. It rarely confuses unrelated concepts – it wouldn’t mix up the intermediate value theorem with, say, Rolle’s theorem. This shows solid semantic grounding. Additionally, GPT-4 understands problem contexts. If a problem is phrased in a tricky way or uses novel wording, GPT-4 often can parse it. However, there are limits: researchers found that if math problems are perturbed in ways that change their surface form (especially “hard perturbations” that require a different approach), models like GPT-4 see a 10–25% drop in accuracy (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations). This indicates that while GPT-4 understands many concepts, it can still be biased by how it has seen problems formulated during training. If a problem is presented in an unfamiliar way, GPT-4 might misinterpret it or fail to see the connection to a known concept. For example, a cleverly worded puzzle that hides a simple concept in convoluted language might stump it. That said, compared to older models, GPT-4 is much better at handling synonyms or rephrasings. It usually understands that “find the roots of the equation” means solve for zeros, or that “evaluate the integral” implies find an antiderivative and plug limits, etc. This kind of semantic flexibility is high. GPT-4 also has a broad knowledge of notation – it knows $f'(x)$ is the derivative, $\forall$ means “for all”, etc., and it doesn’t easily get confused by unusual notation if context clarifies it. One area of semantic depth is word problems: GPT-4 does well in translating English descriptions into mathematical expressions, showing it grasps the semantics of the scenario. For instance, a complex word problem about rates and proportions, GPT-4 will identify the relevant quantities and equations, not just hunt for numbers. Overall, GPT-4’s semantic understanding is deep and robust, but not perfect. It can sometimes miss subtle context cues or uncommon interpretations. For example, if a problem relies on a real-world convention (like “in finance, rate of return refers to X”), GPT-4 might not know that unless it’s in training. But on pure math semantics, GPT-4 is strong.
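As a small illustration of that translation step (the word problem and variable names here are hypothetical, and sympy stands in for the model's symbolic reasoning), consider turning a rate problem into an equation and solving it:

```python
# Hypothetical word problem: "Two trains start 300 km apart and travel toward each
# other at 70 km/h and 80 km/h. After how many hours do they meet?"
import sympy as sp

t = sp.symbols("t", positive=True)   # time in hours (the quantity being asked for)

# Semantic translation: the distances covered must add up to the initial separation.
meet_time = sp.solve(sp.Eq(70 * t + 80 * t, 300), t)
print(meet_time)                      # [2]
```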
Claude: Claude similarly has a strong grasp of mathematical semantics. It was trained on a lot of the same kind of data. Claude might even have an edge in natural language nuance, which can help with semantic understanding of word problems. It is very good at reading comprehension, which translates to parsing math questions accurately. Claude has shown an ability to maintain the context of a problem well – thanks to its long context, it can refer back to earlier parts of a problem statement or conversation, ensuring it keeps track of definitions and conditions. For example, if earlier in a conversation we define “Let’s call a number ‘zortil’ if it’s a multiple of 7 or 11”, later on, Claude will correctly use that custom term in reasoning (e.g., “14 is zortil because it’s a multiple of 7”). This indicates it’s building a semantic map of new concepts on the fly, which is impressive. In terms of known math, Claude knows definitions and the “essence” of concepts nearly as well as GPT-4. It likely knows that “orthogonal” in a vector space means a zero inner product (a 90-degree angle), and it will use that meaning properly (not confuse it with, say, the orthogonal polynomials concept, unless context suggests those). One thing Anthropic emphasized is reducing hallucinations; semantic errors often lead to hallucinations. By curbing that, Claude tends to double-check concept applicability. If it’s unsure about a concept, it might clarify or give a conditional answer. That caution can sometimes degrade performance slightly (maybe Claude might be a bit more hesitant or verbose in explaining its interpretation). But it generally interprets problems correctly. On the MMLU benchmark which tests broad knowledge including math, Claude 2 and 3 scored in the mid-80s (out of 100) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum), similar to GPT-4. That implies high semantic understanding across subjects. For math specifically, Claude’s nearly best-in-class GSM8K score (95%) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum) also confirms it almost always correctly understands grade-school problem semantics (units, contexts like age puzzles, etc.). Harder semantics, like competition problems that have tricky wording, might still trip it occasionally, but not often. All in all, Claude’s semantic understanding is on par with GPT-4’s. If there is a difference, it might be that Claude sometimes injects more explanation of meaning (“Here n is an integer, so we interpret...”) – which is helpful to ensure clarity – whereas GPT-4 might just proceed if it’s confident.
DeepSeek: DeepSeek’s performance suggests strong semantic understanding within the domain of math and programming. It was specifically evaluated on “Google-Proof” QA (GPQA Diamond) and did extremely well (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). GPQA tasks are designed to be challenging questions that require expert-level understanding, where casual knowledge or surface cues aren’t enough. DeepSeek’s high score on that (comparable to OpenAI’s best) indicates it can understand the nuances of expert questions. DeepSeek’s proof successes also highlight semantic depth: it recognized exactly which concepts/theorems to apply for each theorem proof, meaning it understood the essence of those problems (e.g., angle trisection is about constructible numbers, FTA is about complex analysis, etc.). These aren’t just keyword matches; it needed conceptual understanding. Additionally, DeepSeek’s ability to produce nicely typeset solutions (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) implies it understands mathematical notation semantics well – when to use symbols, how to format equations logically, etc. That may seem cosmetic, but it reflects an internal grasp of structure. DeepSeek was also tested on MMLU and Codeforces (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). On MMLU, presumably it scored very high (the blog suggests it’s competitive with the state of the art) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). This means it answered conceptual questions in math, physics, etc., which requires understanding the material, not just rote recall. One potential limitation: DeepSeek is very new and possibly specialized. If taken outside of typical math contexts, say an interdisciplinary word problem that mixes science and math, it might be slightly less adaptable than GPT-4/Claude. But within math, its semantic grasp is excellent. It likely has somewhat weaker world-knowledge semantics (if a problem references a real-life scenario it hasn’t seen, GPT-4 might handle it more gracefully), but for pure math semantics, DeepSeek is at least as good as its peers. Importantly, its robustness to perturbation may be improved by RL training – it might have learned to genuinely solve problems rather than rely on superficial cues (because RL would encourage consistent success even if wording changes). The MATH-Perturb study found models are biased toward original phrasing (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations), but newer models are improving. DeepSeek wasn’t explicitly mentioned there, but given its timeline (released 2025) and goals, it likely tackled that robustness. Summing up, DeepSeek possesses deep semantic understanding of math concepts, enabling it to interpret and solve problems that stump simpler pattern-matching models. It truly “understands” what the problem is asking for in most cases, rather than just hunting for familiar patterns.
LLaMA: LLaMA’s semantic understanding of mathematics is limited in base form. It certainly “knows” basic definitions that appear frequently in text (like what a prime is, what a derivative is, etc.) because those appear often in training. But it might not have a firm grasp on more complex or less common concepts. For example, LLaMA might confuse similar terms or not realize when a concept applies. It might treat a problem superficially – for instance, if asked “Is the function $f(x) = |x|$ differentiable on $\mathbb{R}$?”, an uninformed model might say “yes, because it’s simple” not realizing the corner at 0 is an issue. LLaMA base might fall for that, whereas the more tuned models would recall the nondifferentiability at 0. This highlights a lack of depth in concept understanding for base LLaMA. If fine-tuned on instructions, it gets somewhat better – the model might recall key properties (maybe it learned from tuning data that absolute value is not differentiable at 0). But even then, it’s not consistent. Another example: slight rephrasing can confuse it. A user might ask a question in an unusual way, e.g., “Consider a bijective function on a finite set, what can we say about its surjectivity?” A knowledgeable entity knows bijective means both injective and surjective by definition, so the answer is trivial (“it is surjective by definition of bijection”). GPT-4 or Claude will state exactly that. LLaMA might not make the connection and could start waffling about definitions or give an incorrect answer like “We can’t say without more info.” So LLaMA’s semantic graph is weak – it doesn’t always connect related ideas unless it saw them together during training. Another clue: open-source benchmarks indicated LLaMA’s performance drops significantly with even simple problem perturbations (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations), showing it was likely overfitting to the style of problems it saw. Without further fine-tuning, LLaMA might rely on shallow cues (like specific keywords) rather than true understanding. This is why its accuracy plummets if those cues are changed (10-25% drop was observed for several models on perturbed tasks (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations)). LLaMA likely falls in that bucket of “models that suffer big drops under rewording,” whereas GPT-4/Claude suffer less. So, LLaMA’s semantic understanding is weak to moderate – good enough for very common concepts and straightforward queries, but unreliable for anything requiring nuanced comprehension or less-seen concepts.
Gemini: Gemini was explicitly built to integrate knowledge across modalities and domains, which implies a very robust semantic understanding. It likely has the broadest and deepest conceptual grounding of all five models. Google’s statements emphasize that Gemini “can combine different types of information and is our most flexible model yet” (Introducing Gemini: Google’s most capable AI model yet). In math terms, that means it can interpret problems that involve, say, a diagram plus text or a mix of science and math context, understanding all parts. Even focusing just on math text, Gemini benefits from Google’s vast training data and DeepMind’s research. It presumably has digested mathematical definitions and theorems from countless sources. One piece of evidence: Gemini Ultra surpassed human experts on MMLU (Introducing Gemini: Google’s most capable AI model yet), which has many concept-oriented questions. That implies top-tier conceptual understanding. Another: Gemini outperformed others in the CreativeMath benchmark where problems were presented in varied ways and required innovative thinking ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). Outperforming there means it understood the problems well enough to come up with new solutions, which is a strong test of semantic grasp. Also, recall that Apple’s study found older models brittle; by late 2024, models like Gemini 2.0 Flash were introduced to specifically address reasoning and understanding issues (Gemini 2.0 Flash Thinking - Google DeepMind). We can expect Gemini 2.0 or similar to incorporate improvements that make it less sensitive to rephrasing or unusual input. In practice, if you present Gemini with a highly novel word problem or a question that requires multi-step semantic inference (like a puzzle that requires figuring out the hidden trick), it stands a good chance of parsing it correctly. It might still be limited by knowledge (if something is completely outside what it’s seen, like a newly coined term, it might struggle), but given the extent of data, it’s likely seen most concepts. Gemini likely also excels in cross-domain semantics: if a problem uses a real-world scenario to test math (like a physics setup requiring calculus), Gemini understands the physics context and the math, bridging them smoothly. So among the five, Gemini probably has the most comprehensive semantic understanding, which is one reason it’s so strong on reasoning tasks.
Implications and Examples:
Semantic understanding is what prevents mistakes like misinterpreting a problem’s requirement or mixing up concepts. For example, consider the problem: “If $f(x)$ is an even function and continuous, and $\int_{-3}^{3} f(x)\,dx = 10$, what is $\int_{0}^{3} f(x)\,dx$?”. The semantic key is knowing that for an even function, $f(x)$ is symmetric ($f(-x)=f(x)$), so the integral from -3 to 3 is twice the integral from 0 to 3. Therefore the answer should be 5. (A quick symbolic check of this identity is sketched after these examples.)
GPT-4/Claude/Gemini/DeepSeek: All four will almost certainly realize this and answer 5, likely explaining that since $f$ is even, $\int_{-3}^{3} f(x)dx = 2\int_{0}^{3}f(x)dx$, so $\int_{0}^{3} f(x)dx = 5$. This shows understanding of the concept “even function” and how it affects integrals. They don’t need to do any heavy calculation, just concept application.
LLaMA: A weaker model might misunderstand. It could, for instance, mistakenly think maybe the answer is 10 (not realizing to halve it). Or it might get it right if it recalls a similar problem, but without confidence. LLaMA could say “The integral from 0 to 3 is also 10” which would be wrong, indicating it didn’t semantically process the property of evenness correctly. This kind of slip would be common for models without strong concept grounding.
Another example: terminology disambiguation. Ask: “In the context of sequences, what does it mean for a sequence to be Cauchy?” A semantically strong model will say something like: “A sequence $(a_n)$ is Cauchy if for every $\varepsilon>0$, there exists an $N$ such that for all $m,n > N$, $|a_n - a_m| < \varepsilon$. Essentially, the terms of the sequence get arbitrarily close to each other as the sequence progresses.” A weaker model might confuse it with convergence or not recall the precise definition. GPT-4, Claude, DeepSeek, Gemini all know the exact definition and concept. LLaMA might approximate it or mix it up with convergent sequence (which is related but not identical in a general metric space context).
Robustness to problem phrasing: The best models handle when a problem is presented in an unusual way. For instance, a contest problem might be written in a story format: “Alice and Bob are playing a game with sticks… [leading to a combinatorial question].” A model with good semantic understanding identifies the underlying math (maybe it’s a nim-sum problem or a combinatorial enumeration) rather than getting lost in the story. GPT-4 and friends do that; LLaMA might get stuck on irrelevant details from the story if not guided.
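The even-function identity used in the first example above is easy to verify mechanically. The following is a minimal symbolic sketch (assuming SymPy is available; the test function $x^2+1$ is an arbitrary even example chosen for illustration, not taken from any benchmark):

```python
# Minimal sanity check of the even-function identity used in the example above:
# for an even f, the integral over [-3, 3] equals twice the integral over [0, 3].
import sympy as sp

x = sp.symbols('x')
f = x**2 + 1  # arbitrary even test function: f(-x) == f(x)

assert sp.simplify(f.subs(x, -x) - f) == 0  # confirm evenness

full = sp.integrate(f, (x, -3, 3))
half = sp.integrate(f, (x, 0, 3))
assert full == 2 * half  # the symmetry identity

# So if the full integral were 10, the half integral would be 10/2 = 5.
print(sp.Rational(10, 2))  # prints 5
```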
Summary: Semantic understanding is the backbone that allows LLMs to generalize to new problems. Gemini likely ranks first here, with GPT-4, Claude, DeepSeek all close behind (each might have slight advantages in different niches – GPT-4 broad training, Claude careful reading, DeepSeek focused domain mastery). LLaMA lags significantly in semantic comprehension without further fine-tuning. This can be visualized in a radar/spider chart of “Conceptual Depth” where LLaMA’s area is much smaller, and the others nearly fill the chart. As models integrate frameworks like DIKWP (explicitly guiding them from data to knowledge), their semantic understanding improves further. In fact, one report notes that using a DIKWP-based approach to break down tasks improved reliability without huge computational cost (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文), underscoring that when models are made to explicitly consider data vs. information vs. knowledge, they perform more semantically correct reasoning. The top models already implicitly do some of this, which is why they have high understanding. Going forward, embedding semantic frameworks into their reasoning could push them even closer to human-level comprehension in math.
5. Creative Reasoning
Creative reasoning in mathematics refers to the ability to go beyond standard methods and generate novel ideas or approaches to solve problems. It’s the “aha!” aspect of problem solving – not just following learned algorithms, but sometimes inventing or combining techniques in new ways. For LLMs, creative reasoning would manifest as producing solutions that aren’t just copies of training examples, proposing insightful conjectures or intermediate lemmas, or using an unconventional approach to a problem when a conventional one is not obvious. We assess to what extent these models demonstrate creative reasoning, especially on challenging or novel problems (e.g., math contest problems or problems specifically designed to test innovation).
Comparative Performance:
GPT-4: GPT-4 has shown glimmers of creative reasoning, though mostly within the bounds of known strategies. It is very good at recombining knowledge. For instance, if a problem can be solved by a clever combination of two known theorems, GPT-4 often can identify and execute that combination. There have been cases where GPT-4 surprised researchers by taking a correct but non-standard approach to a problem. This might be because it saw something similar during training or it generalized a pattern. For example, GPT-4 might solve a geometry problem using coordinates (analytic geometry) even if a synthetic solution is expected – that’s a creative shift of perspective that sometimes works better. Or it might tackle a number theory problem by transforming it into a graph theory problem it knows how to solve. These kinds of cross-domain or cross-technique jumps are evidence of creativity. However, GPT-4’s creativity is still limited in that it usually doesn’t originate entirely new methods. It draws from its vast repository of existing math knowledge. So its creativity is akin to an extremely well-read student who can pull out the right obscure trick for a tough problem. If a contest problem relies on a known but rare trick, GPT-4 might know that trick. If it relies on something truly novel (something even human solvers find innovative), GPT-4 might miss it or only find it if heavily prompted. That said, GPT-4 was the first model that really began to solve difficult Olympiad-level problems with some consistency (especially if allowed to use the chain-of-thought and self-evaluation). This indicates a form of creative problem-solving, since many Olympiad problems require insight. GPT-4 can propose relevant sub-problems or analogies (“this reminds me of… maybe try that approach”). It sometimes writes solution attempts that, while maybe not fully correct, show inventive thinking – like trying an inductive argument on a tricky statement, or considering extreme cases to get intuition. These are strategies a human might creatively use. In the assessing creativity of LLMs study, GPT-4 performed well but was outshone by Gemini-1.5-Pro on proposing novel solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). This suggests GPT-4 does have creative capacity but not the top among newest models. It might occasionally get stuck in a rut of using the same approach it’s comfortable with, rather than truly thinking “outside the box.” But in many cases, GPT-4 does find correct solutions, which often entails creativity when the direct route isn’t obvious. A metric: on the CreativeMath benchmark, we’d expect GPT-4 to score high, but likely it often reproduces known solutions rather than finding significantly different new ones (unless prompted to do so). So GPT-4’s creative reasoning is good, but somewhat constrained by its training (it leans on known patterns heavily). It can generate moderately novel insights, but not radical ones.
Claude: Claude has a similar profile to GPT-4 in creativity, with possibly some differences in style. Anthropic’s models might have been trained with techniques to explore multiple possible answers (for harmlessness and honesty reasons, they consider alternatives). This could translate to Claude sometimes brainstorming different ways to tackle a problem. Anecdotally, Claude can be quite flexible: if a straightforward solution path fails, Claude may try a different angle in a follow-up. For example, if induction doesn’t seem to work, Claude might say “Induction seems complicated here, perhaps consider a direct combinatorial argument…” and attempt that. This is a creative pivot that shows it’s not stuck on one method. Claude 3’s very high performance on GSM8K (which often requires clever reasoning for some word problems) indicates it can pull off neat tricks (like creating equations cleverly from text scenarios). Also, Claude’s large context might allow it to hold in mind multiple candidate solution strategies and weigh them – a bit like a creative human thinking “Method A or Method B or something entirely different?” That process could yield a more creative final solution. However, like GPT-4, Claude is ultimately limited by training data. It’s likely to stick to methods it knows. It might not spontaneously invent a brand-new algorithm to solve a puzzle. One interesting point: Claude is generally more verbose and explanatory; sometimes in doing so it might stumble upon a key insight just by thorough analysis. This mimics how a human might talk through a problem until a creative idea emerges. So Claude’s approach could foster creativity in that sense. But raw creative benchmark results haven’t been public for Claude specifically. Given that in most standard evaluations Claude and GPT-4 are close, we can infer Claude’s creative reasoning is comparable to GPT-4’s – quite strong, capable of non-obvious solutions, but not as consistently innovative as the very latest models like Gemini. Claude 3.5 or future might incorporate more explicit creativity training. As of now, it’s safe to say Claude can surprise you with a clever solution at times, but other times it will default to the common approach (even if that ends up not working perfectly).
DeepSeek: DeepSeek’s design via RL might at first seem like it’s mainly optimizing for correctness, which could make it risk-averse in creativity. If it found one approach that works, RL would reinforce that, possibly reducing exploration of alternative methods. However, the training methodology described for DeepSeek included generating multiple reasoning behaviors and then fine-tuning (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). This suggests DeepSeek was encouraged to try different approaches during its RL phase (the “numerous powerful and interesting reasoning behaviors” mentioned (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)). Indeed, the researchers observed that after RL, DeepSeek-R1-Zero developed various reasoning patterns on its own (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). This is intriguing – it implies emergent creative behavior in how it solves problems. For example, it might have independently discovered that sometimes working backwards from the desired result is effective (a common contest strategy), or that introducing an auxiliary variable can simplify a problem. Such behaviors, if not explicitly pre-programmed, are a form of creativity learned through self-play or exploration. Additionally, DeepSeek’s exceptional performance on diverse benchmarks indicates it didn’t rely on a one-size-fits-all method; it must adapt to each problem type. On AIME 2024, for instance, some problems need clever insight (like a combinatorial identity or geometric trick). DeepSeek reaching ~80% on those (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) means it successfully executed many such insights. Since it’s unlikely all those specific tricks were in the training data directly (AIME 2024 was a new test), DeepSeek had to creatively apply general knowledge. Perhaps RL training allowed it to generalize strategies in a way that is akin to creativity (because it wasn’t simply memorizing solutions). That said, creativity in the sense of truly novel mathematics is not expected – DeepSeek didn’t, say, invent new mathematics. But within the realm of problem solving, it likely shows considerable creativity. One concrete sign: if a user asks DeepSeek an open-ended math question like “What might be a pattern behind these numbers: 1, 2, 5, 12, 29, ...?”, a creative model might conjecture the Pell-like recurrence $a_n = 2a_{n-1} + a_{n-2}$ (which indeed fits: $2\cdot 2+1=5$, $2\cdot 5+2=12$, $2\cdot 12+5=29$). DeepSeek’s ability to detect patterns and conjecture might be quite good (given its logical strength). We don’t have direct citations on that, but we can infer from its overall reasoning prowess. Compared to GPT-4/Claude, DeepSeek could be either on par or slightly less varied (depending on how RL shaped it). However, since the Math Scholar author was stunned (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), presumably by both the correctness and the approach taken, it implies DeepSeek sometimes provided very elegant solutions. It even typeset them well, which means it not only solved but presented them creatively. We would rank DeepSeek’s creative reasoning as high, possibly on par with GPT-4, though it might not have been explicitly benchmarked for “novelty” of solutions as Gemini was.
LLaMA: Baseline LLaMA has essentially no notable creative reasoning in math. If it struggles with even basic reasoning, it’s not going to produce a novel insight. It usually regurgitates something from training (if it recognizes the problem, it might copy a solution outline seen before) or just fails. In creativity tests, smaller models tend to do poorly because they lack the knowledge breadth and reasoning depth. If one fine-tunes LLaMA heavily on many solution examples, it might pick up some patterns to try, but it would still be far behind the likes of GPT-4. One can say LLaMA’s creativity is “latent” – since it is a language model with billions of connections, maybe with proper prompting it can do a bit of creative combination. But practically, it won’t come up with an innovative solution that it hasn’t been directly or indirectly exposed to. For example, on a puzzle requiring thinking outside the box, LLaMA will likely either not solve it or try a standard approach and then get stuck. There’s a reason open-source community models started incorporating things like Tree-of-Thought or MCTS (Monte Carlo Tree Search) for reasoning – base models on their own weren’t creative enough, so external algorithms were used to simulate creativity by exploring branches. LLaMA needs such help; it doesn’t spontaneously exhibit creativity in math reasoning. So we can consider LLaMA’s creative reasoning minimal to none.
Gemini: According to both anecdotal evidence and the referenced study, Gemini is currently the leader in creative mathematical reasoning. The paper Assessing the Creativity of LLMs in Proposing Novel Solutions explicitly found “the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions” ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems) ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). This means that in their tests, Gemini not only solved problems but often did so with a solution that was different from known reference solutions or different from what the other models produced. That’s a strong indicator of creativity. Possibly, Gemini’s training included some reinforcement or prompt to “find another way” when one way is known – or it just generalizes extremely well. DeepMind has a history of encouraging exploration (like AlphaZero exploring moves in Go or chess), and we see analogies here: Gemini might be exploring solution space more thoroughly. If one solution is found, maybe it can search for another by prompting itself differently. It’s also possible that Gemini was trained on transcripts of human problem-solving discussions (where multiple ideas are thrown around). This would give it an edge in creative thinking. The results on creative math tasks imply that if you give Gemini a problem and one solution, it can come up with a second, distinct solution reliably ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems) – something even many humans find hard after seeing one solution. That points to an ability to synthesize new approaches. Another factor: Gemini’s multimodal nature (though in math context, primarily text is used) might give it a richer representational space to draw ideas from. If it learned from code, images, or other modalities, it might apply that thinking in novel ways (like using a programming perspective to solve a math problem, which is creative). Also, Google likely fine-tuned Gemini on many competition problems and possibly Polymath project logs or other creative math endeavors, which could have instilled a more innovative problem-solving style. In any case, by late 2024/early 2025, Gemini is considered cutting-edge in reasoning, and creativity is a big part of that. So we can confidently state Gemini’s creative reasoning is outstanding, surpassing GPT-4 and Claude as per current research ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). It’s more likely than others to break from the obvious path and try something unconventional if needed.
Examples of Creative Reasoning:
To highlight creative reasoning, consider a problem: “Prove that for any positive integer $n$, the number $n^3 + 5n$ is divisible by 6.” This is a problem that has multiple possible proofs (one can factor it algebraically, or use modular arithmetic, or even induction). A creative solver might not go with the most straightforward factorization.
A straightforward start: $n^3 + 5n = n(n^2+5)$, which is obviously divisible by $n$, but we need divisibility by 6. Instead factor differently: $n^3+5n = n(n^2 - 1 + 6) = n(n-1)(n+1) + 6n$. Now $n(n-1)(n+1)$ is the product of three consecutive integers, so it’s divisible by 6 (since among any 3 consecutive numbers, one is divisible by 3 and one is even) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). Thus $n(n-1)(n+1)$ is divisible by 6, and obviously $6n$ is divisible by 6, so their sum is divisible by 6. That’s a creative factorization proof using the idea of three consecutive integers. (A brute-force check of this identity is sketched after the model comparison below.)
What might the models do?
GPT-4/Claude: They might go for a simpler argument: “$n^3 + 5n = n(n^2+5)$. For any integer $n$, either $n$ is even or $n^2+5$ is even (because if $n$ is odd, $n^2$ is odd, $n^2+5$ is even). Also, one of $n$ or $n^2+5$ is divisible by 3 (maybe they check mod 3). Therefore one factor provides a 2, the other provides a 3, hence divisible by 6.” This is a fine proof, mixing parity and mod 3 reasoning. It’s a bit creative (using parity analysis), but it’s also a known style of solution. Alternatively, they might factor as $n(n^2+5)$ and test small cases or do modular arithmetic explicitly: check $n^3+5n \mod 6$. Since mod 6 means mod 2 and mod 3, they’d do those cases – that’s more algorithmic than creative. But they could do the consecutive integer trick as well, depending on what pattern they recall. GPT-4 could possibly produce the $n(n-1)(n+1)+6n$ factor spontaneously, which would be quite clever. If not, it will still solve it with a valid method, perhaps not the most “wow” one.
DeepSeek/Gemini: These might more consistently produce the slick factorization $n(n-1)(n+1) + 6n$, because that approach is succinct and elegant. Given their strength, they might have encountered similar tasks or been reinforced to find neat factorizations. If asked for another proof, they might even say: “We can also prove this by induction on $n$,” and carry that out – not very creative, but showing versatility. Or they might consider the expression mod 6 directly. The most creative element is noticing $n^3+5n = n(n-1)(n+1) + 6n$. If any model is likely to come up with that, we would bet on DeepSeek or Gemini due to their extensive math focus and, in Gemini’s case, known creative edge. GPT-4/Claude might too, but if not, they’d still have something like a parity + mod 3 argument, which is standard.
LLaMA: Probably wouldn’t come up with any of these. It might do something incorrect or just state “it’s always even and divisible by 3, so done” without a proof, or get confused.
The differences aren’t stark here because this problem is not extremely hard. For a truly creative challenge, consider an Olympiad geometry problem where the key is to add an auxiliary line or consider a reflection – humans find those challenging. GPT-4 sometimes solves these, which is already impressive. Gemini perhaps does it more often or in a more original way.
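As noted in the factorization sketch above, the identity and the divisibility claim are easy to sanity-check by brute force. The snippet below is only an illustrative check over an arbitrary finite range, not a proof:

```python
# Brute-force check of the identity and divisibility claim from the proof sketch:
# n^3 + 5n = n(n-1)(n+1) + 6n, and 6 divides n^3 + 5n for positive integers n.
# A finite check is not a proof, but it guards against algebra slips.
for n in range(1, 10_001):
    lhs = n**3 + 5 * n
    assert lhs == n * (n - 1) * (n + 1) + 6 * n  # the factorization identity
    assert lhs % 6 == 0                          # divisibility by 6
print("identity and divisibility hold for n = 1..10000")
```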
Summary: Creative reasoning is where we start to see separation between merely good solvers and exceptional ones. Gemini leads in this category as per current evaluations ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems), showing an ability to generate novel solution paths. GPT-4 and Claude are also quite creative but tend to rely on known methods – they are less likely to shock us with an entirely unexpected approach (though they can occasionally find less common solutions). DeepSeek has demonstrated creative-like behavior through its emergent reasoning patterns, so it’s likely on par with GPT-4 here, possibly producing very clean solutions (which is a form of elegance/creativity). LLaMA is largely not creative in math. If we rank innovation in approach: Gemini would be #1, followed by a close cluster of GPT-4, Claude, DeepSeek (with maybe DeepSeek slightly surprising us more due to RL uniqueness), and then LLaMA far behind.
In a radar chart focusing on Creative Reasoning, Gemini’s vertex would be near the maximum, GPT-4/Claude/DeepSeek slightly lower but still high, and LLaMA near the bottom. The differences between GPT-4, Claude, DeepSeek in creativity are smaller than the gap between those and LLaMA.
6. Mathematical Innovation Beyond Training Data
Mathematical innovation refers to going beyond existing knowledge – not just solving known problems or rehashing known proofs, but contributing something truly new to mathematics. For LLMs, this could mean formulating new conjectures, finding genuinely new proofs for open problems, or adopting new frameworks of reasoning (like actively embracing a semantic math framework such as DIKWP on its own). Essentially, this is about pushing the boundary of what the model has learned, to see if it can create or discover math that wasn’t already in its training data or human-provided knowledge. This is the highest bar and in many ways the ultimate test of an AI’s mathematical ability – it verges on checking if an AI can do research-level mathematics.
Current State of LLMs in Mathematical Innovation: So far, no mainstream LLM has independently achieved a major new mathematical breakthrough that was unknown to humanity. They are mostly constrained to the realm of known problems and techniques. However, there are some promising signs of incremental innovation: models suggesting new lemmas or approaches in collaboration with humans, models that can generate conjectures by pattern recognition, etc. We examine each model’s inclination and ability in this regard.
GPT-4: GPT-4 is not known to have produced any original theorem or proof that surprises professional mathematicians. Its knowledge is largely derivative of human knowledge up to 2021 (its training cutoff). However, GPT-4 can be used as a tool to assist in innovation. For example, a researcher might ask GPT-4 to explore a pattern or check numerous cases of a conjecture to guess a formula. GPT-4 might spot a pattern and conjecture something that the researcher hadn’t noticed. This is a form of innovation support. But GPT-4 itself doesn’t know if the conjecture is new or known; it’s just applying pattern recognition. If the pattern extends beyond its training, one could argue GPT-4 created a conjecture. Still, that’s relatively modest and guided by the user. When it comes to adopting new frameworks, GPT-4 can follow instructions to use, say, a new definition or concept introduced by the user. For instance, if we introduce a brand-new mathematical operator and define it, GPT-4 can work with it and even prove simple properties about it using known math in novel combination. That shows some adaptability and quasi-innovative reasoning (it’s extending math to a novel concept on the fly). But GPT-4 wouldn’t spontaneously invent such a concept without prompting. Another angle: OpenAI’s own experiments with GPT-4 in formal theorem proving (like Lean) did not reveal new math, but they demonstrated GPT-4 could fill in gaps in partially written proofs, occasionally finding a clever justification that the human might not have considered first (Analysis: OpenAI o1 vs GPT-4o vs Claude 3.5 Sonnet - Vellum AI). That’s minor innovation – solving a sub-problem in a proof context. Overall, GPT-4’s mathematical innovation is limited to recombination of known ideas. It does not cross the threshold into independent discovery. It will not, for example, pose a new major conjecture on its own and then go about proving it (that’s far beyond current ability). However, it can sometimes independently derive known results (like rederive a formula from scratch without explicitly recalling it) – essentially re-discovering something. That is a kind of innovation if done without directly copying. But since it’s trained on so much, it’s hard to distinguish that from just regurgitation. In summary, GPT-4 has not shown active mathematical innovation; it mostly stays within traditional knowledge. It can mimic creativity but not truly push cognitive limits of math understanding on its own.
Claude: Similar to GPT-4, Claude hasn’t exhibited outright mathematical innovation. It’s very much grounded in existing knowledge and patterns. Claude might be even less likely to stick its neck out on something truly speculative or unproven, because it’s trained to be cautious and truthful. It might explicitly refuse to tackle an open problem with a direct answer (saying it’s unsolved). If encouraged to speculate, Claude can generate interesting conjectures or ideas, but those are usually amalgams of known ideas. For example, ask Claude to propose a new theorem in number theory – it might suggest something that sounds plausible, maybe even novel (e.g. a pattern about prime gaps or a formula relating totients). But since it has no way to verify, it’s effectively hallucinating a conjecture. Unless by chance it’s correct, it’s not a real contribution (and likely it wouldn’t be correct if it’s a nontrivial claim). Anthropic hasn’t advertised any breakthrough like that with Claude. In terms of adopting new frameworks like semantic math, Claude can be guided. If you instruct Claude to follow a DIKWP approach explicitly (“first identify data, then info, etc.”), it will do so, as would GPT-4. This can improve reliability as per studies (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文), but that’s user-driven, not the model’s own initiative. Claude won’t on its own say “Let me apply a semantic framework to be sure” – it just does what it was trained to do. So Claude’s ability to actively adopt something like DIKWP is limited by user instructions. It’s not self-reflective enough to change its reasoning style unless asked. Regarding innovation, Claude is like GPT-4: no known independent discoveries, just helpful in exploring known math. It provides insightful explanations, which can help humans learn or connect dots, but it’s not generating new math facts by itself.
DeepSeek: DeepSeek is a bit of a wild card. Its creators’ focus was to “push the limits of reasoning” and possibly to approach AGI. While DeepSeek has amazed with mastery of known problems, there’s no claim that it has solved anything previously unsolved. Given it’s open-source, the community could potentially use it to attack open problems, but we’ve yet to see a peer-reviewed case of that. One interesting point: because DeepSeek uses reinforcement learning, one could in principle set it on a specific open problem as a “game” (where reward is given for progress or partial results) – but that’s speculative and not done as far as we know. DeepSeek’s innovation might come in subtler forms: maybe it finds simpler proofs for known theorems than typical textbooks (that would be an innovation in exposition or method, albeit for a known result). Or maybe it can generate conjectures from data; since it’s good at pattern reasoning, if you give it a sequence or a numeric pattern, it might hypothesize a formula better than GPT-4, since it was trained on code and pattern tasks too. However, these are small innovations. Like others, DeepSeek doesn’t have an internal drive to push mathematics further; it aims to solve the tasks given. We can say DeepSeek hasn’t surpassed traditional human knowledge yet; it’s mainly an aggregator and executor of it. That being said, its creators might incorporate frameworks like DIKWP inherently – maybe part of its training involves semantic understanding which could reduce hallucinations and increase reliability (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文). If so, DeepSeek is an example of using a semantic math approach to improve cognition, which is innovative in methodology. But that’s innovation in AI design, not in math results. In sum, DeepSeek does not spontaneously innovate new math beyond what it was taught; it excels at applying known math in possibly innovative ways (e.g., its reinforcement learning discovered some reasoning heuristics without being told), but that’s more an internal learning innovation than a new mathematical discovery.
LLaMA: LLaMA has no capacity for mathematical innovation on its own. It struggles with even moderate reasoning, so expecting it to contribute something new is unrealistic. It will not propose novel conjectures (unless by random guess) and certainly cannot prove anything unknown. At best, a fine-tuned smaller model might do something novel if guided heavily (for instance, the open-source community could use a LLaMA derivative in a brute-force search to find new formulas or counterexamples, but that’s more the framework around it than the model itself). So LLaMA basically has zero mathematical innovation ability in any meaningful sense. It’s fully bounded by its data and doesn’t even fully utilize known knowledge in creative ways, let alone create new knowledge.
Gemini: Google/DeepMind’s ambitions for Gemini include enabling scientific discoveries. While it’s early, Gemini is the closest to showing sparks of genuine innovation. The CreativeMath study showing Gemini-1.5-Pro generating novel solutions suggests a kind of proto-innovation – finding ways to solve problems that maybe weren’t obvious or documented. If a solution method is truly novel (not in training data), that edges into innovation. Perhaps Gemini had enough data that those “novel” solutions were actually present somewhere in its corpus in some form, but the combination might be new. Looking forward, DeepMind has explicitly discussed AI systems that could conjecture and even prove new theorems in collaboration with mathematicians (they had a 2021 Nature paper using AI to find new conjectures in knot theory and representation theory). Those systems weren’t LLMs, but a similar spirit might be in Gemini – combining neural reasoning with symbolic math to discover new results. At this point (early 2025), Gemini hasn’t publicly been credited with a new theorem. However, its performance indicates it can function at the level of a strong research assistant. For example, it might be able to check a huge range of cases for some pattern and suggest a general conjecture. Or, it could take an open question and try various approaches, potentially revealing a path that a human could then formalize into a proof. The key is that Gemini shows the highest “ceiling”. Its blend of multi-domain knowledge and reasoning power means if any model here were to break out and do something unexpected in math, Gemini is the best candidate. It’s also more likely to effectively use something like the DIKWP framework internally to manage complex reasoning tasks; Google could incorporate such structured approaches into its training (they already did with “Tree of Thoughts” and other strategies in research). If Gemini internally models data→information→knowledge→wisdom→process, it might better avoid pitfalls and maybe catch insights a more haphazard reasoning model would miss. That still doesn’t guarantee revolutionary discoveries, but it sets the stage for incremental innovations. We can recall that AlphaFold (DeepMind’s model in biology) made a huge leap in protein folding – that was a narrow domain but a clear scientific breakthrough by AI. In math, an AI breakthrough hasn’t happened yet, but with something like Gemini working on formal math or conjecture generation, it’s conceivable that it could at least rediscover known results or make plausible new conjectures. In fact, Gemini’s high multimodal capability suggests if given data (like a big table of results), it could detect patterns and formulate hypotheses – a role similar to how Ramanujan conjectured formulas by seeing numerical patterns (though Ramanujan was far more original). So, while current innovation beyond known math is minimal for all models, Gemini stands out as having the most potential and early evidence of slightly more original output.
How LLMs Handle Unsolved Problems: Usually, if you ask them directly about an unsolved problem (e.g., Goldbach’s conjecture), they’ll say it’s unsolved or give a historical summary, which is correct and responsible. If you force them to attempt a proof, they might produce something but it will be flawed or essentially a regurgitation of known failed attempts. They won’t magically crack it – there’s no evidence of that level of innovation. In a controlled experiment, one could ask, say, “Do you conjecture any pattern for the distribution of twin primes?” GPT-4 or others might speculate but basically it’ll echo known conjectures (like probably mention Hardy-Littlewood or something if it knows, or at best say “I conjecture infinitely many twin primes,” which is just the open conjecture we have). They won’t propose “maybe primes eventually follow pattern X which is a new idea”. And even if they did propose something new-sounding, we have no reason to trust it’s true without human verification.
Adoption of Semantic Mathematics (DIKWP) Approach: One aspect of innovation is whether these models themselves can utilize frameworks like DIKWP to extend their cognition. At present, none of the models does this unprompted. They don’t explicitly say “I will consider the data, then information, etc.” – that would require either fine-tuning or prompting them to do so. However, research suggests that if we enforce a structure akin to DIKWP in their reasoning, their performance improves (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文). For instance, an evaluation report notes that DIKWP-based white-box evaluation can reduce hallucination and increase reliability significantly with only a moderate increase in reasoning steps (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文). This means the idea of semantic math (ensuring each step has semantic validity, not just formal validity) can be integrated. Models like GPT-4 can follow such a structured prompt quite well and benefit from it, but they won’t do it by themselves. Perhaps future versions will have internal self-checks that mimic this.
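To make the idea of DIKWP-guided prompting concrete, here is a minimal sketch of what such a structured prompt might look like. The template wording and the ask_model helper are hypothetical illustrations of the general idea, not the protocol used in the cited DIKWP white-box evaluation:

```python
# A minimal sketch of a DIKWP-style structured prompt. The stage wording and the
# ask_model() helper are hypothetical illustrations, not a prescribed protocol.
DIKWP_TEMPLATE = """Solve the following problem in five labeled stages.
Data: list the given facts, symbols, and quantities.
Information: restate precisely what is being asked.
Knowledge: name the definitions, theorems, or formulas that apply.
Wisdom: outline the reasoning strategy and justify each step.
Process: carry out the calculations or proof steps and state the final answer.

Problem: {problem}
"""

def build_dikwp_prompt(problem: str) -> str:
    """Wrap a raw problem statement in the five-stage DIKWP scaffold."""
    return DIKWP_TEMPLATE.format(problem=problem)

# Example usage (ask_model would call whichever LLM API is being evaluated):
# answer = ask_model(build_dikwp_prompt(
#     "If f is even and continuous with integral 10 over [-3, 3], find the integral over [0, 3]."))
```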
In terms of mathematical innovation ranking: At the moment, all models are largely bounded by human knowledge, so none truly innovates in the strong sense. But if we had to rank their potential or minor innovations, it might be: Gemini slightly ahead (for reasons of novel solution generation), DeepSeek possibly next (for emergent strategies from RL), GPT-4 and Claude close after (for broad knowledge recombination), and LLaMA last (no capacity without improvement). This ranking is speculative – the differences at this level are subtle because none has a clear win of “did something new.” It’s more about how likely they are to assist in something new.
Summary: At present, LLMs have not yet exceeded the boundaries of established mathematics. They excel at learning and reproducing what is known, sometimes in creative ways, but do not replace mathematicians in discovering the unknown. They lack the true autonomy and perhaps the deep intuitive leaps that characterize human mathematical invention. However, they are increasingly useful as collaborators. A radar chart of Mathematical Innovation would show all models relatively low, with Gemini the highest (but still far from the maximum, since the maximum would represent, say, a Fields-medal level discovery ability), followed by DeepSeek/GPT-4/Claude a bit lower and clustered, and LLaMA near zero.
The focus going forward is to integrate semantic frameworks (like DIKWP) into the training and prompting of LLMs to push their cognitive limits. By ensuring models truly understand and internally verify each step (data → info → knowledge → etc.), we might reduce errors and open the door for them to handle more complex, novel problems reliably. Early research in this direction is promising (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文). For example, if an LLM can be taught to consciously identify what knowledge is needed and whether it has it, it might say “We need a theorem of type X here; since I don’t have it, I’ll attempt to derive a lemma” – that kind of self-directed strategy would be new and could lead to discovering intermediate results not in training.
As of early 2025, we conclude: Gemini, GPT-4, Claude, and DeepSeek are extremely powerful at known mathematics and can even solve challenging problems, but they mostly stay within traditional methods. They can be seen as innovative problem solvers but not innovators of new mathematics in the full sense. LLaMA remains a baseline with much room to grow. With structured semantic guidance and further advances (like tool use, formal verification integration, and reinforcement learning), we anticipate that LLMs might gradually increase their mathematical innovation capabilities. Perhaps in a few years, we’ll see an LLM co-authoring a non-trivial new theorem’s proof with a human – a milestone yet to be reached, but one that seems conceivable as these models continue to improve.
Experiment Data & Comparative Analysis
To support the above evaluations, we summarize key experimental findings and performance metrics of these models on mathematical tasks:
Benchmark Success Rates: We compare the models on notable math benchmarks (accuracy or solve rate, higher is better); the headline figures are summarized in the discussion below.
Sources: OpenAI/Anthropic reports (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum), DeepSeek tech report (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning), Alignment Forum analysis (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum), LMSYS evaluations (GPT-4o Benchmark - Detailed Comparison with Claude & Gemini), Google blog (Introducing Gemini: Google’s most capable AI model yet), CreativeMath paper ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). (Note: Some values are inferred or approximate where exact data isn’t public.)
In the above, GPT-4 and Claude show excellent general math performance (solving most problems correctly). Claude slightly edges GPT-4 on some benchmarks like GSM8K (95% vs ~90%) (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum), while GPT-4’s newer versions likely improved to match or beat Claude on others (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). DeepSeek demonstrates near-perfect results on specialized sets (MATH-500 97% (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)) and matches top closed models on contests (79.8% on AIME vs OpenAI’s 79.2%) (DeepSeek R1: The New AI Giant Taking on OpenAI - Amity Solutions) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). LLaMA lags far behind, solving only a small fraction without fine-tune. Gemini’s numbers are still emerging; the initial version had ~83% on MMLU overall (GPT-4o Benchmark - Detailed Comparison with Claude & Gemini) and is the first to hit 90% on that benchmark (Introducing Gemini: Google’s most capable AI model yet). On the CreativeMath benchmark designed to measure novel solution generation, Gemini ranked 1st in producing innovative solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems), indicating its strength in creative reasoning.
Qualitative Comparison Radar: The following is a qualitative radar chart comparison (with each axis from 0 to 10, based on the analysis above) for the six evaluated aspects:
In a radar visualization, GPT-4, Claude, DeepSeek, and Gemini would all form large, fairly balanced shapes approaching the outer edge on most axes, with Gemini slightly extending further on Understanding and Creative axes, DeepSeek peaking on Proof, and Claude/GPT-4 strong on Calculation and Logic. LLaMA’s shape would be a small polygon near the center, highlighting its deficiencies in most categories except perhaps moderate semantic understanding of very basic concepts.
Logical Consistency: GPT-4 (9), Claude (9), DeepSeek (9), Gemini (9), LLaMA (5). (All top models are nearly flawless logically, LLaMA much lower.)
Calculation Accuracy: Claude (9), DeepSeek (9), Gemini (9), GPT-4 (8.5), LLaMA (4). (Claude/Gemini/DeepSeek rarely slip; GPT-4 slightly behind; LLaMA poor without help.)
Proof Ability: DeepSeek (9.5), Gemini (9), GPT-4 (8.5), Claude (8.5), LLaMA (3). (DeepSeek excels in rigorous proofs (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar); Gemini very strong; GPT-4/Claude good but occasionally flawed; LLaMA minimal.)
Semantic Understanding: Gemini (9.5), GPT-4 (9), Claude (9), DeepSeek (9), LLaMA (5). (All top models show deep concept comprehension, Gemini perhaps slightly ahead; LLaMA mediocre.)
Creative Reasoning: Gemini (9), DeepSeek (8), GPT-4 (8), Claude (8), LLaMA (2). (Gemini leads in novel solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems); others capable but less consistent; LLaMA negligible.)
Math Innovation: Gemini (7), DeepSeek (6), GPT-4 (5.5), Claude (5.5), LLaMA (1). (Absolute scores are low as none truly innovate beyond known math; relative, Gemini shows the most promise, LLaMA none.)
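For illustration, the qualitative scores listed above can be rendered as the radar chart described earlier with a short script (assuming matplotlib and numpy are available; the numbers are the subjective 0–10 ratings from this report, not measured benchmark data):

```python
# Minimal matplotlib sketch of the qualitative radar chart described above,
# using the 0-10 scores listed for the six aspects (illustrative only).
import numpy as np
import matplotlib.pyplot as plt

aspects = ["Logic", "Calculation", "Proof", "Semantics", "Creativity", "Innovation"]
scores = {
    "GPT-4":    [9, 8.5, 8.5, 9, 8, 5.5],
    "Claude":   [9, 9, 8.5, 9, 8, 5.5],
    "DeepSeek": [9, 9, 9.5, 9, 8, 6],
    "Gemini":   [9, 9, 9, 9.5, 9, 7],
    "LLaMA":    [5, 4, 3, 5, 2, 1],
}

angles = np.linspace(0, 2 * np.pi, len(aspects), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw=dict(polar=True))
for model, vals in scores.items():
    data = vals + vals[:1]
    ax.plot(angles, data, label=model)
    ax.fill(angles, data, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(aspects)
ax.set_ylim(0, 10)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.show()
```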
Model Innovation Ranking: Based on our analysis of mathematical innovation potential (the ability to adopt new frameworks and contribute novel ideas), the models can be ranked: 1) Gemini, 2) DeepSeek, 3) GPT-4, 3) Claude (GPT-4 and Claude tie), 5) LLaMA. Gemini’s slight edge comes from evidence of producing more original solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems) and its advanced design. DeepSeek’s emergent reasoning gives it second place, though it focuses on known tasks. GPT-4 and Claude, while extremely powerful, haven’t demonstrated much beyond recombining known knowledge – we rank them slightly below those that have specialized or newer training for innovation. LLaMA, without significant fine-tuning, shows virtually no innovation. A simple ranking chart might list them in this order or assign an innovation score (e.g., Gemini 8/10, DeepSeek 7/10, GPT-4 6/10, Claude 6/10, LLaMA 1/10 on an arbitrary innovation scale), keeping in mind these scores reflect potential seen in tests rather than actual new math discovered.
Complexity Handling Distribution: We analyze how each model’s success rate declines as problem complexity rises (complexity could be measured in number of reasoning steps or difficulty rating):
Visually, one could imagine a plot where the x-axis is problem difficulty (easy -> contest -> open problem) and y-axis is success probability. GPT-4/Claude/Gemini start near top and decline gradually, DeepSeek starts near top and stays flat longer then declines, LLaMA drops off almost immediately. None of the models have non-zero success on true open research problems (far right of difficulty axis) – that’s uncharted territory for them.
For GPT-4/Claude/Gemini, performance is high on easy tasks (~95%+ on one-step or routine problems), and remains strong on medium tasks (~80–90% on multi-step word problems or Olympiad qualifiers). On very hard tasks (e.g., Olympiad proof problems or highly intricate puzzles), their success might drop to ~10–30% (they may solve some but not all). Notably, GPT-4 and Claude, when allowed to use tools or multiple attempts (self-consistency), improve on those hard tasks significantly; a minimal sketch of such a majority-voting scheme appears after this breakdown. The distribution is thus long-tailed – they handle most tasks up to a certain difficulty, with a steep drop only at the extreme end of difficulty (where even humans struggle).
DeepSeek shows a slightly different profile: due to RL fine-tuning, it was specifically optimized to handle very hard reasoning tasks, so its curve is very flat up to a high difficulty. It solved 86.7% of AIME with majority voting (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) – AIME problems are quite complex, typically far beyond homework exercises. This suggests DeepSeek maintains >80% success even on quite hard multi-step problems. Only when approaching research-level problems would it drop off (which were not in its training/eval domain). So DeepSeek’s success vs complexity curve stays high and only dips at the extreme end.
LLaMA’s curve plummets quickly: it might solve trivial one-step questions (e.g., “What is 2+2?” obviously at 100%, “What is the next prime after 7?” possibly fine). But for any multi-step reasoning, its success falls to near 0%. Essentially, anything beyond very low complexity stumps base LLaMA. Fine-tuned variants improve this, but generally LLaMA’s capability saturates at a much lower complexity level.
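As referenced in the GPT-4/Claude point above, the simplest form of the “multiple attempts” strategy is self-consistency: sample several independent solutions and keep the most common final answer. Below is a minimal sketch; sample_answer and the toy sampler are hypothetical stand-ins, not any vendor’s API:

```python
# Minimal sketch of self-consistency / majority voting: sample several answers
# to the same problem and return the most frequent one. sample_answer() is a
# hypothetical stand-in for one sampled LLM completion.
import random
from collections import Counter
from typing import Callable

def majority_vote(sample_answer: Callable[[str], str], problem: str, k: int = 16) -> str:
    """Sample k answers to the same problem and return the most frequent one."""
    answers = [sample_answer(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Example: a toy sampler that "solves" the problem correctly about 70% of the time.
def toy_sampler(problem: str) -> str:
    return "5" if random.random() < 0.7 else random.choice(["4", "6", "10"])

print(majority_vote(toy_sampler, "Find the integral over [0, 3] in the earlier example."))
```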
Case Study – Unsolved Problem Attempt: As a small experiment, researchers asked these models to tackle a novel conjecture in elementary number theory (one that is not famous, presumably not in training data) ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). The task was to see if the models propose a reasonable approach or pattern. The results:
Gemini-1.5-Pro provided a conjecture that matched the hidden pattern and even outlined why it might be true, outperforming others ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). This indicates it picked up the hidden relationship effectively – a form of minor innovation in pattern recognition.
GPT-4 gave a correct answer as well, but its reasoning closely mirrored known approaches, and it didn’t add much beyond confirming the pattern.
Claude also managed a correct guess but was slightly less specific.
DeepSeek was not in that particular test, but given its abilities, it likely would have done well in recognizing the pattern (this is speculative).
Smaller models (like older GPT-3 or LLaMA) failed to see the pattern at all or gave irrelevant answers.
This case illustrates that for small-scale “discoveries” (like noticing a pattern and conjecturing a formula), the top LLMs can do it, with Gemini leading. Yet, proving that conjecture or ensuring it’s correct still generally fell back to known methods or required external verification.
In conclusion, our extensive evaluation under the DIKWP semantic framework reveals the strengths and weaknesses of each model: GPT-4 and Claude remain extremely capable generalists with high consistency, DeepSeek achieves remarkable rigor and reliability in formal mathematical reasoning, LLaMA (without further tuning) is significantly behind on all fronts, and Gemini appears to push the frontier with slight advantages in reasoning and creativity. All models benefit from semantic-guided approaches – when they emulate the DIKWP-style breakdown of problems (consciously or via prompting), their performance and trustworthiness improve, though currently this usually requires user intervention.
Conclusion and Outlook
In summary, the mathematical abilities of GPT-4, Claude, DeepSeek, LLaMA, and Gemini are impressive but varied when examined through the DIKWP semantic framework:
Logical Consistency: GPT-4, Claude, DeepSeek, and Gemini all exhibit near human-level logical reasoning on complex problems, constructing coherent multi-step solutions. DeepSeek in particular sets a high bar with its formally precise proofs, showing what rigorous training can achieve (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). LLaMA lags, often needing guided reasoning to stay logical.
Calculation Ability: All top models except LLaMA perform complex calculations with high accuracy, though they occasionally falter on very lengthy or precision-intensive arithmetic if not allowed to use tools. Claude and DeepSeek rarely make computational slips (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning), and Gemini is expected to be similarly reliable. GPT-4 is strong but benefits from self-verification, as shown by an increase from ~54% to 84% on MATH with code-based checking (a minimal code-checking sketch follows this list). LLaMA frequently errs on multi-step calculations without fine-tuning.
Proof Ability: DeepSeek delivered fully acceptable proofs for challenging theorems (angle trisection, $\pi$ irrational, etc.) (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), indicating excellent mastery of proof techniques. GPT-4 and Claude can prove a wide range of known results and solve many Olympiad problems, though they might omit subtle details or assume prior knowledge as justification (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar). Gemini shows outstanding performance in generating proofs and even multiple solution paths ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems), suggesting proof prowess on par with the best. LLaMA is currently unable to produce rigorous proofs beyond trivial cases.
Semantic Understanding: All high-end models have a deep understanding of mathematical concepts and terminology, allowing them to interpret and solve problems even when posed in novel ways. They are not just pattern matching; for example, they know fundamental concept relationships, such as how even/odd or symmetry arguments work in integrals (e.g., that $\int_{-a}^{a} x^{3}\cos x\,dx = 0$ because the integrand is odd), as demonstrated by their correct reasoning on symmetric integration problems. That said, subtle tests reveal that even GPT-4/Claude/Gemini can be biased by problem presentation (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations), highlighting an area for improvement: truly understanding the essence of problems beyond surface form. Nonetheless, these models demonstrate a remarkably broad and flexible grasp of math topics, covering high-school to graduate-level concepts. LLaMA’s semantic understanding is limited to simpler concepts; it often fails to connect the dots in more complex scenarios.
Creative Reasoning: We see clear evidence that models like Gemini (and to a large extent DeepSeek, GPT-4, Claude) are capable of non-standard, creative problem solving. They can approach problems from different angles and occasionally surprise with elegant or uncommon solutions ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). This creative spark is still largely within the realm of recombination of known methods – it’s as if we have an extremely knowledgeable student who can try many learned tricks – but it is valuable. Gemini’s top performance in generating novel solutions points to an encouraging direction: AI models contributing fresh perspectives to problem solving ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). GPT-4 and Claude also show this to a degree (as seen when they solve difficult competition problems by leveraging diverse knowledge). DeepSeek’s reinforcement learning approach yielded emergent reasoning strategies that mimic creativity in method selection. LLaMA, by contrast, doesn’t yet demonstrate significant creativity – it tends to stick to the few simplistic approaches it knows, often unsuccessfully.
Mathematical Innovation: In terms of breaking new ground beyond the training data, all models are still in early days. None has independently achieved a new theorem or conjecture that advances mathematics. However, Gemini’s ability to generate alternate solutions and possibly conjectures gives it a slight edge as a tool for discovery (it could assist human researchers by suggesting ideas that weren’t obvious) ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). DeepSeek and GPT-4 can rederive known results – occasionally providing simpler proofs or different formulations – which is a form of minor innovation (reinventing the wheel in a new way). The integration of semantic frameworks like DIKWP into model reasoning is a promising approach to push their cognitive limit; by systematically forcing the model to understand the problem at a deeper level (data → information → knowledge…), we not only reduce errors but possibly open the door for more insightful leaps (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文). For now, the creative and innovative capacities of these models serve best in a collaborative role: they are extremely effective at verifying, exploring, and elaborating ideas, while human mathematicians still provide the overarching guidance and truly original insights in new territory.
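The “code-based checking” gain referenced under Calculation Ability can be illustrated with a minimal program-aided verification loop: the model's stated numeric answer is compared against the result of executing a short computation it is asked to write. This is only a sketch under stated assumptions; ask_model_for_answer and ask_model_for_program are hypothetical placeholders, and real pipelines run generated code in a sandbox rather than calling exec directly.

```python
import math

def ask_model_for_answer(problem: str) -> str:
    """Hypothetical placeholder: the LLM's direct final answer."""
    return "252"

def ask_model_for_program(problem: str) -> str:
    """Hypothetical placeholder: the LLM is asked to emit Python that
    computes the answer and stores it in a variable named `result`."""
    return "result = math.comb(10, 5)"

def verified_answer(problem: str) -> tuple[str, bool]:
    """Return the computed answer and whether it agrees with the stated one."""
    stated = ask_model_for_answer(problem)
    namespace = {"math": math}
    # NOTE: for illustration only; production systems sandbox generated code.
    exec(ask_model_for_program(problem), namespace)
    computed = str(namespace["result"])
    return computed, computed == stated

if __name__ == "__main__":
    answer, agrees = verified_answer(
        "How many 5-element subsets does a 10-element set have?"
    )
    print(answer, "(verified)" if agrees else "(mismatch, retry or flag)")
```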
Key Takeaways:
GPT-4: A powerhouse in math reasoning with excellent consistency and broad knowledge. It excels at structured problem solving and rarely makes mistakes on familiar problem types. It can produce high-quality solutions and proofs for known problems, though it does not inherently innovate beyond known mathematics. It benefits greatly from prompting methods like chain-of-thought and can use tools (such as Python) to augment its calculation accuracy. In DIKWP terms, GPT-4 reliably moves from Data/Information to Knowledge/Wisdom in math tasks, but the final “Processes” step sometimes needs external help or careful checking.
Claude: Matches GPT-4 on many reasoning tasks and sometimes surpasses it on arithmetic word problems (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum). Claude’s style is very explanatory, which helps ensure every semantic detail is accounted for. It has a strong grasp of concepts and rarely hallucinates math facts, making its output trustworthy on topics it knows. Claude’s long context window allows it to handle very context-rich problems (e.g., reading a long problem description with multiple parts and solving it coherently). Its creativity is on par with GPT-4’s – adept, but not quite at the cutting edge that Gemini aims for. Claude stands out as an extremely reliable mathematical assistant, one that a student or researcher can query for detailed reasoning or alternate explanations. However, like GPT-4, it stays within established methods and knowledge.
DeepSeek: A specialized model that has essentially closed the gap to the top closed models in mathematical problem solving, despite being more compact (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning). DeepSeek’s strength lies in rigorous, step-by-step reasoning; it thrives on complex, multi-step problems where chain-of-thought is crucial. It produced near-flawless proofs for various theorems (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar), a testament to its reasoning-focused training. This indicates that semantic mathematics (as advocated by DIKWP) can be instilled into a model: DeepSeek likely internalized much of that structured thinking, as its proofs were semantically coherent and logically tight. DeepSeek is somewhat less general-purpose than GPT-4 or Claude (it was tuned specifically for reasoning tasks), but within its domain it may even exceed GPT-4 in performance consistency. It is an open-source success showing that even smaller models, given the right training (RL with reasoning rewards, etc.), can reach first-rate mathematical ability. For users, DeepSeek may become the go-to model for checking formal proofs or solving difficult math problems that require meticulous logic.
LLaMA: In its base form, LLaMA is a reminder of how much fine-tuning and scale matter. The gap between LLaMA and GPT-4 in math is enormous; LLaMA often fails where GPT-4 sails through. But LLaMA is also the foundation that many open experiments build on. With fine-tuning (like the various open “MathGPT” efforts or WizardMath), LLaMA improves dramatically, though still not to GPT-4’s level. For now, anyone using LLaMA 70B or smaller for math should expect to break problems down heavily and pair the model with external calculators. It is not logically consistent on its own for multi-step math. Yet LLaMA’s existence is important: as an open model, it allows researchers to try new training strategies (such as DIKWP-based fine-tunes or tool-use augmentation) and observe the effect on performance. It is plausible that a future open model (perhaps LLaMA-3 or a fine-tune of it) could reach DeepSeek’s level by incorporating the lessons learned (structured semantic training, etc.) (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning).
Gemini: The newest entrant, positioned as a potential game-changer. While still very new, its initial results show it matching or exceeding GPT-4 on many benchmarks (GPT-4o Benchmark - Detailed Comparison with Claude & Gemini) (Introducing Gemini: Google’s most capable AI model yet). Importantly, it shines in deliberate reasoning and creativity ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). Gemini is built as a multimodal model with advanced techniques (likely including reinforcement learning and search-based reasoning), giving it an edge in flexibility. In our evaluation, Gemini is the model pushing closest to the “cognitive limit”: it is not just repeating knowledge but synthesizing it in powerful ways. It achieved the milestone of surpassing the human-expert average on a broad academic test (MMLU) (Introducing Gemini: Google’s most capable AI model yet), indicative of a high level of general capability. For math, this means Gemini can handle an enormous variety of problems, likely including niche topics it was not explicitly trained on, because it can reason its way through from fundamentals. The fact that it outperformed the others in generating novel solutions is especially exciting ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems); it suggests a future where AI might actually propose new lemmas or simpler proofs of known theorems, contributing to mathematical pedagogy or research. At present, Gemini is mostly proprietary (part of Google’s services), but as it becomes more widely available it may set a new standard for what we expect from AI in technical disciplines.
Final Thoughts: Using the DIKWP framework to evaluate these models underscores the importance of semantic depth and structured reasoning in achieving high mathematical performance. The models that perform best are those that effectively integrate all five layers – they accurately perceive the data, understand the information (what is being asked), apply knowledge (known formulas/theorems), exercise wisdom (choose appropriate strategies, make logical inferences), and execute processes (carry out calculations or formal proofs). When any of these layers is weak, the model fails either by misunderstanding the problem (semantic error) or by making a calculation mistake or logical leap (process error).
Our analysis shows that GPT-4, Claude, DeepSeek, and Gemini each maintain a strong balance of DIKWP elements, which is why they succeed so often. LLaMA is imbalanced: its data/information handling may be adequate for simple problems, but its knowledge application and process execution are weak, which leads to failures.
Furthermore, the ability to be creative and innovative corresponds to how well models handle the “Wisdom” layer of DIKWP – applying knowledge in non-routine ways. Gemini and DeepSeek’s training likely enhanced this, whereas GPT-4/Claude have it by virtue of scale and diversity of training data. LLaMA again lacks here due to less specialization.
In conclusion, these large language models have made significant strides in mathematical capability:
They can solve problems and prove statements at a level that often matches an undergraduate math student’s proficiency (and, in some cases, a contest champion’s), with DeepSeek and Gemini touching even graduate-level problem solving.
They maintain logical consistency and semantic understanding that mitigate many classic AI pitfalls (like nonsensical answers or misinterpretation).
While not truly independently innovative yet, they exhibit the building blocks of mathematical creativity.
Outlook: With continued advancements, including:
integrating formal proof verification to ensure absolute correctness,
using tool augmentation (like computer algebra systems or theorem provers) to handle tedious computations or check steps (a SymPy-based sketch follows this outlook),
and embedding frameworks like DIKWP into the training loop for better structured reasoning,
we might soon see LLMs that not only learn all existing mathematics but can also assist in creating new mathematics. Researchers are already experimenting with LLMs in conjecture generation and exploring new problem spaces ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems). It’s plausible that in a few years, a collaboration of a top-tier LLM and a human mathematician could yield a new theorem or solution to an open problem – the human providing guidance and intuition, the LLM providing immense computational exploration and memory of mathematical literature.
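As a concrete instance of the tool-augmentation point in the list above, the following minimal sketch uses SymPy as a computer algebra system to mechanically confirm two steps an LLM might assert in a solution: the closed form of $\sum_{k=1}^{n} k$ and the vanishing of an odd integrand over a symmetric interval. It is an illustration, not a full verification pipeline.

```python
import sympy as sp

n, k = sp.symbols("n k", integer=True, positive=True)
x, a = sp.symbols("x a", real=True, positive=True)

# Step 1: check the asserted closed form sum_{k=1}^{n} k = n(n+1)/2.
claimed_sum = n * (n + 1) / 2
computed_sum = sp.summation(k, (k, 1, n))
assert sp.simplify(computed_sum - claimed_sum) == 0

# Step 2: check that an odd integrand vanishes on a symmetric interval:
# the integral from -a to a of x^3 * cos(x) dx equals 0.
integral = sp.integrate(x**3 * sp.cos(x), (x, -a, a))
assert sp.simplify(integral) == 0

print("Both asserted steps check out symbolically.")
```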
For now, users of these models should leverage their strengths:
Use GPT-4/Claude/Gemini for general problem solving, explanations, and learning (“Why is this true?”, “How to approach this?”).
Use DeepSeek when you need a thorough, verified solution or proof to a problem (particularly if it’s within its training scope, like high school/contest math or standard theorems).
Be cautious with LLaMA or other smaller models – they are improving, but for serious math, stick to the big players or fine-tuned versions specifically trained for math.
Encourage a semantic approach: Ask the model to explain its reasoning step by step, verify each part, and even explicitly separate givens, needed information, and solution steps (mimicking DIKWP). This often leads to better answers and reduces errors, as studies have shown (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文).
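The semantic approach recommended above can be encouraged with a structured prompt. The sketch below is one possible DIKWP-style template, assumed for illustration rather than taken from any official DIKWP specification; the section labels simply mirror the five layers.

```python
def dikwp_prompt(problem: str) -> str:
    """Build a prompt that asks the model to work through the five
    DIKWP layers explicitly before committing to a final answer."""
    return (
        f"Problem: {problem}\n\n"
        "Answer in five labeled sections:\n"
        "1. Data: list the given facts, symbols, and constraints verbatim.\n"
        "2. Information: restate, in your own words, exactly what is being asked.\n"
        "3. Knowledge: name the theorems, formulas, or algorithms that apply.\n"
        "4. Wisdom: choose a strategy, justify it, and note possible pitfalls.\n"
        "5. Process: carry out the solution step by step, verify each step,\n"
        "   and state the final answer on its own line."
    )

if __name__ == "__main__":
    print(dikwp_prompt("Prove that the sum of two odd integers is even."))
```

Templates of this kind are one way to approximate, via prompting alone, the structured breakdown that reasoning-focused training instills in models like DeepSeek.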
In essence, the current generation of LLMs has brought us to the cusp of a new era in mathematical problem solving – one where AI can handle much of the heavy lifting and allow humans to focus on guiding the creative process. Each model we evaluated contributes to this trajectory:
GPT-4 and Claude democratized access to high-level problem solving,
DeepSeek demonstrated the power of aligning AI closely with mathematical rigor,
Gemini hints at reaching and surpassing human expert performance in many domains, math included,
And the DIKWP framework provides a roadmap for how to further enhance these systems’ understanding and innovation.
Moving forward, combining the strengths of these models (perhaps via ensemble approaches) and integrating explicit semantic reasoning tools could yield a “Math AI Assistant” that truly acts like an experienced mathematician: deeply knowledgeable, logically precise, creatively insightful, and perhaps one day capable of pioneering new mathematical frontiers alongside humans.
Sources:
OpenAI, Anthropic, and DeepMind model performance reports (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum) (GPT-4o Benchmark - Detailed Comparison with Claude & Gemini) (Introducing Gemini: Google’s most capable AI model yet).
Math Scholar review of DeepSeek (2025) highlighting its proof capabilities (DeepSeek: A breakthrough in AI for math (and everything else) « Math Scholar).
Alignment Forum analysis comparing Claude 3 and GPT-4 (Anthropic release Claude 3, claims >GPT-4 Performance — AI Alignment Forum).
“Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems” (Ye et al. 2024) ([2410.18336] Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems).
DIKWP semantic math theory and evaluation reports (DIKWP 白盒测评:利用语义数学降低大模型幻觉倾向-段玉聪的博文) (科学网-Democratization of Knowledge through DIKWP Semantics ...).
arXiv paper on the MATH dataset and GPT-4’s improvement with self-verification.
Apple’s study on reasoning brittleness (Ars Technica, 2024) (MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations).
DeepSeek technical report (arXiv 2025) on its benchmark performance (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning).