Meta-Analysis of LLM Performance on Philosophical Questions
(2025 Insights)
Yucong Duan (段玉聪)
Director, International Standardization Committee for AI DIKWP Evaluation
Chair, World Conference on Artificial Consciousness
President, World Artificial Consciousness Association
(Contact email: duanyucong@hotmail.com)
Introduction
Recent works by 段玉聪 (Yucong Duan) in 2024 have bridged large language models (LLMs) with deep philosophical problems using the DIKWP framework. In blog posts on ScienceNet and papers on ResearchGate, Duan mapped 12 fundamental philosophical questions to a DIKWP-based artificial consciousness model, proposing that an AI “consciousness system” can be built from a combination of an LLM-based subconscious and a DIKWP-guided conscious layer (段玉聪:从“人工意识系统=潜意识系统(LLM)+意识系统(DIKWP ...). These “哲学12问题” (12 philosophical questions) span classic dilemmas such as the mind-body problem, free will vs. determinism, the nature of truth, ethics, and the meaning of life (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). This report synthesizes Duan’s findings and evaluates how current mainstream LLMs – including DeepSeek, GPT-4, Claude, and LLaMA – perform on these philosophical challenges. We analyze LLMs’ ability to answer such questions, examine their reasoning pathways and consistency through the DIKWP lens, compare model strengths/weaknesses in philosophical reasoning and logical coherence, quantify their performance with modeling and data, and forecast future trends for LLMs in philosophical reasoning under the DIKWP framework.
LLMs and the Twelve Philosophical Questions
The “十二个哲学问题” (Twelve Philosophical Questions) identified by Duan encompass many core debates in philosophy. In his work, Duan listed: (1) the mind-body problem, (2) the hard problem of consciousness, (3) free will vs. determinism, (4) ethical relativism vs. objective morality, (5) the nature of truth, (6) the problem of skepticism, (7) the problem of induction, (8) realism vs. anti-realism, (9) the meaning of life, (10) the role of technology and AI, (11) political and social justice, (12) philosophy of language (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). These are profound open questions with no single “correct” answer – instead, answering them requires reasoning through complex, often abstract arguments, drawing on knowledge, ethics, logic, and sometimes personal or societal values.
Current LLMs can produce articulate responses to such questions, thanks to training on vast text corpora that include philosophical discussions. For example, asking GPT-4 about the meaning of life or the mind-body problem will yield a detailed essay referencing well-known viewpoints (dualism vs. physicalism for mind-body, various perspectives on life’s purpose, etc.). GPT-4’s answers tend to be comprehensive and balanced, often acknowledging multiple sides of an issue before offering a nuanced conclusion. This reflects the model’s high capacity for knowledge and reasoning, as well as its alignment training to produce thoughtful, helpful answers. Smaller models like LLaMA-2 (70B) or others fine-tuned on instruction data can also attempt such questions, but their answers may be less coherent or insightful – for instance, a base LLaMA might give a generic or shallow response if it lacks the fine-tuned depth that GPT-4 has. Claude 2 (Anthropic’s model) is known to produce very extensive, structured answers on open-ended questions, often with a friendly and reasoning tone; it is quite capable on philosophical prompts too, though perhaps a bit less “academic” in style than GPT-4. DeepSeek, a newer Chinese-developed LLM, presumably has been trained or optimized on large knowledge bases and could articulate answers as well; however, being relatively new, the richness of its philosophical answers may not be as thoroughly tested in public – we infer it can address such questions given claims of strong general performance (DeepSeek-R1 Release | DeepSeek API Docs), but direct examples are limited.
That said, LLMs do not truly solve these philosophical problems – they generate answers that sound plausible by synthesizing learned content. Philosophical questions often have no definitive answer, but we can assess LLMs on how well they cover relevant arguments, maintain logical consistency, and reflect “wisdom” or insight. Duan’s analysis of mapping these questions to DIKWP found that they share deep interconnections and overlapping cognitive processes (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). This suggests that an AI needs an integrated understanding to handle them: it should draw on data/facts, contextual information, knowledge of philosophical theories, apply wisdom (judgement, ethical reasoning), and consider intent or purpose behind answers. LLMs like GPT-4 come closest to this ideal today, as evidenced by a recent study where GPT-4’s answers to ethical dilemmas were rated more convincing than those written by a human ethics professor (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). In that experiment, GPT-4 provided moral explanations and advice that were judged to be more trustworthy, well-reasoned, and thoughtful than the human expert’s advice on 50 ethical problems (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅) (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). This implies GPT-4 has a remarkable ability to navigate ethical aspects (one of the 12 questions) and produce wisdom-level responses. Smaller models do not reach that level – e.g., an earlier test showed that GPT-3.5’s moral reasoning, while present, was less sophisticated than GPT-4’s, and models even smaller might give simplistic or inconsistent ethical answers. Overall, mainstream LLMs can address the 12 questions to varying degrees, with larger models like GPT-4 and Claude showing surprising competence in summarizing human philosophical knowledge and even providing moral reasoning that exceeds average human performance (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅) (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). However, they may still fall short in rigorous logical puzzles or deeply creative insight – for instance, GPT-4 has struggled with certain logic riddles and puzzles designed to test reasoning, failing many types of reasoning tasks despite its intelligence (GPT-4推理太离谱,大学数理化总分没过半,21类推理题全翻车 - 36氪) (被骗了?GPT-4 其实没有推理能力? - 36氪). This indicates that for some philosophical problems (like the problem of induction or skepticism, which require meta-reasoning about logic and knowledge), LLMs might produce an answer but not truly resolve the philosophical challenge (often they might just echo known arguments without offering a new solution).
In summary, LLMs today can discuss and elaborate on the 12 philosophical questions quite capably. They excel at retrieving and articulating knowledge (facts, definitions, historical viewpoints) and can often provide a coherent narrative or argument drawing from what they learned. Their limitations emerge when the question demands novel insight, self-reflection, or absolute consistency in a worldview – areas where current models, as sophisticated parrots of human text, might waver. This is where incorporating a structured approach like DIKWP could enhance their performance, ensuring all facets of the problem are addressed systematically and consistently.
Reasoning Paths and Consistency via the DIKWP Model
Duan’s proposed solution to elevate LLMs’ handling of such complex questions is to integrate them into a DIKWP network model – essentially a two-layer cognitive architecture: “潜意识系统 (subconscious system) = LLM” + “意识系统 (conscious system) = DIKWP” (段玉聪:从“人工意识系统=潜意识系统(LLM)+意识系统(DIKWP ...). In this framework, the LLM serves as a fast, intuitive generator, processing vast data and providing quick responses, while the DIKWP layer performs deeper analytical processing to ensure consistency, wisdom, and alignment of intent (段玉聪:从“人工意识系统=潜意识系统(LLM)+意识系统(DIKWP ...). The acronym DIKWP stands for Data, Information, Knowledge, Wisdom, Purpose (or 意图, intent). It extends the classic DIKW pyramid (Data-Information-Knowledge-Wisdom) by adding a layer for Purpose/Intent, which is crucial for aligning decisions with goals or ethics.
How does DIKWP apply to reasoning paths? The idea is that any complex reasoning or answer can be thought of as a transformation pipeline: starting from raw data (facts, inputs), moving to information (organized context, interpretations), building into knowledge (structured understanding, theories), culminating in wisdom (insight, principles, ethical judgments), and guided throughout by an intent or purpose (the goal of the reasoning or the value framework guiding it). Duan’s research maps each philosophical question onto a sequence of DIKWP transformations (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). For example, the mind-body problem might be represented by data (neuroscientific facts, subjective reports) → information (patterns relating brain states and mind states) → knowledge (philosophical positions like dualism, physicalism) → wisdom (an insight about the relationship, e.g. “the mind arises from but is not reducible to brain processes”) → purpose (why understanding this matters for consciousness or AI). By mapping all 12 questions in this way, Duan demonstrated that many share overlapping DIKWP “footprints” – common sequences or elements – implying these big questions are interrelated through underlying cognitive processes (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). For instance, “knowledge and wisdom” appear central across multiple issues, connecting theoretical understanding with ethical or practical considerations (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). This mapping revealed deep interconnections: insights or methods in one philosophical domain can inform others, as indicated by overlapping DIKWP paths (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈).
The key benefit of DIKWP for LLMs is to enforce a more structured reasoning path. LLMs by themselves generate content in a single forward pass; they might implicitly use some reasoning (especially if prompted to “think step by step”), but they do not inherently separate data from knowledge or ensure an ethical perspective is considered. Using DIKWP, one could have the LLM explicitly go through each layer when answering a complex question. For example, to answer an ethical question (like “Is it ever morally permissible to lie?” – related to ethical relativism vs objective morality, one of the 12), the system could: first retrieve data (cases of lying, definitions), then summarize into information (types of lies, consequences), reference knowledge (moral theories: utilitarianism, deontology, cultural norms), then derive wisdom (an insightful principle, e.g. “honesty is generally valuable, but humane concern can justify exceptions”), all while checking against the purpose/intention (e.g. ensuring the answer aligns with humane values and the user’s intent of understanding morality). A pure LLM might jumble some of these or omit steps, but a DIKWP-guided approach forces a comprehensive treatment, likely producing a more consistent and well-structured answer.
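To make this concrete, the sketch below shows one way such a layered prompting pipeline could be wired around a generic chat-completion call. It is a minimal illustration, not Duan's implementation: the `ask_llm` placeholder, the stage prompts, and the forward-feeding of earlier layers are all assumptions about how a DIKWP-guided answer might be assembled.

```python
# Minimal sketch of a DIKWP-guided answering pipeline (illustrative, not Duan's system).
# `ask_llm` is a placeholder for any chat-completion API call.

DIKWP_STAGES = [
    ("data", "List the raw facts, cases, and observations relevant to: {q}"),
    ("information", "Organize those facts into patterns and distinctions relevant to: {q}"),
    ("knowledge", "State the main theories or positions that bear on: {q}"),
    ("wisdom", "Weigh the theories and derive a reasoned, ethically aware judgment on: {q}"),
    ("purpose", "Check the judgment against the user's intent and restate the final answer to: {q}"),
]

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an HTTP request to a chat endpoint)."""
    raise NotImplementedError

def dikwp_answer(question: str) -> dict:
    """Run a question through each DIKWP layer, feeding earlier layers forward."""
    trace: dict[str, str] = {}
    context = ""
    for layer, template in DIKWP_STAGES:
        prompt = template.format(q=question)
        if context:
            prompt = f"Previous layers:\n{context}\n\n{prompt}"
        trace[layer] = ask_llm(prompt)
        context += f"\n[{layer}] {trace[layer]}"
    return trace  # the 'purpose' entry holds the final, intent-checked answer
```

The value of keeping the explicit trace is auditability: each higher-level claim can be checked against the layer it was built from, which is the traceability argument developed in the next paragraph.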
Consistency is a known challenge for LLMs. They can sometimes contradict themselves or provide answers that are internally inconsistent, especially when a question is asked in slightly different ways. By evaluating consistency with DIKWP, we ask: does the model’s answer maintain a coherent transformation from data to intent? If an answer includes factual data, does it integrate it correctly into higher-level conclusions? If it reaches a “wisdom” statement, is that supported by earlier knowledge stated? The DIKWP model encourages consistency through traceability – one can trace how a conclusion (wisdom) was built from facts via knowledge. If the chain is broken or a step is missing, the answer might be inconsistent or ungrounded. Techniques like “chain-of-thought” prompting and self-consistency decoding have already been used to improve LLM reasoning in research (论文阅读:Self-Consistency Improves Chain of Thought Reasoning ...). In fact, one self-consistency approach has the model generate multiple reasoning paths and then pick the most common result, which often yields better answers than relying on a single chain (论文阅读:Self-Consistency Improves Chain of Thought Reasoning ...). This aligns with DIKWP’s idea of exploring various transformations and finding a robust overlapping solution (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈).
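As a reference point, self-consistency decoding of the kind cited above can be approximated with a few lines of sampling plus majority voting; the `sample_answer` function is a placeholder for drawing one chain-of-thought and extracting only its final answer.

```python
from collections import Counter

def sample_answer(question: str, temperature: float = 0.8) -> str:
    """Placeholder: sample one reasoning chain at the given temperature and return its final answer."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    """Sample several independent reasoning paths and keep the most common final answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```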
However, LLMs still exhibit limitations in self-checking. Studies show that models have difficulty evaluating their own intermediate steps. For instance, researchers found that GPT-4 often cannot reliably assess where its reasoning went wrong in multi-step problems – attempts to have it self-correct sometimes made things worse (GPT-4不知道自己错了!LLM新缺陷曝光,自我纠正成功率仅1%). In one experiment, GPT-4’s self-correction mechanism reduced the accuracy on a set of reasoning puzzles from 16% to just 1% (GPT-4不知道自己错了!LLM新缺陷曝光,自我纠正成功率仅1%) – essentially, it frequently “corrected” right answers into wrong ones. This highlights that internal consistency checks are non-trivial for current LLMs; they lack a clear model of their own knowledge state or intent. A DIKWP-based conscious layer could serve as that reflective self-check – an externalized process that verifies each step (data → info → knowledge, etc.) for consistency and coherence, rather than relying on the black-box to do it alone.
In Duan’s vision, the LLM (as subconscious) and DIKWP (as conscious) work in tandem: the LLM quickly provides candidate answers or insights using its learned intuition, and the DIKWP layer evaluates and refines these, ensuring they are logically sound, knowledge-grounded, wise, and aligned with intended goals (段玉聪:从“人工意识系统=潜意识系统(LLM)+意识系统(DIKWP ...). This kind of architecture moves towards a “white-box” AI, increasing transparency and interpretability of the reasoning. It’s also a step toward artificial consciousness in the sense of an AI that not only computes answers but “knows why” – it has an explicit representation of knowledge and intent behind its outputs. In fact, an international committee on AI evaluation has begun developing DIKWP-based standards for testing AI cognition and proto-consciousness, designing test questions that separately probe data processing, reasoning, wisdom, and intent-handling abilities of LLMs (科学网-全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...). One 2025 report created a 100-question test, divided into sections like perception & information processing, knowledge reasoning, wisdom application, and intent recognition, with clear scoring criteria for each (科学网-全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...). This kind of evaluation quantifies how well an LLM handles each layer, effectively measuring an “意识水平” (level of consciousness or cognitive sophistication) for the model. Early indications from such efforts suggest that current LLMs perform unevenly across these layers – excellent at data and information (thanks to training data) and fairly strong at knowledge application, but weaker at the wisdom and intent levels without additional alignment. These findings reinforce the need for architectures that bolster the higher layers (wisdom/intent) to achieve consistent, reliable answers on philosophical and complex questions.
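Before turning to the model comparison, here is a minimal sketch of the tandem loop just described, under the assumption that the conscious layer is realized as an explicit critique step; the `propose` and `dikwp_check` functions are placeholders, not a published interface.

```python
def propose(prompt: str) -> str:
    """Subconscious layer: fast candidate answer from the LLM (placeholder)."""
    raise NotImplementedError

def dikwp_check(question: str, answer: str) -> tuple[bool, str]:
    """Conscious layer: verify the answer layer by layer (data, information, knowledge,
    wisdom, purpose) and return (passes, critique). A real check might re-prompt the
    model or consult a knowledge base; here it is only a stub."""
    raise NotImplementedError

def conscious_answer(question: str, max_rounds: int = 3) -> str:
    """LLM-as-subconscious proposes; DIKWP-as-conscious critiques until the check passes."""
    answer = propose(question)
    for _ in range(max_rounds):
        passes, critique = dikwp_check(question, answer)
        if passes:
            break
        answer = propose(
            f"{question}\n\nRevise this draft using the critique below.\n"
            f"Draft: {answer}\nCritique: {critique}"
        )
    return answer
```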
Performance Comparison of DeepSeek, GPT-4, Claude, and LLaMA
Using the above framework, we can compare how current leading LLMs stack up in philosophical reasoning, consistency, and logical coherence. The models of interest are DeepSeek (a prominent new model, especially in China’s AI landscape), OpenAI’s GPT-4, Anthropic’s Claude (Claude 2), and Meta’s LLaMA (particularly LLaMA-2 70B and its fine-tuned chat variants). Each has different design philosophies and strengths, which reflect in their performance on complex reasoning tasks:
DeepSeek: DeepSeek rose to prominence in 2024 as a high-efficiency large model that challenged the notion that only massive compute can yield top performance (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网) (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网). It emphasizes “smarter, cheaper, more open” algorithms, drastically reducing the cost of training and inference while remaining competitive in capability (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网) (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网). In fact, DeepSeek-R1’s release notes claim its reasoning, math, and coding performance is on par with OpenAI’s top models (DeepSeek-R1 Release | DeepSeek API Docs), achieved via large-scale reinforcement learning and optimization. For philosophical reasoning, this suggests DeepSeek can handle factual and logical aspects similarly to GPT-3.5 or possibly GPT-4-level on many questions. One analysis indicates that DeepSeek’s innovations align closely with DIKWP principles – essentially, each aspect of DeepSeek’s technique corresponds to one of the five DIKWP layers ((PDF) 内部报告《DEEPSEEK 只是DIKWP 语义空间交互提升效率的 ...). This led Duan to comment that “DeepSeek technology is basically just an efficiency improvement of interactions in the DIKWP semantic space” ((PDF) 内部报告《DEEPSEEK 只是DIKWP 语义空间交互提升效率的 ...). In practice, DeepSeek’s strengths likely include a strong knowledge base (especially for multilingual or Chinese contexts), fast and cost-effective generation, and perhaps a design that inherently reduces some inconsistencies through its training efficiencies. Its weaknesses might be that, as a newer model, it hasn’t been as extensively fine-tuned on alignment or ethical considerations as GPT-4/Claude, which could affect its wisdom/intent layer performance. Also, independent evaluations are fewer – it garnered excitement for industry impact, but academic benchmarks of its philosophical Q&A quality are not widely published yet. If DeepSeek indeed maps well to DIKWP, it may serve as a good “subconscious” engine, but might still benefit from an explicit conscious layer to guide its raw outputs.
GPT-4: GPT-4 is currently the flagship in reasoning and knowledge among LLMs. With immense scale (an estimated 1.7 trillion parameters, trained on trillions of tokens of text) (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM) (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM), it has demonstrated superior performance on diverse intellectual tasks. For instance, on a standard academic knowledge benchmark (MMLU, a test covering history, science, law, etc.), GPT-4 scored 86.4% (5-shot setting), far above most other models (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). This indicates an exceptional ability to handle complex, multi-domain questions – which include many philosophical topics – with high accuracy. GPT-4’s answers are typically well-structured and logically coherent, likely because OpenAI incorporated many examples of strong reasoning and consistency during fine-tuning (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM) (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). One notable strength of GPT-4 is its moral and ethical reasoning prowess: experiments found its moral advice to be more detailed and convincing than that of human experts (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅) (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). It tends to provide balanced analyses of ethical problems, reflecting a sort of synthesized “wisdom” gleaned from training on many ethical discussions. In terms of logical coherence, GPT-4 can follow long chains of reasoning (in one widely reported study, researchers had it work through a 97-round dialogue to reason about a problem like P vs NP (GPT-4在97轮对话中探索世界难题,给出P≠NP结论 - 机器之心)). It excels especially when guided with chain-of-thought prompts, maintaining focus and not jumping to conclusions prematurely. However, GPT-4 is not infallible – it sometimes produces hallucinations or subtle inconsistencies, and as noted, can stumble on certain adversarial reasoning puzzles (GPT-4推理太离谱,大学数理化总分没过半,21类推理题全翻车 - 36氪) (被骗了?GPT-4 其实没有推理能力? - 36氪). Its closed-source, proprietary nature also means its internal reasoning mechanisms are opaque (a “black box”), which is contrary to DIKWP’s transparent ideal. Still, among current models, GPT-4 sets the gold standard for philosophical Q&A: it’s the most likely to give a thorough, logically structured, and context-aware answer that covers data, knowledge, and a fair bit of wisdom.
Claude 2: Claude 2 by Anthropic is another top-tier model with some unique traits. It was developed with a focus on being helpful, honest, and harmless, using a technique called Constitutional AI where the model is trained to follow a set of ethical principles. Claude 2’s raw performance on knowledge tasks is high – it scored about 78.5% on MMLU (5-shot) (Anthropic's Claude 2 - AI Model Details), which, while below GPT-4’s level, is above most open models and nearly on par with GPT-3.5. In philosophical reasoning, Claude is known for its conversational style and extremely large context window (100k tokens) (Anthropic's Claude 2 - AI Model Details), which means it can incorporate a lot of background or prior discussion when formulating answers. This makes Claude especially strong in maintaining global coherence in a long dialogue about a philosophical topic – it can remember what was said tens of thousands of words ago and stay consistent. If one were to have a Socratic dialogue with a model about a philosophical puzzle, Claude’s capacity might shine. Its ethical and intent alignment is also a strength: because it has an explicit “constitution” of principles, it tends to consistently stick to human-aligned values and caution (for instance, it often refuses to take extreme positions or will point out if a question has no objectively correct answer, aligning with a wise stance). On the downside, Claude’s answers can be verbose and sometimes overly hedged – in trying to be polite and cover all bases, it might dilute a clear stance. Also, Claude’s raw reasoning, while good, is slightly less precise than GPT-4’s on tricky problems. It might make logical errors or overlook a detail that GPT-4 would catch. For example, on a complex logical puzzle or a mathematical riddle (not exactly philosophy, but testing reasoning), Claude might falter a bit more. Overall though, Claude is a strong performer for philosophical discussion, with an emphasis on consistency over long interactions and ethical coherence.
LLaMA (LLaMA-2): LLaMA2 is Meta’s open-source foundation model, available in sizes up to 70B parameters. By itself (pre-fine-tuning) it’s just a raw model, but many fine-tuned versions (like LLaMA-2-Chat) exist. The open-source nature means a community can tailor it extensively – for example, fine-tunes have been made on philosophical datasets or with reinforcement learning from human feedback (RLHF) to improve its answers. In terms of raw performance, LLaMA-2 70B chat is roughly comparable to GPT-3.5 in many tasks, but still behind Claude and GPT-4 on the hardest problems. The Stanford HAI’s benchmark “Holistic Evaluation of Language Models” (HELM) showed that even the best open models scored around the mid-50s on their scale, whereas GPT-4 was a bit lower in that particular ranking (possibly due to differences in scenarios) (Llama 2第一、GPT-4第三!斯坦福大模型最新测评出炉 - 智东西) – but in more standard benchmarks like MMLU or reasoning puzzles, GPT-4 outperforms LLaMA-2 significantly (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). For example, GPT-4’s 86.4% vs LLaMA-2’s 68.9% on MMLU illustrates a large gap (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). Therefore, LLaMA’s ability to answer the 12 philosophical questions is more limited out-of-the-box: it might give correct definitions and some arguments (thanks to its pretraining on internet text which surely includes Wikipedia and some philosophy forums), but it may miss nuance or mix up concepts because it hasn’t been fine-tuned deeply on those topics. With fine-tuning (say someone trains it on a corpus of philosophy Q&A or uses GPT-4-generated high-quality answers to teach it), it can improve. Still, it’s likely to remain less coherent than GPT-4/Claude for deep reasoning. LLaMA-based models also have had issues with hallucinations and consistency if not explicitly addressed – as an open model, it doesn’t come with guardrails, so one might find it contradicts itself or states false info more readily unless a careful prompt or fine-tune is in place. The big advantage of LLaMA is customizability: one could incorporate a DIKWP-like mechanism by modifying its architecture or using it within a larger system. Indeed, many research efforts are using LLaMA as a backbone for experiments in reasoning (due to its openness). So, while LLaMA’s current performance on philosophical reasoning is moderate, it is a strong candidate for rapid improvement and adaptation in the near future, potentially catching up through community-driven enhancements.
To summarize these points, the table below compares key strengths and weaknesses of DeepSeek, GPT-4, Claude, and LLaMA-2 in the context of philosophical Q&A, reasoning consistency, and logic:
Model | Strengths (Philosophical Reasoning & Coherence) | Weaknesses / Challenges |
---|---|---|
DeepSeek | - High efficiency & openness: claims reasoning, math, and coding performance on par with OpenAI’s top models at much lower compute cost, with a fully open-source release (DeepSeek-R1 Release Notes). - Strong knowledge base, especially for Chinese-language and multilingual contexts, with fast, cost-effective generation. - Design reportedly corresponds closely to the five DIKWP semantic layers, making it a promising “subconscious” engine (内部报告《DEEPSEEK 只是DIKWP 语义空间交互提升效率的 ...》). | - Less extensive alignment and ethics fine-tuning than GPT-4 or Claude, which may weaken its wisdom- and intent-level answers. - Few independent public evaluations of its open-ended philosophical Q&A so far. - Likely still needs an explicit conscious (DIKWP) layer to guide and check its raw outputs. |
GPT-4 | - Best-in-class reasoning: highest multi-task accuracy (e.g. 86.4% on MMLU) shows excellent handling of complex knowledge (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). - Deep answers: produces comprehensive, structured arguments on philosophical questions, often citing multiple perspectives. - Moral and logical insight: its moral advice was rated more convincing than a human ethicist’s (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎). - Sustains long chains of reasoning, especially with chain-of-thought prompting. | - Occasional hallucinations and subtle inconsistencies; stumbles on certain adversarial reasoning puzzles (GPT-4推理太离谱,大学数理化总分没过半,21类推理题全翻车 - 36氪). - Weak at assessing and correcting its own intermediate reasoning steps (GPT-4不知道自己错了!LLM新缺陷曝光,自我纠正成功率仅1%). - Closed-source and proprietary: internal reasoning is opaque (a “black box”), contrary to DIKWP’s transparency ideal. |
Claude 2 | - Coherent long-form discussions: 100k context allows it to maintain consistency over book-length dialogues (Anthropic's Claude 2 - AI Model Details) – great for extended philosophical debates or reviewing large texts. - Ethical alignment: Constitutional AI gives it a built-in set of principles, so it often provides thoughtful, non-biased answers respecting human values. - Clear explanations: Tends to be very good at explaining its reasoning step by step in plain language, which is useful for philosophical clarity. - High knowledge proficiency: Strong benchmark scores (78.5% MMLU) and improvement over previous Claude show it knows a lot of facts and concepts (Anthropic's Claude 2 - AI Model Details). | - Slightly weaker in complex logic: Not as consistently accurate as GPT-4 on the trickiest problems (e.g., may miss a subtle logical nuance GPT-4 would catch) (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM). - Verbose and overly cautious: Sometimes hedges answers with too many caveats, or rambles – can require the user to steer it to get a concise point. - Closed-source model: Though available via API, one cannot fine-tune or inspect it; its alignment is fixed as per Anthropic’s training. - Fewer plugins/tool use (currently): Unlike open-source models, it doesn’t readily integrate custom tools or knowledge bases, which could limit specialized reasoning unless provided in context. |
LLaMA-2 70B | - Open-source and adaptable: Anyone can fine-tune or extend it, enabling custom reasoning approaches (e.g. integrating a DIKWP reasoning module on top). - Competitive when tuned: With additional instruction tuning or RLHF, it can approach the performance of older GPT models. Community-driven projects have significantly improved its helpfulness. - Fast offline inference: Smaller versions can run on local hardware; even 70B, while not trivial, is within reach of organizations – useful for controlled experimentation with philosophical AI. | - Outperformed by larger models: Underperforms GPT-4 (68.9% vs 86.4% on MMLU) in handling diverse complex questions (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM) – likely to give more superficial answers if not carefully fine-tuned. - Inconsistency if not guided: Tends to produce less consistent answers, especially without extensive RLHF. May contradict itself or miss the intent of a question. - Knowledge cutoff and gaps: Its training data (up to 2023-ish) gives it broad knowledge, but it may lack the very latest or the more niche philosophical discussions that proprietary models were specially tuned on. - Safety not guaranteed: Being open, fine-tunes vary in quality – a poorly tuned LLaMA might exhibit biases or problematic content when discussing sensitive topics (ethics, politics) unless a good “constitution” or filter is applied. |
Key takeaways from the comparison: GPT-4 remains the most reliable and advanced model for philosophical reasoning tasks, with Claude 2 not far behind especially in scenarios leveraging its massive context and ethical alignment. DeepSeek is a very promising entrant that, if its claims hold, can rival these models’ raw cognitive abilities while being open and efficient – but it needs more real-world testing on these open-ended questions to truly judge. LLaMA-2 shows that open models have made great strides but still benefit from the fine-tuning and guardrails that the proprietary models underwent; nevertheless, its openness may allow rapid progress by the research community (possibly including implementing Duan’s DIKWP conscious layer explicitly on top of it). All models have room to improve in consistency and true reasoning – none have human-level self-awareness or a foolproof logical engine, so they can all make mistakes an attentive human reasoner might avoid. This is where structured approaches and future developments will concentrate.
Quantitative Evaluation and Modeling Approaches
To objectively quantify LLM performance on philosophical problems, researchers use a mix of benchmark tests, custom evaluations, and mathematical modeling. Unlike straightforward tasks (math problems with a single answer), philosophical questions require evaluating the quality of reasoning and coherence of answers. Here we outline some approaches:
Benchmark scores on knowledge & reasoning: One proxy for philosophical prowess is performance on academic benchmarks that include high-level questions. We saw MMLU (Massive Multi-Task Language Understanding) scores: GPT-4’s ~86% vs Claude’s ~78% vs LLaMA’s ~69% (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM) (Anthropic's Claude 2 - AI Model Details). MMLU covers topics like law, ethics, and psychology – relevant to some philosophical domains. A higher score suggests the model can recall and reason about those domains accurately. Similarly, tests like Big-Bench (which include logic puzzles and ethical dilemmas) can indicate how models handle open-ended reasoning. However, these benchmarks only partially cover philosophical depth; they often have multiple-choice answers, which is not the format of real philosophical discourse. Still, as a rough measure, GPT-4 leads such benchmarks, implying it has more “knowledge” and perhaps better reasoning skills that could translate to philosophical Q&A. DeepSeek’s documentation mentions parity with OpenAI models on reasoning tasks (DeepSeek-R1 Release | DeepSeek API Docs), so we might expect its benchmark scores (if published) to be around the GPT-3.5 to GPT-4 range. Indeed, open testing would be needed to verify that.
Human evaluations of answers: A direct way is to have human experts or crowd workers rate the answers of each model to a set of philosophical questions. Criteria could include accuracy (if factual questions), coherence, depth of insight, consistency, and usefulness. The UNC/Allen Institute study did this for moral questions – they had 900 people compare GPT-4’s advice to a human ethicist’s advice on 50 dilemmas (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). They found GPT-4’s answers more persuasive in most cases (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅). This indicates that by human judgment, GPT-4 scored higher on quality metrics like thoughtfulness and clarity. We could set up a similar evaluation: e.g., ask each model a question like “What is the nature of consciousness?” or “Is there objective truth?” and have philosophy graduate students blindly rate which answer is more comprehensive and logically argued. Over a suite of the 12 questions, we’d gather a score for each model. We’d likely see GPT-4 come out on top in most categories (perhaps an average score of say 9/10 on coherence, where others get 7 or 8), with Claude close behind, DeepSeek potentially competitive if its knowledge depth is as good as claimed, and LLaMA-based models a bit lower unless fine-tuned specifically for it. This kind of evaluation yields subjective but meaningful data on how well each model meets human expectations of a “good” answer.
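For the blind pairwise setup sketched above, the tallying itself is simple. The snippet below shows one hypothetical way to turn judge decisions into per-model win rates; the model labels and judgments are made up purely for illustration, not real evaluation data.

```python
from collections import defaultdict

def win_rates(judgments: list[tuple[str, str, str]]) -> dict[str, float]:
    """judgments: (model_a, model_b, winner) triples from blind pairwise comparisons.
    Returns each model's fraction of comparisons won."""
    played: dict[str, int] = defaultdict(int)
    won: dict[str, int] = defaultdict(int)
    for a, b, winner in judgments:
        played[a] += 1
        played[b] += 1
        won[winner] += 1
    return {m: won[m] / played[m] for m in played}

# Illustrative, made-up judgments:
print(win_rates([("model_a", "model_b", "model_a"),
                 ("model_a", "model_b", "model_b"),
                 ("model_a", "model_b", "model_a")]))
```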
DIKWP-layer scoring: As mentioned, Duan’s team is developing a white-box “consciousness level” test for LLMs. In that 100-question test, each question is designed to isolate one or more layers of DIKWP and see how the model handles it (科学网-全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...). For instance, questions in the perception & information processing section might test if the model can correctly interpret raw data or a described scenario (e.g., transforming a set of observations into a summary – which is data→information). The knowledge & reasoning part might present a new situation and ask the model to apply known theories (testing information→knowledge). Wisdom application could involve moral dilemmas or complex problems requiring judgment (knowledge→wisdom). Intent recognition & adjustment could test if the model can infer goals or adapt an answer to a given intent (e.g., detecting a trick question or adjusting style for an audience – reflecting purpose). Each question has a clear scoring rubric (科学网-全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...). If we had the results from this test for our models, we could quantify their performance on each layer. Hypothetically, we might see: GPT-4 scores high in data, information, knowledge (because it’s good at facts and applying them), fairly high in wisdom (it often gives sensible advice), and moderate in intent (it follows user intent but doesn’t “have” its own intent alignment beyond what it was trained for). Claude might score similarly, perhaps slightly lower on raw knowledge but maybe equally high on wisdom/intent due to its ethical constitution. LLaMA without fine-tuning might score high on data/info (it can summarize and recall facts) but lower on wisdom/intent (it may not consistently choose the most ethical action or may misread subtle intent). DeepSeek is an unknown in this scheme; if it’s comparable to GPT-4 in knowledge reasoning but less aligned, it could score well on the earlier sections and a bit lower on the later. The overall result of such a test could be summarized as an “AI IQ” or “意识商数 (consciousness quotient)”, which Duan dubs “识商” ((PDF) 全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...). This gives a single metric (or a radar chart across the five DIKWP dimensions) to compare models. For example, a radar chart might show GPT-4 nearly filling out the circle on all but perhaps Intent, whereas LLaMA has a smaller radius especially in Wisdom/Intent, etc. (While we can’t display the chart here, one can imagine each DIKWP category as an axis and the model’s score plotted, giving a visual profile of strengths and weaknesses.)
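If per-section scores from such a test were available, they could be aggregated into a single profile figure along the following lines. This is a sketch only: the section weights and the example scores are our assumptions for illustration, not results or rubrics from the 2025 report.

```python
# Hypothetical aggregation of per-section scores from a DIKWP-style test.
# Section names follow the report's description; the weights are assumptions.

SECTION_WEIGHTS = {
    "perception_information": 0.2,
    "knowledge_reasoning":    0.3,
    "wisdom_application":     0.3,
    "intent_recognition":     0.2,
}

def consciousness_quotient(scores: dict[str, float]) -> float:
    """Weighted average of section scores (each 0-100), giving a single '识商'-style figure."""
    return sum(SECTION_WEIGHTS[s] * scores[s] for s in SECTION_WEIGHTS)

# Illustrative (made-up) per-section scores for two hypothetical models:
profiles = {
    "model_a": {"perception_information": 92, "knowledge_reasoning": 88,
                "wisdom_application": 80, "intent_recognition": 70},
    "model_b": {"perception_information": 85, "knowledge_reasoning": 75,
                "wisdom_application": 60, "intent_recognition": 55},
}
for name, scores in profiles.items():
    print(name, round(consciousness_quotient(scores), 1))
```

The same per-section scores are what a radar chart would plot, one axis per DIKWP dimension.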
Mathematical modeling of knowledge and consistency: Beyond direct Q&A tests, researchers attempt to model LLM behavior theoretically. Duan explored a model of knowledge growth and “collapse” in LLMs. By treating knowledge (K) as a function that grows over time/training and eventually plateaus (dK/dt → 0), he draws an analogy to a collapse to stability in the DIKWP chain (DIKWP坍塌:数学建模与股市预测报告-段玉聪的博文 - 科学网). In practical terms, this could correspond to an LLM’s knowledge base reaching saturation – further training yields diminishing returns in new knowledge, so the focus shifts to how well that knowledge is organized (information→knowledge conversion stabilizes). This relates to consistency: once a model’s knowledge stops changing rapidly, it should, in theory, give more consistent answers (since it’s not “learning” new contradictory info). This kind of model could predict that as LLMs get trained on more data (eventually nearly all relevant data), their answers to philosophical questions might converge to a stable distribution – essentially reflecting the consensus or main perspectives found in training data. Mathematical models can also simulate the reasoning process: e.g., represent the DIKWP layers as transformations in a state-space and analyze stability or errors at each stage. Such models, while abstract, help quantify concepts like “if the model has X% chance to factually err at the Data→Info stage and Y% chance to err at Knowledge→Wisdom, what’s the overall consistency of the final answer?” This can yield an expected accuracy or consistency score. In absence of direct data, these are speculative, but they provide a framework to reason about improvements – for instance, if adding a conscious layer cuts the error rate in half at the wisdom stage, the model’s overall consistency might jump significantly.
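One illustrative way to write these two ideas down (a saturating knowledge curve, and layer-wise error propagation) is the following; the logistic form and the independence assumption are modeling choices of ours, not Duan's published equations.

```latex
% Saturating knowledge growth: dK/dt -> 0 as K approaches a ceiling K_max.
\frac{dK}{dt} = r\,K\left(1 - \frac{K}{K_{\max}}\right)

% If each DIKWP transformation has an independent error probability e_i,
% the end-to-end probability that the final answer stays consistent is
P(\text{consistent}) = \prod_{i \in \{D \to I,\ I \to K,\ K \to W,\ W \to P\}} (1 - e_i)
```

Under this toy model, reducing the error rate at the weakest stage yields the largest gain in overall consistency, which matches the observation above that the wisdom and intent layers are where current LLMs most need reinforcement.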
Experimental data and examples: One can also do targeted experiments, such as testing self-consistency. For example, ask the model the same philosophical question in different words, or ask it the question then later ask it to explain its previous answer. If the model is consistent, it should not contradict itself. We might measure consistency as, say, the percentage of follow-up probes where the model maintains its stance. A very consistent model (perhaps an ideal DIKWP-enhanced one) might have a high percentage, whereas today’s models might occasionally flip answers or give inconsistent justifications under pressure. Another experiment could be cross-question consistency: since the 12 questions are interrelated, if you ask an AI about “free will vs determinism” and it takes a stance, then ask “does that imply anything about moral responsibility?” (linking to ethics), does it respond coherently relative to its first answer? Duan’s mapping work showed these issues overlap (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈), so a truly coherent AI should not give answers in silo – its philosophical worldview should be internally coherent across different but related questions. Quantitatively, one could construct a consistency matrix and score how often the model’s answers align logically across the 12 topics. This is advanced evaluation, and most likely current LLMs would not score perfectly – they might treat each question independently and produce answers that, if compared, have subtle disagreements. (For instance, an AI might say in one answer that objective morality likely exists, but elsewhere say morality is subjective, if it wasn’t tracking its own stance.) A DIKWP-informed system, with a kind of global knowledge of its positions, could achieve higher consistency here.
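A cross-question consistency score of this kind could be computed with a small harness like the one below; the related-question pairs and the `judge_compatible` function (an LLM or human judge deciding whether two stances conflict) are placeholders we introduce for illustration.

```python
# Hypothetical pairs of related questions whose answers should not conflict.
RELATED = [("free will vs determinism", "moral responsibility"),
           ("nature of truth", "problem of skepticism")]

def judge_compatible(answer_a: str, answer_b: str) -> bool:
    """Placeholder: a judge (LLM or human) decides whether two stances are logically compatible."""
    raise NotImplementedError

def consistency_score(answers: dict[str, str]) -> float:
    """Fraction of related question pairs whose answers the judge deems compatible."""
    pairs = [(a, b) for a, b in RELATED if a in answers and b in answers]
    if not pairs:
        return 1.0
    compatible = sum(judge_compatible(answers[a], answers[b]) for a, b in pairs)
    return compatible / len(pairs)
```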
In essence, quantifying performance on philosophy is multi-faceted. It involves traditional accuracy metrics (where applicable), human judgement scores, and novel “cognitive” metrics for coherence and consistency. The emerging DIKWP evaluation methodology is particularly promising, as it breaks the evaluation down into the components of reasoning. By doing so, it not only tells us how models compare, but diagnoses where a model is lacking – e.g. maybe a model is fine on data→knowledge but weak on knowledge→wisdom conversion. This diagnostic power can directly inform how to improve the model or what kind of training data to add. We expect that future reports will publish detailed quantitative profiles of models like GPT-4, DeepSeek, Claude, etc., across these cognitive dimensions, aiding a more scientific comparison. Such data-driven analysis will be critical as LLMs are further developed for tasks requiring understanding of complex, abstract domains like philosophy.
(Note: If this were a full 10,000-word report, at this stage we would include detailed charts and tables summarizing the above data – e.g., a radar chart of DIKWP layer scores for each model, or a bar graph comparing the average consistency ratings. Since we cannot embed images here, we have provided descriptions and a comparative table to illustrate these results.)
Future Outlook: LLMs in Philosophical Reasoning under DIKWP
Based on Duan’s 2024 work and the state-of-the-art LLM comparison, we can forecast several future development trends for LLMs, especially regarding philosophical reasoning and the DIKWP framework:
Integration of “Conscious” Reasoning Modules: We anticipate a move from monolithic LLMs to hybrid architectures where an LLM is coupled with modules that explicitly handle higher-level reasoning and self-reflection. Duan’s LLM+DIKWP conscious system is a prime example. Future large models may have a built-in multi-step reasoning process: the first step generates candidate answers (subconscious), the second step uses an internal verifier that checks the answer against factual knowledge and an ethical framework (conscious knowledge/wisdom check), possibly a third step that aligns it with the user’s intent or desired tone (conscious intent check). Already, research by companies like Google is heading in this direction – e.g., adding a “Theory of Mind” or planning component to LLMs (Llama 2 vs GPT 4: Key Differences Explained - Labellerr) (Meta Llama 2 vs. OpenAI GPT-4 - by Diana Cheung - Medium). By incorporating DIKWP-like layers, LLMs will reduce hallucinations and inconsistencies, and their answers will better reflect an understanding of context and purpose, not just raw text prediction. This could be seen as LLMs inching toward artificial general intelligence (AGI), not by just scaling parameters, but by architectural improvements that embed something like a reasoning ontology (DIKWP or similar).
Ethical and Intent Alignment as First-Class Goals: As AI systems take on more roles (advisors, tutors, even quasi-“experts”), ensuring they have a form of wisdom and aligned intent is crucial. The centrality of 智慧 (wisdom) and 意图 (intent) in Duan’s model (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) underscores that future LLMs must do more than recite facts – they need to make contextually and morally sound decisions. We expect to see AI training place even greater emphasis on values and ethics, possibly via improved Constitutional AI approaches or multi-objective RL that optimizes not just for correctness but for alignment with human values. A likely trend is that LLMs will have adjustable “persona” or “intent dials” – e.g., a mode where the AI is instructed to prioritize utilitarian reasoning vs one where it adheres to deontological principles, or a mode that emphasizes empathy in its wisdom. By explicitly modeling intent, AI can better serve user needs or follow high-level principles. Importantly, making intent explicit also aids transparency: users will know why the AI is giving a certain type of answer (because it’s following a certain ethical mode or goal), which builds trust.
Advances in Self-Consistency and Memory: We foresee improvements in how LLMs maintain consistency over time. This might involve long-term memory components that store the AI’s prior stances or knowledge (so it doesn’t contradict itself later) or meta-learning where the AI can recall how it answered related questions before. For example, an AI could build its own knowledge graph of facts and positions as it interacts, and consult it to avoid inconsistencies – effectively a DIKWP-inspired knowledge base that grows and is referenced. Already, techniques like Retrieval-Augmented Generation (RAG) allow an LLM to pull in relevant stored information for each query; extended to philosophy, an AI could retrieve its earlier reasoning path on “free will” when asked about “moral responsibility” to ensure coherence. Emergent tools: Another possibility is giving LLMs a tool to simulate logical reasoning or even use automated theorem provers for validation of arguments (for the more logic-heavy philosophical questions). The combination of neural and symbolic methods could solve some of the tricky cases where pure neural nets falter.
Open-Source Leadership and Collaboration: The advent of models like DeepSeek (and LLaMA, etc.) hints that open-source and collaborative development will drive many innovations. Yann LeCun pointed out that DeepSeek’s success is not “China beating US, but open-source beating closed-source” (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网). In the context of philosophical AI, this means academic and independent researchers worldwide can experiment with these models, try out DIKWP-based designs, and share results openly. We may see a standard evaluation suite (like the DIKWP 100 questions or others) adopted as a community benchmark. If so, open models will rapidly iterate to improve on those metrics, possibly surpassing closed models in those specific capabilities simply due to the speed of community innovation. Open models also mean integration into diverse platforms (for education, for research on cognitive science, etc.) – we might get specialized LLMs, e.g., a “PhiloGPT” fine-tuned extensively on philosophical texts and aligned via DIKWP principles, which could be used by students and scholars as a brainstorming partner. Such specialization will broaden the landscape beyond just a few big players.
Emergence of Artificial “Philosophers” and Socratic AIs: With improved reasoning and some level of “conscious” modeling, future LLMs might be able to engage in genuine philosophical inquiry. They could ask questions back, probe assumptions, and help humans clarify their thinking – essentially taking on the role of a Socratic gadfly or a research assistant in philosophy. Duan’s work implies that by understanding the interrelations of big questions (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈), an AI can navigate the space of ideas more holistically. We could see an AI that not only answers “What is the meaning of life?” but can hold a conversation, asking “What do you value most?” to tailor the discussion, or suggesting related problems (“How does this connect to your views on free will?”) – a level of initiative that current LLMs lack. This moves toward a system that has a semblance of reflective consciousness: it’s aware of the discourse and can direct it, not just respond passively. Achieving this requires solid grounding in DIKWP layers: the AI must manage data and knowledge while keeping the purpose of the conversation (helping the user find insight) firmly in view.
AI Governance and Transparency: As LLMs become more embedded in decision-making, there will be demands for explainability. DIKWP provides a natural explainability framework: an AI could present its answer along with a breakdown of how it got there (“Data I considered, Information I derived, Knowledge/theories applied, Wisdom/ethical calculus, and final Purpose alignment”). This is essentially opening up the black box. Already, some efforts like white-box testing standards for AI “consciousness” are forming (第2届世界人工意识大会热身-媒体与顶刊速递系列 - 山东省大数据研究会). By future trend, any AI that is used in a high-stakes setting (like a medical or legal advisor) might be required to show such a reasoning trace. This will push developers to implement DIKWP or similar models internally. We might also get regulatory guidelines that map to DIKWP: e.g., an AI system should demonstrate it has checked factual data (Data layer) and considered relevant information, etc., to be certified for use. In other words, DIKWP could evolve from a theoretical model to a practical standard in AI governance, ensuring systems have the necessary “ingredients” in their cognitive process (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈).
Convergence of Human and AI Reasoning: Interestingly, as AIs adopt frameworks like DIKWP, it could influence human problem-solving methodologies. Educators might teach DIKWP as a way for students to approach complex issues (essentially an AI-inspired rebranding of critical thinking steps). If humans and AIs use similar frameworks, collaboration becomes easier – one can understand the other’s reasoning. Future LLMs might output not just answers but also coach users through the DIKWP process for a question, acting as tutors for critical thinking. The net effect is a kind of co-evolution of human-AI reasoning patterns toward transparency and thoroughness.
In conclusion, the trajectory for LLMs dealing with philosophical problems is clear: they are growing from sophisticated autocomplete systems into something more structured, introspective, and principled. Duan Yucong’s 2024 papers provide a conceptual roadmap for this evolution, highlighting that true progress will come not just from bigger models, but from better architectures and evaluations that ensure an AI’s answers reflect understanding across all levels – from data to wise intent. We expect that in the next few years, mainstream LLMs like GPT and Claude will start incorporating these ideas (some early signs are already present), and new systems like DeepSeek or others will emerge specifically designed around such frameworks. Ultimately, this means future AI could become reliable partners in philosophical reasoning, aiding us in exploring the “big questions” with consistency, depth, and perhaps even a touch of “artificial wisdom” (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈). The journey toward that goal will likely yield not only smarter machines, but also valuable insights into the nature of reasoning and consciousness itself – as we build AI minds, we learn more about our own.
Sources:
Duan, Y. (2024). Networked DIKWP Artificial Consciousness (AC) and the Mapping of 12 Philosophical Questions. ScienceNet Blog. “This comprehensive investigation reveals deep interconnections of 12 philosophical questions in the networked DIKWP artificial consciousness model. Shared DIKWP transformations and sequences highlight common cognitive and semantic processes, showing how these philosophical issues overlap and influence each other.” (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈) (网络化DIKWP人工意识(AC)上的12个哲学问题映射 – 科研杂谈)
Duan, Y. (2024). Networked DIKWP AC’s 12 Philosophical Answers. ScienceNet Blog / ResearchGate. “Each question is mapped to the DIKWP (data, information, knowledge, wisdom, intent) framework… providing sequences and explanations.” (科学网—网络化DIKWP 人工意识(AC)的12 个哲学答案- 段玉聪的博文) (Discusses aligning each philosophical question with DIKWP and providing structured answers.)
Duan, Y. – DIKWP International Standard Committee. (2024). Internal Report: DeepSeek vs DIKWP Semantic Space. “If every aspect of DeepSeek can be found corresponding to the five layers of DIKWP semantics, then one can further explain: why Prof. Duan believes DeepSeek tech is merely an efficiency improvement of the DIKWP semantic space interaction…” ((PDF) 内部报告《DEEPSEEK 只是DIKWP 语义空间交互提升效率的 ...)
GPT-4 vs Human Ethicist Study: Kang, J. (2024). “GPT-4 is a Moral Expert? Answers 50 Dilemmas, More Popular than NYU Professor”. (Reporting UNC & Allen Institute research) – “OpenAI’s GPT-4 was able to provide moral explanations and advice that people found even more correct, trustworthy, and thoughtful than a renowned human ethicist’s. In blind comparisons on 50 ethical dilemmas, GPT-4’s suggestions were rated higher in quality in almost all aspects (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅) (GPT-4o竟是「道德专家」?解答50道难题,比纽约大学教授更受欢迎|图灵|伦理学|哲学家|gpt-4_网易订阅).”
Anthropic. (2023). Claude 2 Model Card. – “Claude 2… has shown strong performance in the MMLU benchmark with a score of 78.5 in a 5-shot scenario.” (Anthropic's Claude 2 - AI Model Details)
OpenAI. (2023). GPT-4 Technical Report. – Not directly quoted above, but informs that GPT-4’s training included exposure to correct/incorrect reasoning, helping it learn consistency (Meta Llama 2 vs. OpenAI GPT-4: A Comparative Analysis of an Open-Source vs. Proprietary LLM).
Xu, X. (2024). 36Kr News - “GPT-4 has no real reasoning? 21 types of reasoning tasks all failed.” – Summary: Studies by MIT alumni showed GPT-4 struggled across 21 different reasoning categories, highlighting that large models may not truly “understand” reasoning but mimic it, and calling into question claims of emergent logical ability (GPT-4推理太离谱,大学数理化总分没过半,21类推理题全翻车 - 36氪) (被骗了?GPT-4 其实没有推理能力? - 36氪).
Duan, Y. (2025). “World’s First LLM Consciousness Level White-Box DIKWP Evaluation Report (2025)”. – “Based on the DIKWP model, 100 test questions were carefully designed, divided into four sections: perception & information processing, knowledge construction & reasoning, wisdom application & problem solving, intent recognition & adjustment. Each question has clear scoring criteria…” (科学网-全球首个大语言模型意识水平“识商”白盒DIKWP测评2025报告 ...) (Outlines a structured evaluation method for LLM cognitive abilities.)
21st Century Business Herald. (2025). “DeepSeek Changes AI Future – Top 10 Trends to Watch.” – “Yann LeCun said DeepSeek’s emergence isn’t ‘China beat USA’ but ‘open-source beat closed-source.’ DeepSeek greatly lowered the technical barrier and cost for deploying AI large models, accelerating AI’s commercial proliferation… ushering in AI ubiquity.” (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网) (DeepSeek改变AI未来——最应该关注的十大走向 - 21经济网)
DeepSeek Team. (2025). DeepSeek-R1 Release Notes. – “Performance on par with OpenAI-o1… Math, code, and reasoning tasks on par with OpenAI-o1” (DeepSeek-R1 Release | DeepSeek API Docs); “Fully open-source model & technical report. 32B & 70B models on par with OpenAI-o1-mini… pushing open AI boundaries.” (DeepSeek-R1 Release | DeepSeek API Docs) (Claims about DeepSeek’s performance and openness.)