A critical myth persists: "Intelligence equals lossless compression" or "Lossless compression begets intelligence."
Both are false.
Compression does produce intelligence; that much is correct. But it is decidedly not lossless compression that produces it.
There's a cognitive error at play: many people conflate intelligent compression during training (lossy abstraction) with technical compression for specific applications (lossless encoding/decoding).
Compression has two distinct meanings: first, extracting every drop of insight and regularity from data, approaching theoretical optimality (K-complexity)—this is compression's true definition and intelligence's manifestation. Second, lossless compression, which requires perfect restoration of original data. Lossless compression/restoration isn't a genuine intelligence goal; at most, it's an application requirement (such as in archiving or transmission scenarios).
Lossless compression directly serves lossless restoration, where the lossless standard demands that the output restore 100% of the input information (bit for bit, imperfections included). Clearly, unless it is defined over form rather than meaning, losslessness is ill-defined or meaningless. This differs from ultimate information compression, which targets meaning rather than form. Semantic space embodies the true essence of the statement "compression equals intelligence." Recognizing this distinction is key to dispelling the myth.
GPT, as a general intelligence agent, derives its core value from creative "distortion" in generation tasks (most application scenarios, such as creative writing); lossless compression is merely a technical byproduct (a few application scenarios, such as storage and transmission), and that capability correlates only weakly with intelligence level (through the compression ratio, not through the lossless-restoration goal itself).
Attempting to prove model intelligence through lossless compression capability is inappropriate—like measuring legislators' competence by clerks' shorthand speed. These represent fundamentally different pathways:
Intelligent compression pursues minimal causal generation rules (the K-complexity pathway) and requires actively paying an "abstraction tax"; lossless compression pursues fidelity of data restoration, at the cost of model simplicity.
GPT's revolutionary nature lies in the former; the latter is merely a technical byproduct. In mainstream scenarios like generation and reasoning, creativity (or creative distortion) truly represents intelligence's brilliance, though its side effect of hallucination becomes large models' inherent challenge in some specific task scenarios (e.g. summarization, translation).
GPT uses next token prediction as its autoregressive training objective, which looks like a form of formal compression, since the next token is its gold standard. But in implementation it is unmistakably semantic compression. At the micro level, next token prediction is not scored by whether the model's output token matches the gold standard at the formal level, but by the cross-entropy of the model's predicted distribution, which measures alignment between output and gold standard in semantic space. At the macro level, GPT trains on big data as a whole, not on individual data points (a passage, a song, an image). Lossless compression/restoration is well defined for an individual data point (100% formal restoration), but for big data this definition becomes impractical (except for raw data storage). In other words, compressing big data can only mean semantic-level compression: mining the regularities behind the data.
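The micro-level point can be made concrete with a toy sketch. The cross-entropy loss is simply the negative log-probability the model assigns to the gold token, so a model is graded by probability mass in distribution space, not by a 0/1 match of its top token. The vocabulary, logits, and gold index below are all invented for illustration:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary of 4 tokens; the gold next token is index 2.
gold = 2
logits_a = [1.0, 0.5, 2.0, 0.1]   # model A: gold token is the most probable
logits_b = [2.5, 0.5, 2.0, 0.1]   # model B: gold token is a close second

# Cross-entropy against a one-hot gold label = -log p(gold token).
loss_a = -math.log(softmax(logits_a)[gold])
loss_b = -math.log(softmax(logits_b)[gold])

# B's argmax misses the gold token, yet B is not scored as simply "wrong":
# it still earns partial credit for the probability it gave the gold token.
assert loss_a < loss_b
```

The design point: because the objective is continuous in semantic (probability) space, near-misses are rewarded, which is exactly what distinguishes this from bit-level formal matching.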
Regarding GPT-enabled lossless restoration applications, large models' theoretical foundation of Kolmogorov complexity (K-complexity) supports the "lossy training, lossless application" framework. K-complexity pursues the minimal generating program, not data-restoration capability. During training, lossy compression is the only path to approach K-complexity; during application, lossless restoration exploits the regularity GPT has captured to achieve unprecedented compression ratios, as has been verified by a number of researchers.
Actually, the famous scaling law of large-model training emerges from this very principle. This empirical insight demonstrates that loss is necessary for intelligence: data must far exceed model size for intelligence to improve (otherwise the model "cheats" by memorizing and overfitting within its vast parameters rather than continuously compressing and generalizing).
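The data-must-outpace-parameters point can be sketched with a Chinchilla-style loss curve, L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is token count. The constants below are illustrative stand-ins, not fitted values from any paper:

```python
# Illustrative scaling-law sketch (constants are made up for illustration):
# loss has a floor E, a term that shrinks with model size N, and a term
# that shrinks with data size D. Growing N alone hits the data bottleneck.
def loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    return E + A / N**alpha + B / D**beta

small               = loss(N=1e8,  D=2e9)    # small model, modest data
big_model_same_data = loss(N=1e10, D=2e9)    # 100x parameters, same data
big_and_more_data   = loss(N=1e10, D=2e11)   # scale data with the model

# Scaling parameters alone helps, but only scaling data alongside the
# model keeps compression (loss) improving toward the floor E.
assert big_and_more_data < big_model_same_data < small
```

This is why, past a point, piling on parameters without more data invites memorization rather than further compression.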
From another perspective, lossless restoration is a property of the algorithm, not something directly tied to K-complexity. In fact, lossless-restoration experiments show that such algorithms can always achieve the lossless goal. Essentially: lossless restoration = model + delta. The delta is the "abstraction tax" the model has paid: the details it did not remember and need not remember. In practice, powerful models yield smaller deltas; weaker models yield larger deltas. Lossless compression algorithms simply play this game. In application, model quality affects efficiency (compression ratio) but never losslessness. A delta of zero would mean the model remembered every detail, requiring the model to grow without bound or be equipped with massive external storage. The other extreme is an infinitely small model, or no model at all, degenerating the system into pure storage (a hard disk). Disregarding compression ratio, even white noise, with K(x) ≈ |x|, can still be restored exactly by a lossless compressor (such as ZIP).
In textbooks, K-complexity is defined as a measure of data's intrinsic structure ("the length of the shortest program that outputs the string") and is uncomputable in theory. Lossless compression is viewed as an engineering implementation for exact data restoration. Large models, as multiple scholars have verified, do dramatically improve lossless compression ratios, but that does not change lossless compression's nature as merely an engineering tool. Of course, the dramatically improved ratios also indicate that large models grasp the regularities of data distributions to an unprecedented degree. Read as a claim about complexity theory, however, lossless compression/restoration often misleads. Still, the high compression ratios achieved during lossless restoration are strong evidence of large models' intelligence, since no other knowledge system surpasses them.
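The "engineering tool" framing is easy to verify with an off-the-shelf lossless compressor: zlib restores any input exactly, structured or random, and only the compression ratio reveals how much regularity there was to grasp. The sample strings below are illustrative:

```python
import os
import zlib

# A highly regular string versus incompressible white noise of equal length.
structured = b"the cat sat on the mat " * 500
noise = os.urandom(len(structured))

# Losslessness never fails, for either input: it is an algorithmic property.
for data in (structured, noise):
    packed = zlib.compress(data, 9)
    assert zlib.decompress(packed) == data

# Only the ratio differs: structure compresses hugely, noise barely at all.
assert len(zlib.compress(structured, 9)) < len(structured) // 10
assert len(zlib.compress(noise, 9)) > len(noise) * 0.9
```

This mirrors the essay's point: the guarantee of exact restoration says nothing about intelligence, while the achieved ratio says everything about captured regularity.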
Additionally, this topic has a crucial temporal dimension. Compression targets historical data, while predictive applications point toward the future (models as prophets); restoration refers only to historical data. This means that even if lossless compression/restoration achieved the ultimate compression ratio, it would remain at a distance from true predictive capability, because a temporal wall stands between them. Crucially, the essence of intelligence favors predicting the future over restoring the past. Future prediction requires room for random sampling, while historical restoration kills precisely this beneficial randomness.
Powered by ScienceNet.cn
Copyright © 2007- China Science Daily (中国科学报社)