What We’re Reading (Week Ending 02 February 2025) - 02 Feb 2025
Reading helps us learn about the world and it is a really important aspect of investing. The late Charlie Munger even went so far as to say that “I don’t think you can get to be a really good investor over a broad range without doing a massive amount of reading.” We (the co-founders of Compounder Fund) read widely across a range of topics, including investing, business, technology, and the world in general. We want to regularly share the best articles we’ve come across recently. Here they are (for the week ending 02 February 2025):
1. DeepSeek: The View from China – Jordan Schneider, Irene Zhang, Angela Shen, and Yiwen
In this newsletter, we share a translation of insights from a January 26 closed-door session hosted by Shixiang 拾象, a VC spun out from Sequoia China. Attended by dozens of AI researchers, investors, and industry insiders, the event captures how the Chinese AI community is processing the DeepSeek shock…
…The CEO of Scale.ai said that DeepSeek has 50,000 chips, but that is definitely not reality. According to public information, DeepSeek had 10,000 old A100 chips and possibly 3,000 H800 cards before the ban. DeepSeek pays great attention to compliance and has not purchased any non-compliant GPUs, so it should have few chips. The way the United States uses GPUs is too extravagant…
…In the short-term, everyone will be driven to think about how to make AI more efficient. In the long-run, questions about computing power will remain. Demand for compute remains strong and no company has enough…
…Why did DeepSeek catch up so fast?
Reasoning models require high-quality data and training. For LLMs or multimodal AI, it’s difficult to catch up with a closed source model from scratch. The architecture of pure reasoning models hasn’t changed much, so it’s easier to catch up in reasoning.
One reason R1 caught up quickly was that the task was not particularly difficult. Reinforcement learning only made the model's choices more accurate. R1 did not break through the efficiency of Consensus 32 (majority voting across 32 samples), which spends roughly 32 times the compute of a single pass; this is equivalent to moving from deeper serial processing to parallel sampling, which is not pushing the boundaries of intelligence, just making the problem easier….
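For context on the "Consensus 32" reference: it denotes majority voting across 32 sampled answers. Below is a minimal sketch of that idea; the `sample_model` helper and the sample count are purely illustrative assumptions, not anything from DeepSeek's stack.

```python
from collections import Counter

def consensus_answer(sample_model, question: str, n_samples: int = 32) -> str:
    """Majority-vote "consensus" decoding: sample the model n_samples times
    and return the most common final answer. It buys accuracy by spending
    roughly n_samples times the compute of a single pass."""
    answers = [sample_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```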
…AI is similar to a step function, where the compute requirements for followers have decreased by a factor of 10. Followers have historically had lower compute costs, but explorers still need to train many models. The exploration of new algorithms and architectures will not stop. Behind the step function, there are significant investments by many people, meaning compute investments will continue to advance. Many resources will also be allocated to products. Apart from reasoning, there are other directions that are compute-intensive. While the vast amount of compute resources spent by explorers may not be visible, without such investment, the next “step” might not occur. Additionally, many are dissatisfied with current architectures and RL methods, and progress will continue.
When exploring directions, performance achieved with 10,000 GPUs may not always be significantly better than that of 1,000 GPUs, but there is a threshold somewhere. It’s unlikely that meaningful results can be achieved with only 100 GPUs because the iteration time for each solution would be too long…
…The question of why OpenAI and Anthropic did not do work in DeepSeek’s direction is a question of company-specific focus. OpenAI and Anthropic might have felt that investing their compute towards other areas was more valuable.
One hypothesis for why DeepSeek was successful is that unlike Big Tech firms, DeepSeek did not work on multimodality and focused exclusively on language. Big Tech firms’ model capabilities aren’t weak, but they have to maintain a low profile and cannot release too often. Currently, multimodality is not very critical, as intelligence primarily comes from language, and multimodality does not contribute significantly to improving intelligence…
…2025 will, first and foremost, see interest in new architectures beyond Transformers. Some initial exploration is already underway, aiming to reduce costs while pushing the boundaries of intelligence. Secondly, the potential of reinforcement learning (RL) has yet to be tapped into completely. On the product side, there is significant interest in agents, though they have yet to see widespread application…
…It is reported that Meta is still in the process of reproducing DeepSeek, but so far, this has not significantly impacted their infrastructure or long-term roadmap. In the long run, beyond exploring the boundaries of the technology, cost efficiency must also be considered. Lowering costs will let us have more fun…
…From the developer’s perspective, models like Claude-3.5-Sonnet have been specifically trained for tool use, making them highly suitable for agent development. In contrast, models like DeepSeek have not yet focused on this area, but the potential for growth with DeepSeek is immense…
…Currently, reinforcement learning (RL) solves problems with standard answers but has not achieved breakthroughs beyond what AlphaZero accomplished. In fact, it is often simpler. Distillation addresses problems with standard answers, and RL methods work effectively when training with such answers. This explains why distillation and RL have made rapid progress in recent years.
Humanity’s demand for intelligence is vastly underestimated. Many critical problems, such as cancer and SpaceX’s heat shield materials, remain unsolved. Existing AI primarily automates tasks, but there are numerous unsolved challenges ahead. Looking forward, the potential for explosive growth is immense, and the advancement of intelligence cannot stop…
…Domestic Chinese companies were previously constrained by computing power, but now it’s proven that the potential technical space is vast. For more efficient models, we might not need especially large cards; relatively customized chips can be supplied and adapted for compatibility with AMD hardware and ASICs. From an investment perspective, Nvidia’s moat is very high, but ASICs will have even greater opportunities.
The DeepSeek situation isn’t really about compute — it’s about America realizing China’s capabilities and efficiency. DeepSeek isn’t Nvidia’s vulnerability; Nvidia will grow as long as AI grows. Nvidia’s strength is its ecosystem, which has been built up over a long time. Indeed, when technology develops rapidly, the ecosystem is crucial. The real crisis comes, though, when technology matures like electricity: it becomes commoditized; then, everyone will focus on products, and many ASIC chips will emerge for specific scenario optimization…
…Open source controls the margins of the whole market. If open source can do 95% of what closed source can do and closed source is too expensive, then open source can be used completely. If the capabilities of open source and closed source do not differ greatly, then this presents a big challenge for closed source…
…AI explorers definitely need more computing power; China, as a follower, can leverage its engineering advantages. How Chinese large-model teams use less computing power to produce results, and thereby build real resilience, or even do better, may end up determining how the US-China AI landscape plays out.
2. Explaining International Valuations – Daniel Rasmussen
Perhaps the single greatest divergence in equity markets has been the continued outperformance of US versus international equities—and thus the widening of the valuation gap between the US and the rest of the world…
…By far the most significant difference, explaining about half the valuation gap, is the domicile of listing. US-listed stocks are substantially more expensive than internationally listed stocks for no reason other than the place of listing.
It’s particularly interesting that the regression shows having a higher percentage of sales in the US results in cheaper valuations. A key driver of this is that several of the US tech giants most responsible for high US equity valuations have a relatively low percentage of sales in the US (Alphabet, Microsoft, and Tesla at around 50%; Apple, Netflix, Meta, and NVIDIA at around 40%). The big question, then, is why half the valuation gap is explained simply by being listed on US exchanges. Even large internationally listed companies with >40% of their revenue coming from the US, like Toyota, Mitsubishi, Roche or Deutsche Telekom (which owns T-Mobile), trade at steep value multiples relative to US peers.
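To make the regression logic concrete, here is a minimal sketch of the kind of cross-sectional regression being described, using synthetic data; all variable names and coefficients are hypothetical assumptions, not Verdad's actual specification.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic cross-section of stocks: regress a valuation measure on
# fundamentals plus a US-listing dummy. Everything here is illustrative.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "us_listed": rng.integers(0, 2, n),        # 1 if listed on a US exchange
    "pct_us_sales": rng.uniform(0, 1, n),      # share of revenue from the US
    "roe": rng.normal(0.12, 0.05, n),          # a profitability control
})
# Build in a listing-venue premium: US-listed stocks get a lower earnings yield.
df["earnings_yield"] = (0.07 - 0.02 * df["us_listed"]
                        + 0.01 * df["roe"] + rng.normal(0, 0.01, n))

X = sm.add_constant(df[["us_listed", "pct_us_sales", "roe"]])
print(sm.OLS(df["earnings_yield"], X).fit().summary())
# A significant negative us_listed coefficient means US-listed stocks are more
# expensive (lower earnings yield) even after controlling for fundamentals.
```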
Were a larger percentage of the valuation gap explained by fundamentals, we’d expect such a gap to persist. But given that the valuation gap is primarily explained simply by the location of listing, we think there’s a strong reason to expect a convergence—and therefore to favor international over US-listed stocks, despite their terrible relative performance over the past decade.
3. The Most Impressive Prediction of All Time – Jeffrey Emanuel
My candidate for the most impressive prediction of all time came from a person who is practically unknown in the West except for a relatively small group of historians and people interested in niche subjects. The person I’m thinking of is named Pyotr Durnovo, and he was an Imperial Russian government official who lived from 1842 to 1915.
We will discuss more about him later, and how his life experience may have prepared him to make such an impressive prediction. The short version is that he initially studied to be in the Navy and served there for around a decade, then became the Director of Police for the Ministry of Internal Affairs for the entire Russian Empire under Tsar Alexander III. Later, he served as the Minister of the Interior under Tsar Nicholas II (the one who was ultimately executed with his family by the Bolsheviks in 1918, in the aftermath of the Russian Revolution).
So what is this prediction he made, anyway, and why is it so impressive? Well, in 1914, six months prior to the outbreak of World War I, Durnovo wrote a truly remarkable ~7,600-word memorandum for Tsar Nicholas II and his top two or three ministers. We know it was given to them, since it was found in Nicholas’ papers and later published in 1922 by communist historians after the revolution. If they had only read it carefully and taken its warnings more seriously, the world we live in today might look very different!…
…For one, it predicted an imminent war on the horizon, which he ultimately blamed on the collision course between England and Germany, the two greatest industrial powers at the time. This was certainly not some earth-shattering or special prediction; a lot of people foresaw some kind of big conflict, and it was often said that “war was in the air” at the time…
…It’s how he analyzed the situation, and then used that reasoning to predict the exact groupings of countries that would participate in the conflict and on which side, and how the situation would evolve from there, that is so impressive…
…His predictions about alliances and national behaviors were almost unbelievably specific and ran counter to the conventional wisdom of the time:
- He predicted that Italy would not side with Germany despite being part of the Triple Alliance, and would instead join the opposing side if victory seemed likely, seeking territory from both Austria and Turkey. This is exactly what happened; Italy joined the Allies in 1915 after negotiating for territorial concessions.
- He predicted that Romania would remain neutral until it was clear which side would win, then join the victorious side to claim territory. This also came true: Romania entered the war in 1916 on the Allied side after significant Russian successes.
- Most surprisingly, he predicted that Bulgaria would side against Serbia, and by extension against Russia, despite Russia being Bulgaria’s historic liberator from Ottoman rule, a prediction that seemed almost unthinkable to most observers at the time. This came true exactly as he foresaw, with Bulgaria joining the Central Powers in 1915.
- He correctly predicted that Serbia and Montenegro would side against Austria, while Greece would likely remain neutral until the outcome was more or less predetermined.
- He predicted unrest among Muslims in the Caucasus and Turkestan (which occurred).
- He predicted the possibility of Afghanistan moving against Russia (which happened in 1919).
- He predicted serious complications in Poland (the Polish-Soviet War of 1919-1921).
- He predicted an uprising in Finland if Sweden joined Germany (Finland did declare independence in 1917)…
…As if all of that weren’t already remarkable enough to get right, he went further, realizing that, regardless of who won, the war would lead to “social revolution” in both the defeated AND victorious countries, starting with the losing side and then spreading to the winners. This was perhaps his most extraordinary prediction, and it came true in spectacular fashion:
- Russia, despite being on the winning side, experienced the Bolshevik Revolution in 1917; we will go into much more detail about these predictions below.
- Germany, after losing the war, experienced the German Revolution of 1918-1919. Durnovo predicted that the unrest and revolution would be tied specifically to economic factors and class interests rather than purely political ones: he outlined how German workers would turn against the agricultural interests that had dominated pre-war German policy once defeat cut off their export markets and industrial employment, and this exact dynamic played out.
Now, you might object here that “Well, it’s not that crazy to believe there might be a revolution in a country which suffered massive losses in a catastrophic war; lots of people might have predicted that.” But the thing is, Durnovo went so far beyond merely predicting that there would be a Russian Revolution. He basically predicted every contour of the Revolution, the driving forces behind it, how it impacted different segments of Russian society, and how it would all unfold, step by step!…
…So how was Durnovo able to accomplish this incredible feat of prediction? Obviously, he was a genius of the first order, which is perhaps not so surprising given that he was a close relative of the famous Tolstoy family. But raw IQ is certainly not enough, nor is being well informed and knowledgeable. What kind of man could see so clearly what virtually everyone else missed? He was a complex character whose very contradictions likely enabled his extraordinary insights; he was, at the same time:
- A conservative police chief who often expressed liberal thoughts in private
- A supposed reactionary who opposed anti-Semitic measures and defended Jews
- A cynical operator who nevertheless would help others when he could
- A man capable of both strict officialdom and surprising gentleness
- A high official who preferred informal interactions (his subordinates would warn visitors not to address him as “Your Excellency”)
These contradictions suggest someone who wasn’t bound by conventional ideological frameworks or social expectations, a crucial trait for seeing beyond accepted wisdom. He also had a wide range of professional experience that prepared him to see things in a multi-faceted, sophisticated way, as by 1915, he had done the following:
- Naval officer (9 years of far-sea cruises)
- Military legal training
- Assistant Prosecutor in various parts of Russia
- Director of Police Department for 10 years
- Assistant Minister of Interior under multiple ministers
- Minister of Interior
- Member of State Council
This combination of experiences was extraordinary and atypical to say the least:
- His naval and legal background gave him insight into the military, maritime trade, and the Russian legal system.
- His prosecutorial work exposed him to conditions across Russia, not just in the big cities.
- His police work gave him unparalleled insight into social discontent and the strategies and thinking of professional revolutionaries like Lenin, Stalin, and Trotsky.
- His ministerial positions showed him the workings (and limitations) of state power.
He also occupied a unique position as both an insider and an outsider:
- He was from old nobility but not wealthy or particularly influential
- He reached high office but was temporarily dismissed in disgrace (a sordid story: Durnovo had his secret police officers search the private letters of a foreign ambassador, inside an embassy building no less, so they could steal love letters sent by Durnovo’s mistress to the ambassador; when the ambassador complained to Tsar Alexander III, the Tsar was furious and ordered his minister to “remove this swine within twenty-four hours”)
- He was a conservative who often disagreed with other conservatives
- He understood both state power and its limitations
This dual perspective may have freed him from the groupthink that afflicted both conservative and liberal circles.
4. USA, Inc – Michael Batnick
Consider this face blower of a stat from Goldman: “Since 1992, earnings growth in the US has outpaced earnings in non-US developed economies by an annual average of 2.4 percentage points.”
Most of the world is barely earning more than it was prior to the pandemic. The U.S. looks like an unstoppable freight train…
…The one-sided performance has driven the valuation gap between the U.S. and the rest of the world to record levels. We’ve all seen a version of these charts before…
…BUT! These charts aren’t comparing apples with apples. Goldman notes that only 1% of the U.K. market is in technology companies. Another example they cite: energy is 5% of S&P 500 earnings, 19% of the U.K.’s, and just 1% of Japan’s.
They did a great job adjusting for differences in sector weights…
…The U.S. still trades at a premium to the rest of the world ex-India, but not as much as the prior chart would have you believe. Before any adjustments, the Eurozone trades at a 39% discount to the U.S. And after the adjustments, that falls to 23%.
5. DeepSeek FAQ – Ben Thompson
Let’s work backwards: what was the V2 model, and why was it important?
The DeepSeek-V2 model introduced two important breakthroughs: DeepSeekMoE and DeepSeekMLA. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each.
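To make the MoE idea concrete, here is a minimal toy sketch of expert routing; the expert count, sizes, and top-k gating below are illustrative assumptions, not DeepSeek's or OpenAI's actual designs.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: a router scores experts per token and
    only the top-k experts actually run, so most parameters sit idle."""
    def __init__(self, dim: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)   # the gating network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        weights, chosen = gates.topk(self.top_k, dim=-1)    # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Usage: TinyMoE()(torch.randn(16, 64)) computes only 2 of 8 experts per token.
```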
DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.
DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.
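A rough back-of-the-envelope calculation shows why the key-value cache dominates inference memory and what compressing it can buy; every number below is an illustrative assumption, not DeepSeek's actual dimensions.

```python
# Illustrative KV-cache sizing for a long context (assumed model shape).
layers, heads, head_dim = 60, 128, 128
context_tokens, bytes_per_value = 128_000, 2    # 128K tokens, FP16 storage

# Standard attention caches a key and a value per head, per layer, per token.
kv_bytes = 2 * layers * heads * head_dim * context_tokens * bytes_per_value
print(f"full KV cache: {kv_bytes / 1e9:.0f} GB")          # ~503 GB

# A latent-attention scheme that caches one compressed latent per layer
# per token (latent size assumed) shrinks this dramatically.
latent_dim = 512
latent_bytes = layers * latent_dim * context_tokens * bytes_per_value
print(f"compressed cache: {latent_bytes / 1e9:.0f} GB")   # ~8 GB, ~64x smaller
```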
I’m not sure I understood any of that.
The key implications of these breakthroughs — and the part you need to understand — only became apparent with V3, which added a new approach to load balancing (further reducing communications overhead) and multi-token prediction in training (further densifying each training step, again reducing overhead): V3 was shockingly cheap to train. DeepSeek claimed the model training took 2,788 thousand H800 GPU hours, which, at a cost of $2/GPU hour, comes out to a mere $5.576 million.
That seems impossibly low.
DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:
Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
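The arithmetic in that passage is easy to verify; a trivial reproduction (the $2/GPU-hour rental rate is the paper's own assumption):

```python
# Reproducing the cost arithmetic from the V3 paper quote above.
pretrain = 2_664_000          # pre-training GPU hours
context_ext = 119_000         # context length extension
post_train = 5_000            # post-training
total_hours = pretrain + context_ext + post_train
print(total_hours)            # 2,788,000 H800 GPU hours
print(total_hours * 2)        # $5,576,000 at the assumed $2/GPU hour
```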
So no, you can’t replicate DeepSeek the company for $5.576 million.
I still don’t believe that number.
Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active expert are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exaflops, i.e. 3.97 billion billion FLOPS. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it’s a plausible number.
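A rough sanity check of that math, using the common ~6 FLOPs per parameter per token training estimate (an estimation rule of thumb, not a figure from the article):

```python
# Sanity check: are 2.788M H800-hours enough for 14.8T training tokens?
active_params = 37e9                       # parameters active per token (from the article)
tokens = 14.8e12                           # training tokens (from the article)
required = 6 * active_params * tokens      # ~6 FLOPs/param/token rule of thumb

cluster_flops = 3.97e18                    # 2048 H800s at FP8 (from the article)
wall_seconds = 2.788e6 / 2048 * 3600       # GPU-hours -> cluster wall-clock seconds
available = cluster_flops * wall_seconds

print(f"required:  {required:.2e} FLOPs")                 # ~3.3e24
print(f"available: {available:.2e} FLOPs")                # ~1.9e25
print(f"implied utilization: {required/available:.0%}")   # ~17%, well within reach
```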
Scale AI CEO Alexandr Wang said they have 50,000 H100s.
I don’t know where Wang got his information; I’m guessing he’s referring to this November 2024 tweet from Dylan Patel, which says that DeepSeek had “over 50k Hopper GPUs”. H800s, however, are Hopper GPUs; they just have much more constrained chip-to-chip interconnect bandwidth than H100s because of U.S. sanctions.
Here’s the thing: a huge number of the innovations I explained above are about overcoming the limited interconnect bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek had an excess of compute; that’s because DeepSeek programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is impossible to do in CUDA; DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.
Meanwhile, DeepSeek also makes their models available for inference: that requires a whole bunch of GPUs above-and-beyond whatever was used for training…
…Is this why all of the Big Tech stock prices are down?
In the long run, model commoditization and cheaper inference — which DeepSeek has also demonstrated — are great for Big Tech. A world where Microsoft gets to provide inference to its customers for a fraction of the cost means that Microsoft has to spend less on data centers and GPUs, or, just as likely, sees dramatically higher usage given that inference is so much cheaper. Another big winner is Amazon: AWS has by-and-large failed to make their own quality model, but that doesn’t matter if there are very high quality open source models that they can serve at far lower costs than expected.
Apple is also a big winner. Dramatically decreased memory requirements for inference make edge inference much more viable, and Apple has the best hardware for exactly that. Apple Silicon uses unified memory, which means that the CPU, GPU, and NPU (neural processing unit) have access to a shared pool of memory; this means that Apple’s high-end hardware actually has the best consumer chip for inference (Nvidia gaming GPUs max out at 32GB of VRAM, while Apple’s chips go up to 192 GB of RAM).
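A quick illustration of why lower memory requirements matter for edge inference: the memory needed just to hold model weights at various precisions (a rough rule-of-thumb calculation, ignoring the KV cache and runtime overhead).

```python
# Rough memory needed just to hold model weights at different precisions.
def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9   # bytes -> GB

for params in (7, 70, 180):
    sizes = {f"{bits}-bit": round(weight_gb(params, bits), 1) for bits in (16, 8, 4)}
    print(f"{params}B params: {sizes}")
# e.g. a 70B model needs ~140 GB at 16-bit but ~35 GB at 4-bit, which fits
# comfortably in 192 GB of unified memory but in no 32 GB gaming GPU.
```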
Meta, meanwhile, is the biggest winner of all. I already laid out last fall how every aspect of Meta’s business benefits from AI; a big barrier to realizing that vision is the cost of inference, which means that dramatically cheaper inference — and dramatically cheaper training, given the need for Meta to stay on the cutting edge — makes that vision much more achievable.
Google, meanwhile, is probably in worse shape: a world of decreased hardware requirements lessens the relative advantage they have from TPUs. More importantly, a world of zero-cost inference increases the viability and likelihood of products that displace search; granted, Google gets lower costs as well, but any change from the status quo is probably a net negative…
…How did DeepSeek make R1?
DeepSeek actually made two models: R1 and R1-Zero. I actually think that R1-Zero is the bigger deal…
…R1-Zero, however, drops the HF part — it’s just reinforcement learning. DeepSeek gave the model a set of math, code, and logic questions, and set two reward functions: one for the right answer, and one for the right format that utilized a thinking process. Moreover, the technique was a simple one: instead of trying to evaluate step-by-step (process supervision), or doing a search of all possible answers (a la AlphaGo), DeepSeek encouraged the model to try several different answers at a time and then graded them according to the two reward functions.
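A heavily simplified sketch of that recipe is below; the reward shapes, tag formats, sampling helper, and group size are illustrative assumptions, not DeepSeek's published details.

```python
import re
import statistics

def format_reward(text: str) -> float:
    """Reward outputs that wrap their reasoning in <think> tags (assumed format)."""
    return 1.0 if re.search(r"<think>.+</think>", text, re.S) else 0.0

def answer_reward(text: str, gold: str) -> float:
    """Reward exact-match final answers on questions that have standard answers."""
    m = re.search(r"<answer>(.+?)</answer>", text, re.S)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def grade_group(sample, question: str, gold: str, k: int = 8):
    """Sample k candidate answers, score each with both rewards, and return
    each answer's advantage over the group mean: the signal an RL update
    would use to reinforce better-than-average attempts."""
    outs = [sample(question) for _ in range(k)]
    rewards = [answer_reward(o, gold) + format_reward(o) for o in outs]
    mean = statistics.mean(rewards)
    return [(o, r - mean) for o, r in zip(outs, rewards)]
```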
What emerged is a model that developed reasoning and chains-of-thought on its own…
…Here again it seems plausible that DeepSeek benefited from distillation, particularly in terms of training R1. That, though, is itself an important takeaway: we have a situation where AI models are teaching AI models, and where AI models are teaching themselves.
Disclaimer: None of the information or analysis presented is intended to form the basis for any offer or recommendation. We currently have a vested interest in Alphabet (parent of Google), Apple, Meta Platforms, Microsoft, Netflix, and Tesla. Holdings are subject to change at any time.