What We’re Reading (Week Ending 09 July 2023) - 09 Jul 2023
Reading helps us learn about the world and it is a really important aspect of investing. The legendary Charlie Munger even goes so far as to say that “I don’t think you can get to be a really good investor over a broad range without doing a massive amount of reading.” We (the co-founders of Compounder Fund) read widely across a range of topics, including investing, business, technology, and the world in general. We want to regularly share the best articles we’ve come across recently. Here they are (for the week ending 09 July 2023):
1. Intellectual Laziness – Philo
The collapse of General Electric stands apart. GE was the bluest of the blue-chips: descended from Thomas Edison and J.P. Morgan, it was one of the original twelve components of the Dow in 1896, and grew to become one of the leading technology giants of the early 20th century. After WWII, GE evolved into an industrial behemoth with dominant positions in a dizzying array of electricity-adjacent markets, from jet engines and turbines to light bulbs and home appliances.
In the 1980s, GE ascended to new heights. Jack Welch took the reins as CEO in 1981, and he established GE as a major player in media and financial services while expanding GE’s leadership positions in its industrial markets. For most of the 1990s and 2000s, GE was the most valuable company in America, with a valuation topping out at over $1 trillion (as measured in current dollars). While GE had its skeptics and critics at the time, it was widely seen as a corporate paragon, regularly named by Fortune as the most admired company in the world. Welch was regarded as a management guru, and his underlings were routinely poached to become CEOs at other Fortune 500 companies.
And then, a few years ago, it all unraveled in spectacular fashion. Much of the supposed success from the Welch era of the 1980s and 1990s proved to be illusory, the product of temporary tailwinds and aggressive accounting. GE’s fortunes worsened under the reign of Welch’s handpicked successor, Jeff Immelt, who took over in 2001. Immelt struggled to cope with the problems he inherited, which were compounded by the 2008 financial crisis and major management missteps of his own. In 2017, when the extent of GE’s problems became clear, GE’s stock nose-dived, and Immelt was pushed out…
…Jack Welch had most of the traits we typically associate with a great executive. He was incredibly smart (earning his PhD in chemical engineering in only three years), he was demanding of his subordinates, and he worked tirelessly. He had deep operating experience, he was willing to buck convention, and he produced quantifiable results. He was charismatic, ambitious, and a world-class marketer and publicist. And yet, he will forever be remembered as the father of the biggest corporate disaster in American history…
…The story of the fall of GE is worthy of an authoritative book, and we looked at a pair of early entries a couple of years ago – Lights Out, written by the WSJ journalists that covered its fall, and Hot Seat, Jeff Immelt’s memoir.
Power Failure, weighing in at nearly 800 pages, is the most ambitious yet. The author, William Cohan, did an early-career stint as a junior analyst at GE Capital in the 1980s, before becoming an investment banker and then a business writer, putting him in a unique position to tell the GE story.
What sets Cohan’s effort apart is that he got almost everybody to talk to him for his book. He managed to interview both Jack Welch (before he passed away in 2020) and Jeff Immelt, and many former and current senior GE executives as well. Dozens of GE critics, counterparties, and journalists also weigh in throughout…
…Power Failure also doesn’t really offer an overarching theory of why GE failed. Power Failure lists many different things that went wrong at GE — bad management, bad acquisitions, bad incentives, bad accounting, bad luck — but almost all companies suffer from some of these issues without running into a GE-scale disaster. Maybe the failure of GE was the result of an unlucky confluence of individual problems, but it feels like for a group of smart, hard-working people to produce such an exceptionally catastrophic result, there must be a larger lesson to be drawn.
One possible clue comes from the story of David Cote, a star GE finance executive who rose to become the head of the Appliances division in the 1990s, and was one of five early candidates to succeed Jack Welch as the CEO of GE. However, he was eliminated before the three finalists were chosen, and he was asked to leave GE. It is suggested that Cote was doomed by the divisional assignment he drew; the finalists were the ones who had been assigned to oversee GE’s crown jewels, while he was stuck trying to fix a basket case.
Cote eventually landed a position in 2002 as the CEO of Honeywell, a much smaller industrial conglomerate – Cohan at one point refers to it as a “mini-GE”. Honeywell had been run since 1991 by Larry Bossidy, who before then had spent his career as a top executive at GE, a close associate of Jack Welch…
…Cote had an incredibly successful run at Honeywell, leading it until his retirement in 2017. While GE foundered, Honeywell soared. A $1,000 investment in Honeywell in 2003 would be worth over $9,000 today, while the same investment in GE would now be worth only $450. Remarkably, Honeywell managed to surpass GE in overall value as well: Honeywell’s current market capitalization is $140 billion, while GE is now worth less than $90 billion. GE is slated to be broken up, but as it stands today, it is nothing more than a mini-Honeywell.
This would seem to be the perfect natural experiment. A GE cast-off takes over a small company run by Jack Welch’s former right-hand man, and turns it around and surpasses GE. What did Cote do so differently from Welch, Immelt, and Bossidy, to get such a spectacular result?…
…What is Cote’s diagnosis of the root problems at Honeywell? Cote opens the book by telling the story of an internal meeting at the beginning of his tenure, a business review of Honeywell’s Aerospace division. The head of Aerospace was steeped in the old culture, and had even been a candidate for the CEO job that Cote won. The meeting does not start well:
We sat down in a conference room so that team members could present their strategic plan to me. A copy of the plan had been placed on the table facing each seat. Flipping through mine, I saw that it was thick–maybe 150 pages long, full of charts and tables. Uh oh, I thought, not good. I had found so far at Honeywell that executives and managers often made presentations far longer than necessary, overwhelming audience members with facts, figures, and commentary to preempt sharp, critical questioning.
Nevertheless, Cote interrupts them with sharp, critical questions. The Aerospace team responds with annoyance — they had planned to put on a show and receive a pat on the back — but Cote interrogates them about the root cause of the $800 million in cost overruns on their biggest project. The team eventually relents and agrees to probe the root causes of their biggest issues, and they turn the ship around. Cote concludes (emphasis mine):
What I learned, to my chagrin, was that Aerospace had become adept at lying to itself, shoehorning costs here and there into a budget without acknowledging them openly. This put enormous strain on the organization, which then had to patch together aggressive bookkeeping and special deals with customers and others, to make its goals. A dysfunctional approach if I’d ever seen one.
Cote says that this approach was pervasive at Honeywell:
Lacking any drive to think deeply about their businesses, and unchallenged by leadership to do so, teams held meetings that were essentially useless, their presentations clogged up with feel-good jargon, meaningless numbers, and analytic frameworks whose chief purpose was to hide faulty logic and make the business look good. When you did a bit of digging, you found that most executives didn’t understand their businesses very well, or even at all.
Cote defines this as intellectual laziness. It is the tendency of organizations to “juke the stats” and lie to themselves instead of diagnosing and solving root problems. This kind of anecdote is everywhere in Power Failure; recall Steve Burke’s appraisal that GE “never had the intellectual curiosity or the drive” to understand and manage NBCU…
…GE Capital was central to GE’s ability to manipulate reported earnings. Accounting rules allow a company to book a profit whenever they sell an asset for more than they paid for it. In the course of their normal business, GE Capital owned hundreds of billions of dollars of assets, like bonds and office buildings and parking lots (which they funded with short-term and long-term borrowings). Over time, real assets tend to appreciate, at least in nominal terms. Whenever GE was having a bad quarter, they would sell some of these appreciated assets–say, an office building that was bought decades ago for $10 million that was now worth $20 million–and report the $10 million accounting profit as a part of regular earnings, to compensate for the earnings shortfall from the core business. As former GE Capital CEO Gary Wendt put it in Power Failure:
I always had a lot of [asset sales] available for the quarter. I had to because I knew [Jack Welch] was going to call up and say, “I need another $1 million or another $2 million or whatever,” and so I’d go over to [GE Capital CFO James] Parke and I’d say, “Okay, let’s do this one and this one.” Making your earnings was just life to us.
This kind of one-time accounting gain from asset sales is fundamentally different in nature from operating profits from selling jet engines and power turbines. The $20 million office building was already worth $20 million before GE sold it, despite being on the books for $10 million; selling it converts it to cash but does not make shareholders any wealthier (in fact, by triggering a tax bill, it can make them worse off), despite the accounting profit that gets booked. Bundling these kinds of accounting gains with normal operating results only serves to obscure the true performance of the business from investors.
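To make the distinction concrete, here is a back-of-the-envelope sketch using the excerpt’s own $10 million / $20 million office-building numbers; the 25% tax rate is our illustrative assumption, not a figure from the book.

```python
# Toy numbers from the office-building example above; the 25% tax rate is assumed.
book_value = 10_000_000     # purchase price decades ago, still on the books
market_value = 20_000_000   # what the building is actually worth today
tax_rate = 0.25             # assumed, for illustration only

accounting_gain = market_value - book_value   # reported as "earnings" this quarter
tax_bill = accounting_gain * tax_rate
wealth_change = 0 - tax_bill                  # shareholders owned a $20m asset either way,
                                              # but selling it triggers the tax

print(f"reported accounting gain:     ${accounting_gain:,.0f}")
print(f"change in shareholder wealth: ${wealth_change:,.0f}")
```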
Regardless, most of the senior GE executives who talked to Cohan continued to stand behind the practice of earnings smoothing:
Over lunch at a Connecticut pub, Denis Nayden, who was one of Wendt’s successors at GE Capital, also defended the practice of harvesting GE Capital assets as needed. “What’s the job of a public company?” he asked rhetorically. “Produce earnings for shareholders.”
“The job of a public company is to produce earnings for shareholders” is a hell of a thing for the former chairman of GE Capital to be saying after the collapse of GE. If you ask GE’s investors, they would say the job of a public company is to make money for shareholders. GE was among the best at consistently “producing earnings” for shareholders; they did so for decades. They were just abysmal at making money.
There is a plethora of ways to produce short-term earnings without making money, and GE somehow seemed to engage in all of them. You can sell appreciated assets to record an accounting profit. You can overpay for assets with high current earnings and poor long-term prospects. You can sell power equipment to Angola on credit, with little hope of ever getting paid in cash. You can book immediate paper profits from the long-tail insurance policies you sell today, and then find out two decades later that your assumptions were too optimistic and you have to come up with $15 billion of cash to plug the gap. There are no magic metrics, and GAAP earnings are as subject to Goodhart’s Law as any other measure.
According to Power Failure, almost every time GE made a major decision that destroyed shareholder value, the obsession with manipulating earnings was front and center in the thought process. GE lost a lot of money in insurance, but why was a manufacturing company in the insurance business in the first place? Well, insurance companies offer a lot of accounting leeway, in terms of the way reserves are taken and assets are sold for profit, and could act as “shock absorbers” that let Jack Welch report smooth earnings when other divisions stumbled.
Why did GE Capital recklessly allow itself to become dependent on funding from short-term commercial paper, a practice that would almost bankrupt it in 2008? Well, short-term borrowing lowers interest expense, which boosts short-term earnings.
Why did GE buy a subprime mortgage broker in 2004? They had just spun off their insurance business, and Immelt felt they needed to replace the earnings that the insurance business had previously generated.
Why did GE keep expanding GE Capital? Well, it was a good way to increase earnings. Why didn’t GE sell out of noncore businesses like real estate and media when valuations were sky-high in the mid-00s? GE didn’t want to lose the earnings those divisions produced. The catastrophic 2015 acquisition of Alstom? Immelt thought the synergies would definitely increase earnings. The mistimed $40 billion stock buyback in 2015? Jeff Immelt decided on a $2 per share earnings target, and wanted to do anything he could to hit that goal. Never in Power Failure does it seem like GE management gave any thought to shareholder value when making major decisions: it was always earnings, earnings, earnings.
Even putting aside the obsession with reported earnings, GE’s culture seems to have been generally lacking in intellectual rigor. GE’s strategies were supported by narratives that sounded compelling at a superficial level, but fell apart under any kind of scrutiny.
A classic example: Jack Welch liked to tell everyone that his brilliant insight about expanding into finance was that it had higher revenue per employee than industrial manufacturing, thus it must be a better business to be in. Of course, that is nonsense: there is no reason to expect there to be any relationship between revenue per employee and return on invested capital.
Welch told this story even after GE learned this lesson the hard way in the 1980s, overpaying to acquire Kidder Peabody, a venerable investment banking firm (investment banking being perhaps the highest revenue per employee business that exists), a deal that was an endless source of trouble, and ultimately led to a $2 billion loss when GE finally got rid of it in 1995. (Cohan discovers when talking to a former director that Welch managed to prevent this massive loss from affecting reported earnings by raiding the reserves of the insurance business.)
Return on invested capital is mostly determined by factors like barriers to entry and sustainable competitive advantage, which GE’s industrial businesses had in spades but which GE Capital completely lacked — after all, money is a commodity. After the financial crisis, GE Capital’s return on invested capital collapsed not because revenue per employee declined, but because GE Capital’s lenders and regulators came to understand the true risk inherent in the business, and demanded higher rates, lower leverage, and closer oversight.
As GE placed no value on intellectual rigor, it is no surprise that they ended up promoting executives on the basis of polish and storytelling ability. So it was that when it came time to pick a new CEO, Welch elevated Jeff Immelt, a slick-talking salesman with little understanding of GE’s businesses and little patience for details, and dismissed David Cote, who would go on to have so much success at Honeywell.
It is not clear that GE’s decision-making process was any worse under Immelt than it was under Welch. Immelt would be skewered by accusations that he encouraged “success theater”, a culture where executives never confronted root problems and pretended everything was going well, but the culture of extreme intellectual laziness certainly dated back to his predecessor. In fact, Welch’s best-selling autobiography was subtitled “Straight from the Gut”.
It would be technically accurate to state that the dramatic collapse of GE resulted from a perfect storm of mistakes — wrong CEO, bad investments, strategic missteps, operational snafus. But underlying all of those seemingly unrelated mistakes was one thing: this culture of intellectual laziness, the willingness to juke the stats and tell comforting stories rather than diagnose and solve root problems. GE failed to create shareholder value because they didn’t really try to create shareholder value; they were content to be able to report some shiny meaningless numbers and a pleasant narrative…
…At this point, we have to ask: how does one identify management teams that demand intellectual rigor, and avoid management teams that are intellectually lazy?
The answer is simple, but not easy. In each example we presented here, the intellectually lazy managers are actually initially exposed when they present their story to a knowledgeable audience. To be sure, they are able to assemble a narrative that sounds convincing to a layman, peppered with vanity metrics and impenetrable business-speak.
However, the narrative is usually all form and no substance, pure business theater. It leans heavily on rhetorical tricks: accounting chicanery employed to meet previously announced financial targets might be rationalized as “exceptional dedication to meeting our public commitments”. (The implication being that if you don’t fudge the numbers, maybe you’re just the type of person that doesn’t take their commitments seriously.)
Nonsense axioms are invented out of thin air – recall the continued insistence of former GE executives that companies must consistently announce growing earnings, in the face of the evidence that most successful companies did no such thing.
Then there is the midwit appeal to complexity: anyone who argues that the narrative is a convoluted, illogical mess is accused of being an ignorant simpleton who is incapable of grasping such sophistication and brilliance.
The intellectually lazy narrative always contains these sorts of logical gaps. When confronted about these inconsistencies, managers respond with plausible-sounding non sequiturs, answers that might be accepted by a novice but would never pass muster with anyone with real expertise.
In the case of GE, experienced analysts knew that an inherently cyclical business could not produce perfectly smooth metrics, and they also realized that GE Capital’s reliance on cheap short-term funding was not sustainable — points they raised at the time. At Honeywell, David Cote immediately identified the flaws in the stories that his underlings were telling, and called them out.
2. Value of BRK Float, Buffett Market View etc. – The Brooklyn Investor
For example, it is true that BRK only owns $328 billion in stocks against $500 billion in equity. This looks bearish, compared to say, back in 1994/1995 as you see. That looks like equity exposure of only 66% or so.
But as we all know, BRK has been buying a lot of operating businesses. For example, Burlington Northern now is a wholly owned subsidiary. Owning 100% of something is no less ‘equity exposure’ than owning just some of the stock. Right? So our equity exposure is much higher than 66% if you include all the other operating businesses. What is that number? Let’s say we include equity method investments (which is clearly equity) of $26 billion, and the book value of the Rails, Utilities and Energy business of $140 billion. That’s $166 billion. Add that to the $328 billion stock portfolio and you get $494 billion. And this doesn’t include some stuff in the “Insurance and other” (where I assume manufacturing, services and retail is), and we are already pretty much at 100% equity exposure. That, to me, is as good as “fully invested”.
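For readers who want to check the arithmetic, here is a quick sketch using the blogger’s round figures (all in billions of dollars):

```python
# All figures are the blogger's round numbers from the excerpt above, in billions.
stocks = 328             # listed stock portfolio
equity_method = 26       # equity-method investments
rail_util_energy = 140   # book value of the Railroad, Utilities and Energy businesses
shareholders_eq = 500    # Berkshire's total shareholders' equity

equity_exposure = stocks + equity_method + rail_util_energy
print(f"equity exposure: ${equity_exposure}b, "
      f"~{equity_exposure / shareholders_eq:.0%} of shareholders' equity")
# -> equity exposure: $494b, ~99% of shareholders' equity
```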
How is that bearish? It’s not, actually. Bearish is if you take all those businesses / stocks and actually sell them down so your actual net equity exposure to all businesses is way below your shareholders’ equity. If you tell me that the above $494 billion is actually $250 billion, and the rest is cash, then I would agree BRK is waiting for the end of the world.
As it stands now? Not at all…
…This is the sort of thing that Buffett would hate because I am going to tell you what he is thinking, and I will do so without having any idea. So, having said that…
Rates are now back up to over 5% on the short end, and almost 4% on the long end (10 year). What does Buffett think of interest rates? Well, he won’t tell you. He will probably tell you he thinks long rates are too low and that it can’t stay low forever, but that’s all.
But let’s see what he is doing to see what he thinks of interest rates. With the long end approaching 4%, does Buffett think bonds are interesting?
Below, I went back through the recent 10-K’s (when you get old, even going back 25 years is recent, lol…) and jotted down the cash and fixed income investments at BRK. This way, we can actually see when he started to get allergic to long term bonds, and then we can see if he is getting interested again.
First of all, I can tell you that fixed income on BRK’s balance sheet has stayed steadily in the $20-billion range, despite net worth, cash etc. increasing over the years. Spoiler alert: in the 2023 10-Q, this is still $23 billion, so he is not expressing any interest in bonds yet…
…So when did Buffett start to get away from long bonds? It is clear from the above table that he really started to dislike them in 2003. There is a clear pivot in that year, when cash rose a lot and fixed income investments went down. He seemed fine with bonds in 2001 and 2002, when they were around 5% or so…
…So it is clear that Buffett started to really dislike bonds when rates started to go below 5%. I was going to argue that 4% is the level, but rates were above 4% for a few years after 2003 and Buffett didn’t bite; fixed income levels remained low, which seems to suggest that 5% is the level he won’t accept anything below. The slight rise in this during the financial crisis could be from the emergency financing he did for GE, BAC and others, but I didn’t check. I think those were factors other than the general level of interest rates, so we can ignore that rise in bond holdings during that period.
So, reasonably or unreasonably, I am going to assume that 5% is the point Buffett won’t go below for long term rates.
3. The Full Story of Large Language Models and RLHF – Marco Ramponi
Language Models (LMs) are a class of probabilistic models explicitly tailored to identify and learn statistical patterns in natural language. The primary function of a language model is to calculate the probability that a word succeeds a given input sentence.
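To make “calculate the probability that a word succeeds a given input sentence” concrete, here is a minimal sketch using the Hugging Face transformers library and the small, publicly available GPT-2 checkpoint (our choice of model for illustration, not one discussed in the article):

```python
# A minimal next-word-probability sketch with a pre-trained GPT-2 (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]         # scores for whatever word comes next
probs = torch.softmax(next_token_logits, dim=-1)

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>10s}  {p.item():.3f}")
```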
How are these models trained to do this? The core process is a general technique known as self-supervised learning, a learning paradigm that leverages the inherent structure of the data itself to generate labels for training.
In the context of natural language processing, self-supervised learning enables models to learn from unannotated text, rather than relying on manually labeled data, which is relatively scarce and often expensive.
During the training process, an LM is fed with a large corpus (dataset) of text and tasked with predicting the next word in a sentence. In practice, this is often achieved by randomly truncating the last part of an input sentence and training the model to fill in the missing word(s). As the model iterates through numerous examples, it learns to recognize and internalize various linguistic patterns, rules, and relationships between words and concepts. One can say that via this process the model creates an internal representation of language.
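A toy sketch of that objective, with random “tokens” and a bare embedding layer standing in for a full network (purely illustrative):

```python
# Self-supervised next-token prediction: the labels are just the input shifted by one.
import torch
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
embedding = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

# Pretend this is a tokenized sentence drawn from the training corpus.
tokens = torch.randint(0, vocab_size, (1, 12))

inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t+1 from tokens up to t
hidden = embedding(inputs)                        # a real LM would apply Transformer blocks here
logits = lm_head(hidden)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # gradients nudge the parameters
print(f"next-token cross-entropy: {loss.item():.3f}")
```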
The outcome of this training process is a pre-trained language model. By exposure to diverse linguistic patterns, the model is equipped with a foundation for understanding natural language and for generating contextually appropriate and coherent text. Some people refer to such pre-trained models as foundation models…
…How good can a language model become?
As it turns out, the effectiveness of LMs in performing various tasks is largely influenced by the size of their architectures. These architectures are based on artificial neural networks, which are computational models loosely inspired by the structure and functioning of biological neural networks, such as those in the human brain. Artificial neural networks consist of interconnected layers of nodes, or “neurons” which work together to process and learn from data.
Neurons in the network are associated with a set of numbers, commonly referred to as the neural network’s parameters. The numerical value of these parameters is supposed to represent the strength of connections between different neurons. The parameters within a neural network are adjustable, and they get iteratively updated during the training process to minimize the difference between the model’s predictions and the actual target values.
In the context of LMs in particular, larger networks with more parameters have been shown to achieve better performance. Intuitively, the more parameters, the greater their “storage capacity”, even though it should be noted that language models do not store information the way conventional computer storage (such as a hard drive) does.
Essentially, a higher number of parameters allows the model to “internalize” a greater variety of statistical patterns (via the numerical relationships of its parameters) within the language data they are exposed to. Larger models, however, also require more computational resources and training data to reach their full potential.
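As a tiny illustration of what “parameters” means concretely, here is a sketch that counts them for a toy two-layer network; frontier LLMs differ mainly in scale, with counts running into the billions:

```python
# Counting the adjustable numbers (weights and biases) in a toy network.
import torch

toy_model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),   # 128*256 weights + 256 biases
    torch.nn.ReLU(),
    torch.nn.Linear(256, 128),   # 256*128 weights + 128 biases
)

n_params = sum(p.numel() for p in toy_model.parameters())
print(f"{n_params:,} parameters")   # 65,920 here; GPT-3 has roughly 175 billion
```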
A language model is more than just a neural net.
Modern language models comprise various components or blocks, often formed by different neural networks, each designed to perform specific tasks and featuring specialized architectures. Virtually all current LMs are based on a particularly successful choice of architecture: the so-called Transformer model, invented in 2017.
Starting from the field of Natural Language Processing (NLP), Transformers have been revolutionizing nearly all areas of applied AI, due to their efficiency at processing large chunks of data at once (parallelization) rather than sequentially, a feature that allowed for training on bigger datasets than earlier architectures could handle. On text data, Transformers have proved exceptionally good at carrying out a form of natural language contextual understanding, which has made them the de facto standard choice for most NLP tasks today. Two components are key to this success: the attention mechanism and word embeddings.
- Word Embeddings are high-dimensional vector representations of words that capture their semantic and syntactic properties. These representations enable the model to numerically manipulate words in a mathematical space, a sort of semantic space, where physically nearby words share some form of relationship of meaning or other kinds of similarities. Instead of treating words as isolated entities, word embeddings allow the model to learn and understand the complex interplay of words within a given context.
- Attention Mechanisms allow the model to weigh the importance of different words or phrases in the text. This helps the model to selectively focus on specific parts of the input, assigning different attention scores to the words based on their relevance to the task at hand. Attention can be thought of as a numerical operation that is supposed to mimic the “focusing ability” of a model to the local, specific context as it reads through or generates text…
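For the technically curious, here is a minimal, self-contained sketch (toy sizes, random weights) of the two ingredients just described:

```python
# Word embeddings plus scaled dot-product attention, in miniature.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 16
torch.manual_seed(0)

# Word embeddings: each token id maps to a vector in a shared "semantic space".
embed = torch.nn.Embedding(vocab_size, d_model)
tokens = torch.tensor([[5, 42, 7, 19]])            # a pretend 4-word sentence
x = embed(tokens)                                  # shape: (1, 4, d_model)

# Attention: every position scores every other position and takes a weighted mix.
W_q, W_k, W_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5  # relevance of each word to each other word
weights = F.softmax(scores, dim=-1)                # attention weights sum to 1 per position
context = weights @ v                              # context-aware representation of each word

print(weights[0].detach())                         # how much each word "attends" to the others
```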
…Prevailing heuristics long held that increasing the size of a model was the most effective way to improve its performance, while scaling the training datasets was less important. However, more recent research has radically reshaped this perspective, revealing that many of the current LLMs are, in fact, significantly undertrained with respect to the amount of data seen during pre-training.
This fundamental shift has led to the formation of a new set of guiding heuristics, emphasizing the importance of training large models with more extensive datasets. In practice, in order to fully train the next massive LLM following these new principles one would need an immense amount of data, corresponding to a significant fraction, if not all of the text data available on the entire internet today.
The implications of this new perspective are profound. On the one hand, the total amount of training data actually available might turn out to be the true fundamental bottleneck for these AI systems…
…Scaling language models yields more than expected.
With scaling, the performance of LLMs has (predictably) shown consistent improvements across a number of quantitative metrics that are supposed to measure to what extent an LM is able to do what it was primarily designed for: calculate probability distributions over words. One such metric is perplexity, a measure of the fluency of generated text.
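As a quick illustration of the perplexity metric just mentioned (the probabilities below are toy numbers of our own choosing):

```python
# Perplexity is the exponential of the average per-token negative log-likelihood,
# so lower is better and 1.0 would mean perfect prediction.
import math

# Pretend these are the probabilities a model assigned to each actual next token.
token_probs = [0.20, 0.05, 0.60, 0.10]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")   # ~6.4 for these toy numbers
```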
We have seen, however, how the process of scaling language models requires training them on enormous quantities of data, often sourced from the extensive troves of text available online. LLMs thus get to be “fed” with substantial portions of the web, spanning a vast array of information. Being exposed to such a diverse range of linguistic patterns and structures during training, LLMs progressively learn to emulate and reproduce these patterns with high fidelity.
As a byproduct, this process appears to have given rise to fascinating qualitative behaviors. Empirical studies have found that, as LLMs are scaled, they are able to suddenly “unlock” new capabilities that seem to emerge in a discontinuous manner, in contrast to the more predictable linear improvement of quantitative metrics.
These emergent abilities encompass a wide range of tasks, such as translation between different languages, the ability to write programming code, and many others. Remarkably, LLMs acquire these skills through the mere observation of recurring patterns in natural language during the training process, that is, without explicit task-specific supervision…
…The phenomenon of emergent abilities in LLMs, although quite recent and still not fully understood by researchers, is also not a completely obscure one.
Even though there is no way to predict exactly which new cognitive capabilities further-scaled LLMs may acquire in the future, the general pattern that allows this to happen is fairly clear. Let’s consider the example of Question-Answering.
Within this massive language dataset, the internet of text, there exist numerous instances of questions followed by answers. These question-answer pairs occur in diverse contexts, such as forums, articles, or educational resources, and cover a multitude of topics, from everyday trivia to specialized technical knowledge.
Ultimately, a statistically significant number of these answers is in fact correct, and this is reflected in the ability of an LLM to carry out a form of information retrieval from web knowledge, by giving reasonably correct answers to common sense questions on disparate topics when requested to do so.
Unfortunately, the internet is also filled with (a statistically significant amount of) false facts and wrong answers to common sense questions. Due to the sheer volume of this data, it is virtually impossible for the researchers to regulate the content LLMs are exposed to during training.
As a matter of fact, LLMs may occasionally exhibit various types of undesirable behavior, such as reproducing harmful or biased content, or generating so-called hallucinations by fabricating nonexistent or false facts.
When such models are proposed as general purpose conversational chatbots (like ChatGPT), it becomes a lot more difficult to identify all the possible threats that arise from a mass use of these systems, since it is almost impossible to predict a priori all the possible scenarios…
…Can a machine learn human values?
Fundamentally, RLHF is based on a straightforward premise. Imagine having two language models: a baseline (unaligned) model and a secondary preference model. The preference model’s role is to determine which action a human would prefer within a given list of possibilities (e.g., two different responses from the baseline model to a user’s request). This model could assign a numerical score to each action, effectively ranking them according to human preferences. In technical terms, this is known as a reward model.
Utilizing the reward model, the baseline model can be refined iteratively, altering its internal text distribution to prioritize sequences favored by humans (as indicated by the reward model). In some sense, the reward model serves as a means to introduce a “human preference bias” into the baseline model…
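Here is a toy sketch of that reward-model idea: score two candidate responses and nudge the scorer so the human-preferred one comes out higher. The pairwise loss is the standard formulation for training on preference comparisons; the feature vectors and the single linear layer are stand-ins of our own, not anyone’s production setup.

```python
# Training a toy reward model on one human preference comparison.
import torch
import torch.nn.functional as F

d = 32
reward_model = torch.nn.Linear(d, 1)   # stand-in for a full neural network

preferred = torch.randn(1, d)          # features of the response the human chose
rejected = torch.randn(1, d)           # features of the response they passed on

r_preferred = reward_model(preferred)
r_rejected = reward_model(rejected)

# Push the preferred score above the rejected score.
loss = -F.logsigmoid(r_preferred - r_rejected).mean()
loss.backward()
print(f"pairwise preference loss: {loss.item():.3f}")
```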
…OpenAI has applied the general methodology of RLHF to fine-tune ChatGPT through a three-step process.
The initial step involves collecting human demonstrations using a group of about 40 human annotators for a pre-selected set of prompts. The prompts are sourced from two different origins: some are created by annotators or developers, while others are sampled from OpenAI’s API requests.
These demonstrations can be thought of as the “ideal answers”, or responses to these prompts, and together constitute a training dataset. This dataset is then used to fine-tune a pre-trained model in a supervised manner, yielding the Supervised Fine-Tuned (SFT) model.
As mentioned earlier, this approach has scalability limitations, resulting in a relatively small dataset (approximately 15k examples).
The second step revolves around preference orderings. Labelers (or annotators) are tasked with voting on a number of SFT model outputs, thereby creating a new dataset composed of comparison data. The reward model is trained on this dataset.
In practice, a list of prompts is chosen, and the SFT model generates multiple outputs (between 4 and 9) for each prompt. Annotators rank these outputs from best to worst, forming a new labeled dataset with rankings serving as labels.
Although the exact details remain undisclosed by OpenAI, the dataset’s size may be roughly ten times larger than the curated dataset used for the SFT model.
Finally, the third step involves applying Reinforcement Learning to teach the SFT model the human preference policy through the reward model, essentially as described in the previous section. The SFT model is fine-tuned via the reward model. The outcome is the so-called policy model…
…As we have previously discussed, by treating the language model as a reinforcement learning policy during the fine-tuning phase, RLHF introduces biases into the distribution.
Operationally, we can interpret this effect as the introduction of a mode-seeking behavior which guides the model through the distribution and leads to outputs with higher rewards (as modeled by learned human preferences), effectively narrowing the potential range of generated content…
…While RLHF improves the consistency of the model’s answers, it inevitably does so at the cost of diversity in its generation abilities. This trade-off could be viewed as both a benefit and a limitation, depending on the intended use case.
For instance, in LLM applications such as search engines, where accurate and reliable responses are paramount, RLHF is an ideal solution. On the other hand, when using language models for creative purposes, such as generating novel ideas or assisting in writing, the reduction in output diversity may hinder the exploration of new and intriguing concepts.
4. Why transformative AI is really, really hard to achieve – Arjun Ramani and Zhengdong Wang
Humans have a good track record of innovation. The mechanization of agriculture, steam engines, electricity, modern medicine, computers, and the internet—these technologies radically changed the world. Still, the trend growth rate of GDP per capita in the world’s frontier economy has never exceeded three percent per year.
It is of course possible for growth to accelerate. There was a time before growth began, or at least when it was far closer to zero. But the fact that past game-changing technologies have yet to break the three percent threshold gives us a baseline. Only strong evidence should cause us to expect something hugely different.
Yet many people are optimistic that artificial intelligence is up to the job. AI is different from prior technologies, they say, because it is generally capable—able to perform a much wider range of tasks than previous technologies, including the process of innovation itself. Some think it could lead to a “Moore’s Law for everything,” or even risks on par with those of pandemics and nuclear war. Sam Altman shocked investors when he said that OpenAI would become profitable by first inventing general AI, and then asking it how to make money. Demis Hassabis described DeepMind’s mission at Britain’s Royal Academy four years ago in two steps: “1. Solve Intelligence. 2. Use it to solve everything else.”…
…Neither this essay nor the economic growth literature rules out this possibility. Instead, our aim is to simply temper your expectations. We think AI can be “transformative” in the same way the internet was, raising productivity and changing habits. But many daunting hurdles lie on the way to the accelerating growth rates predicted by some…
…Productivity growth almost definitionally captures when a new technology efficiently performs useful work. A powerful AI could one day perform all productive cognitive and physical labor. If it could automate the process of innovation itself, some economic growth models predict that GDP growth would not just break three percent per capita per year—it would accelerate.
Such a world is hard to achieve. As the economist William Baumol first noted in the 1960s, productivity growth that is unbalanced may be constrained by the weakest sector. To illustrate this, consider a simple economy with two sectors, writing think-pieces and constructing buildings. Imagine that AI speeds up writing but not construction. Productivity increases and the economy grows. However, a think-piece is not a good substitute for a new building. So if the economy still demands what AI does not improve, like construction, those sectors become relatively more valuable and eat into the gains from writing. A 100x boost to writing speed may only lead to a 2x boost to the size of the economy.
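A back-of-the-envelope sketch of how a 100x boost in one sector can translate into only a roughly 2x larger economy, assuming the two sectors are poor substitutes (here, a CES aggregate with an elasticity of substitution of 0.5, i.e. the harmonic mean; the specific elasticity is our assumption, not the authors’):

```python
# Baumol in miniature: output is a CES aggregate of two sectors that are poor substitutes.
def economy_output(writing, construction):
    # Equal-weight CES aggregator with elasticity of substitution 0.5 (harmonic mean).
    return 1.0 / (0.5 / writing + 0.5 / construction)

baseline = economy_output(1.0, 1.0)
ai_boost = economy_output(100.0, 1.0)   # AI makes writing 100x more productive

print(f"baseline output: {baseline:.2f}")
print(f"with 100x writing productivity: {ai_boost:.2f} (~{ai_boost / baseline:.1f}x the economy)")
```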
This toy example is not all that different from the broad pattern of productivity growth over the past several decades. Eric Helland and Alex Tabarrok wield Baumol in their book Why Are the Prices So Damn High? to explain how technology has boosted the productivity of sectors like manufacturing and agriculture, driving down the relative price of their outputs, like TVs and food, and raising average wages. Yet TVs and food are not good substitutes for labor-intensive services like healthcare and education. Such services have remained important, just like constructing buildings, but have proven hard to make more efficient. So their relative prices have grown, taking up a larger share of our income and weighing on growth…
…Progress in fine motor control has hugely lagged progress in neural language models. Robotics workshops ponder what to do when “just a few cubicles away, progress in generative modeling feels qualitatively even more impressive.” Moravec’s paradox and Steven Pinker’s 1994 observation remain relevant: “The main lesson of thirty-five years of AI research is that the hard problems are easy and the easy problems are hard.” The hardest “easy” problems, like tying one’s shoelaces, remain. Do breakthroughs in robotics easily follow those in generative modeling? That OpenAI disbanded its robotics team is not a strong signal.
It seems highly unlikely to us that growth could greatly accelerate without progress in manipulating the physical world. Many current economic bottlenecks, from housing and healthcare to manufacturing and transportation, all have a sizable physical-world component…
…Current methods may also not be enough. Their limits may soon be upon us. Scaling compute another order of magnitude would require hundreds of billions of dollars more spending on hardware. According to SemiAnalysis: “This is not practical, and it is also likely that models cannot scale to this scale, given current error rates and quantization estimates.” The continued falling cost of computation could help. But we may have exhausted the low-hanging fruit in hardware optimization and are now entering an era of deceleration. Moore’s Law has persisted under various guises, but the critical factor for transformative AI may be whether we will reach it before Moore’s Law stops.
Next look at data. Villalobos et al. warn that high quality language data may run out by 2026. The team suggests data efficiency and synthetic data as ways out, but so far these are far from complete solutions, as Shumailov et al. show.
In algorithms, our understanding of what current architectures can and cannot do is improving. Delétang et al. and Dziri et al. identify particularly hard problems for the Transformer architecture. Some say that so-called emergent abilities of large language models could still surprise us. Not necessarily. Schaeffer et al. argue that emergence appears “due to the researcher’s choice of metric rather than due to fundamental changes in model behavior with scale.” …
…Humans remain a limiting factor in development. Human feedback makes AI outputs more helpful. Insofar as AI development requires human input, humans will constrain productivity. Millions of humans currently annotate data to train models. Their humanity, especially their expert knowledge and creative spark, becomes more valuable by the day. The Verge reports: “One engineer told me about buying examples of Socratic dialogues for up to $300 a pop.”…
…A big share of human knowledge is tacit, unrecorded, and diffuse… We are constantly surprised in our day jobs as a journalist and AI researcher by how many questions do not have good answers on the internet or in books, but where some expert has a solid answer that they had not bothered to record. And in some cases, as with a master chef or LeBron James, they may not even be capable of making legible how they do what they do.
The idea that diffuse tacit knowledge is pervasive supports the hypothesis that there are diminishing returns to pure, centralized, cerebral intelligence. Some problems, like escaping game-theoretic quagmires or predicting the future, might be just too hard for brains alone, whether biological or artificial…
…The history of economic transformation is one of contingency. Many factors must come together all at once, rather than one factor outweighing all else. Individual technologies only matter to the extent that institutions permit their adoption, incentivize their widespread deployment, and allow for broad-scale social reorganization around the new technology…
…All agree that history is not inevitable. We think this applies to AI as well. Just as we should be skeptical of a Great Man theory of history, we should not be so quick to jump to a Great Technology theory of growth with AI.
And important factors may not be on AI’s side. Major drivers of growth, including demographics and globalization, are going backwards. AI progress may even be accelerating the decoupling of the US and China, reducing the flow of people and ideas.
AI may not be able to automate precisely the sectors most in need of automation. We already “know” how to overcome many major constraints to growth, and have the technology to do so. Yet social and political barriers slow down technology adoption, and sometimes halt it entirely. The same could happen with AI.
Comin and Mestieri observe that cross-country variation in the intensity of use for new technologies explains a large portion of the variation in incomes in the twentieth century. Despite the dream in 1954 that nuclear power would cause electricity to be “too cheap to meter,” nuclear’s share of global primary energy consumption has been stagnant since the 90s. Commercial supersonic flight is outright banned in US airspace…
…Automation alone is not enough for transformative economic growth. History is littered with so-so technologies that have had little transformative impact, as Daron Acemoglu and Simon Johnson note in their new book Power and Progress. Fast-food kiosks are hardly a game-changer compared to human employees. In the same way, Nobel laureate Robert Fogel documented that railroads had little impact on growth because they were only a bit better than their substitutes, canals and roads. Many immediate applications of large language models, from customer service to writing marketing copy, appear similar.
OpenAI’s own economists estimate that about “19% of jobs have at least 50% of their tasks exposed” to GPT-4 and the various applications that may be built upon it. Some view this as game-changing. We would reframe it. That means over 80% of workers would have less than 50% of their tasks affected, hardly close to full automation. And their methodology suggests that areas where reliability is essential will remain unaffected for some time…
…There is a deeper point here. GDP is a made-up measure of how much some humans value what others produce, a big chunk of which involves doing social things amongst each other. As one of us recently wrote, we may value human-produced outputs precisely because they are scarce. As long as AI-produced outputs cannot substitute for that which is social, and therefore scarce, such outputs will command a growing “human premium,” and produce Baumol-style effects that weigh on growth.
5. Compounding Optimism – Morgan Housel
The question is: Did George Wheelwright know that he would influence Edwin Land, who would then influence Steve Jobs, who would then design a phone that 2.5 billion people would use?
Did Michael Faraday, who died in 1867, know that his ideas would directly influence the light bulb, which effectively led to the creation of everything from the modern power grid to nightlife?
Did Ben Graham know that his 1950s finance class would lead to 45,000 people trekking to Omaha every year to hear his student speak?
Of course not. It’s so hard to know what an idea, or an invention, or a philosophy, will influence, and what a person who’s influenced by it will go on to create.
Visa Founder Dee Hock says, “A book is far more than what the author wrote; it is everything you can imagine and read into it as well.” An author might write something that’s dull or obvious, but it could inspire a reader to go do something incredible…
…Most new ideas and inventions are pretty bland on their own. But when you mix several of them together, you can get magic. Plastic is great. Electronics are neat. Metal is special. But mix them together in the right way and you get an iPhone, which is pure magic…
…I think part of the reason pessimism is so much easier and more common than optimism is that compound growth is not intuitive.
It’s hard to imagine, say, our incomes doubling over the next few generations. That seems like such a massive leap, like we’d have to boil the ocean to get it done. But doubling the average income over 30 years works out to about 2.3% growth per year. It’s not crazy at all. It’s actually quite achievable. What made it seem so ambitious to begin with is that compound growth is easy to underestimate.
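The arithmetic is easy to check:

```python
# Annual growth rate needed to double an income over 30 years.
annual_growth = 2 ** (1 / 30) - 1
print(f"{annual_growth:.1%} per year")   # ~2.3%
```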
If you look at the end result of a long period of compounding, it’s astounding. But all it took to get it done was little bits of incremental growth strung together for a long time.
All progress is like that.
Technological progress is easy to underestimate because it’s so counterintuitive to see how, for example, the philosophies of a guy who invented Polaroid film would go on to inspire the iPhone. Or how a 19th-century physicist would write a notebook that would set the foundations for a modern electrical system.
Disclaimer: None of the information or analysis presented is intended to form the basis for any offer or recommendation. We currently have a vested interest in Alphabet (parent of DeepMind), Apple (parent of the iPhone), and Visa. Holdings are subject to change at any time.