What We’re Reading (Week Ending 17 March 2024)

Reading helps us learn about the world and it is a really important aspect of investing. The late Charlie Munger even went so far as to say that “I don’t think you can get to be a really good investor over a broad range without doing a massive amount of reading.” We (the co-founders of Compounder Fund) read widely across a range of topics, including investing, business, technology, and the world in general. We want to regularly share the best articles we’ve come across recently. Here they are (for the week ending 17 March 2024):

1. The Ultra-Pure, Super-Secret Sand That Makes Your Phone Possible – Vince Beiser

Spruce Pine is not a wealthy place. Its downtown consists of a somnambulant train station across the street from a couple of blocks of two‑story brick buildings, including a long‑closed movie theater and several empty storefronts.

The wooded mountains surrounding it, though, are rich in all kinds of desirable rocks, some valued for their industrial uses, some for their pure prettiness. But it’s the mineral in Glover’s bag—snowy white grains, soft as powdered sugar—that is by far the most important these days. It’s quartz, but not just any quartz. Spruce Pine, it turns out, is the source of the purest natural quartz—a species of pristine sand—ever found on Earth. This ultra‑elite deposit of silicon dioxide particles plays a key role in manufacturing the silicon used to make computer chips. In fact, there’s an excellent chance the chip that makes your laptop or cell phone work was made using sand from this obscure Appalachian backwater. “It’s a billion‑dollar industry here,” Glover says with a hooting laugh. “Can’t tell by driving through here. You’d never know it.”

In the 21st century, sand has become more important than ever, and in more ways than ever. This is the digital age, in which the jobs we work at, the entertainment we divert ourselves with, and the ways we communicate with one another are increasingly defined by the internet and the computers, tablets, and cell phones that connect us to it. None of this would be possible were it not for sand.

Most of the world’s sand grains are composed of quartz, which is a form of silicon dioxide, also known as silica. High‑purity silicon dioxide particles are the essential raw materials from which we make computer chips, fiber‑optic cables, and other high‑tech hardware—the physical components on which the virtual world runs. The quantity of quartz used for these products is minuscule compared to the mountains of it used for concrete or land reclamation. But its impact is immeasurable…

…In the mid‑1950s, thousands of miles from North Carolina, a group of engineers in California began working on an invention that would become the foundation of the computer industry. William Shockley, a pathbreaking engineer at Bell Labs who had helped invent the transistor, had left to set up his own company in Mountain View, California, a sleepy town about an hour south of San Francisco, near where he had grown up. Stanford University was nearby, and General Electric and IBM had facilities in the area, as well as a new company called Hewlett‑Packard. But the area known at the time as the Santa Clara Valley was still mostly filled with apricot, pear, and plum orchards. It would soon become much better known by a new nickname: Silicon Valley.

At the time, the transistor market was heating up fast. Texas Instruments, Motorola, and other companies were all competing to come up with smaller, more efficient transistors to use in, among other products, computers. The first American computer, dubbed ENIAC, was developed by the army during World War II; it was 100 feet long and 10 feet high, and it ran on 18,000 vacuum tubes.

Transistors, which are tiny electronic switches that control the flow of electricity, offered a way to replace those tubes and make these new machines even more powerful while shrinking their tumid footprint. Semiconductors—a small class of elements, including germanium and silicon, which conduct electricity at certain temperatures while blocking it at others—looked like promising materials for making those transistors.

At Shockley’s startup, a flock of young PhDs began each morning by firing up kilns to thousands of degrees and melting down germanium and silicon. Tom Wolfe once described the scene in Esquire magazine: “They wore white lab coats, goggles, and work gloves. When they opened the kiln doors weird streaks of orange and white light went across their faces . . . they lowered a small mechanical column into the goo so that crystals formed on the bottom of the column, and they pulled the crystal out and tried to get a grip on it with tweezers, and put it under microscopes and cut it with diamond cutters, among other things, into minute slices, wafers, chips; there were no names in electronics for these tiny forms.”

Shockley became convinced that silicon was the more promising material and shifted his focus accordingly. “Since he already had the first and most famous semiconductor research and manufacturing company, everyone who had been working with germanium stopped and switched to silicon,” writes Joel Shurkin in his biography of Shockley, Broken Genius. “Indeed, without his decision, we would speak of Germanium Valley.”

Shockley was a genius, but by all accounts he was also a lousy boss. Within a couple of years, several of his most talented engineers had jumped ship to start their own company, which they dubbed Fairchild Semiconductor. One of them was Robert Noyce, a laid‑back but brilliant engineer, only in his mid‑20s but already famous for his expertise with transistors.

The breakthrough came in 1959, when Noyce and his colleagues figured out a way to cram several transistors onto a single fingernail‑sized sliver of high‑purity silicon. At almost the same time, Texas Instruments developed a similar gadget made from germanium. Noyce’s, though, was more efficient, and it soon dominated the market. NASA selected Fairchild’s microchip for use in the space program, and sales soon shot from almost nothing to $130 million a year. In 1968, Noyce left to found his own company. He called it Intel, and it soon dominated the nascent industry of programmable computer chips.

Intel’s first commercial chip, released in 1971, contained 2,250 transistors. Today’s computer chips are often packed with transistors numbering in the billions. Those tiny electronic squares and rectangles are the brains that run our computers, the Internet, and the entire digital world. Google, Amazon, Apple, Microsoft, the computer systems that underpin the work of everything from the Pentagon to your local bank—all of this and much more is based on sand, remade as silicon chips.

Making those chips is a fiendishly complicated process. They require essentially pure silicon. The slightest impurity can throw their tiny systems out of whack.

Finding silicon is easy. It’s one of the most abundant elements on Earth. It shows up practically everywhere bound together with oxygen to form SiO2, aka quartz. The problem is that it never occurs naturally in pure, elemental form. Separating out the silicon takes considerable doing.

Step one is to take high‑purity silica sand, the kind used for glass. (Lump quartz is also sometimes used.) That quartz is then blasted in a powerful electric furnace, creating a chemical reaction that separates out much of the oxygen. That leaves you with what is called silicon metal, which is about 99 percent pure silicon. But that’s not nearly good enough for high‑tech uses. Silicon for solar panels has to be 99.999999 percent pure—six 9s after the decimal. Computer chips are even more demanding. Their silicon needs to be 99.99999999999 percent pure—eleven 9s. “We are talking of one lonely atom of something that is not silicon among billions of silicon companions,” writes geologist Michael Welland in Sand: The Never-Ending Story.
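To make those purity grades concrete, here is a minimal sketch that converts a purity percentage into foreign atoms per billion. The function name `impurities_per_billion` and the grade labels are ours, not the article's, and the sketch treats the percentage as a simple fraction of atoms, which is a simplification. `Decimal` is used because ordinary floats cannot hold eleven 9s exactly.

```python
from decimal import Decimal

def impurities_per_billion(purity_percent: str) -> Decimal:
    """Return how many atoms per billion are NOT silicon at a given purity.

    Assumption: the purity figure is read as a plain fraction of atoms.
    """
    impurity_fraction = (Decimal(100) - Decimal(purity_percent)) / Decimal(100)
    return impurity_fraction * Decimal(10) ** 9

# Purity figures quoted in the passage above; the labels are illustrative.
grades = {
    "silicon metal": "99",
    "solar grade": "99.999999",
    "chip grade": "99.99999999999",
}

for name, purity in grades.items():
    per_billion = float(impurities_per_billion(purity))
    print(f"{name}: {per_billion:g} foreign atoms per billion")
```

Under these assumptions, silicon metal carries ten million foreign atoms per billion, solar-grade silicon about ten, and chip-grade silicon roughly one per ten trillion.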

Getting there requires treating the silicon metal with a series of complex chemical processes. The first round of these converts the silicon metal into two compounds. One is silicon tetrachloride, which is the primary ingredient used to make the glass cores of optical fibers. The other is trichlorosilane, which is treated further to become polysilicon, an extremely pure form of silicon that will go on to become the key ingredient in solar cells and computer chips.

Each of these steps might be carried out by more than one company, and the price of the material rises sharply at each step. That first‑step, 99 percent pure silicon metal goes for about $1 a pound; polysilicon can cost 10 times as much.

The next step is to melt down the polysilicon. But you can’t just throw this exquisitely refined material in a cook pot. If the molten silicon comes into contact with even the tiniest amount of the wrong substance, it causes a ruinous chemical reaction. You need crucibles made from the one substance that has both the strength to withstand the heat required to melt polysilicon, and a molecular composition that won’t infect it. That substance is pure quartz.

THIS IS WHERE Spruce Pine quartz comes in. It’s the world’s primary source of the raw material needed to make the fused‑quartz crucibles in which computer‑chip‑grade polysilicon is melted. A fire in 2008 at one of the main quartz facilities in Spruce Pine for a time all but shut off the supply of high‑purity quartz to the world market, sending shivers through the industry.

Today one company dominates production of Spruce Pine quartz. Unimin, an outfit founded in 1970, has gradually bought up Spruce Pine area mines and bought out competitors, until today the company’s North Carolina quartz operations supply most of the world’s high‑ and ultra‑high‑purity quartz. (Unimin itself is now a division of a Belgian mining conglomerate, Sibelco.)

In recent years, another company, the imaginatively titled Quartz Corp, has managed to grab a small share of the Spruce Pine market. There are a very few other places around the world producing high‑purity quartz, and many other places where companies are looking hard for more. But Unimin controls the bulk of the trade.

The quartz for the crucibles, like the silicon they will produce, needs to be almost absolutely pure, purged as thoroughly as possible of other elements. Spruce Pine quartz is highly pure to begin with, and purer still after being put through several rounds of froth flotation. But some of the grains may still have what Glover calls interstitial crystalline contamination—molecules of other minerals attached to the quartz molecules.

That’s frustratingly common. “I’ve evaluated thousands of quartz samples from all over the world,” says John Schlanz, chief minerals processing engineer at the Minerals Research Laboratory in Asheville, about an hour from Spruce Pine. “Near all of them have contaminate locked in the quartz grains that you can’t get out.”

Some Spruce Pine quartz is flawed in this way. Those grains are used for high‑end beach sand and golf course bunkers—most famously the salt‑white traps of Augusta National Golf Club, site of the iconic Masters Tournament. A golf course in the oil‑drunk United Arab Emirates imported 4,000 tons of this sand in 2008 to make sure its sand traps were world‑class, too.

The very best Spruce Pine quartz, however, has an open crystalline structure, which means that hydrofluoric acid can be injected right into the crystal molecules to dissolve any lingering traces of feldspar or iron, taking the purity up another notch. Technicians take it one step further by reacting the quartz with chlorine or hydrochloric acid at high temperatures, then putting it through one or two more trade‑secret steps of physical and chemical processing.

The result is what Unimin markets as Iota quartz, the industry standard of purity. The basic Iota quartz is 99.998 percent pure SiO2. It is used to make things like halogen lamps and photovoltaic cells, but it’s not good enough to make those crucibles in which polysilicon is melted. For that you need Iota 6, or the tip‑top of the line, Iota 8, which clocks in at 99.9992 percent purity, meaning for every one billion molecules of SiO2, there are only 80 molecules of impurities. Iota 8 sells for up to $10,000 a ton. Regular construction sand, at the other end of the sand scale, can be had for a few dollars per ton…

…Unimin sells this ultra‑high‑purity quartz sand to companies like General Electric, which melts it, spins it, and fuses it into what looks like a salad bowl made of milky glass: the crucible. “It’s safe to say the vast majority of those crucibles are made from Spruce Pine quartz,” Schlanz says.

The polysilicon is placed in those quartz crucibles, melted down, and set spinning. Then a silicon seed crystal about the size of a pencil is lowered into it, spinning in the opposite direction. The seed crystal is slowly withdrawn, pulling behind it what is now a single giant silicon crystal. These dark, shiny crystals, weighing about 220 pounds, are called ingots.

The ingots are sliced into thin wafers. Some are sold to solar cell manufacturers. Ingots of the highest purity are polished to mirror smoothness and sold to a chipmaker like Intel. It’s a thriving multi-billion dollar industry in 2012.

The chipmaker imprints patterns of transistors on the wafer using a process called photolithography. Copper is implanted to link those billions of transistors to form integrated circuits. Even a minute particle of dust can ruin the chip’s intricate circuitry, so all of this happens in what’s called a clean room, where purifiers keep the air thousands of times cleaner than a hospital operating room. Technicians dress in an all‑covering white uniform affectionately known as a bunny suit. To ensure the wafers don’t get contaminated during manufacture, many of the tools used to move and manipulate them are, like the crucibles, made from high‑purity quartz.

The wafers are then cut into tiny, unbelievably thin quadrangular chips—computer chips, the brains inside your mobile phone or laptop. The whole process requires hundreds of precise, carefully controlled steps. The chip that results is easily one of the most complicated man‑made objects on Earth, yet made with the most common stuff on Earth: humble sand.

The total amount of high‑purity quartz produced worldwide each year is estimated at 30,000 tons, less than the amount of construction sand produced in the United States every hour. (And even construction sand is in high demand; there’s a thriving black market in the stuff.) Only Unimin knows exactly how much Spruce Pine quartz is produced, because it doesn’t publish any production figures. It is an organization famously big on secrecy. “Spruce Pine used to be mom‑and‑pop operations,” Schlanz says. “When I first worked up there, you could just walk into any of the operations. You could just go across the street and borrow a piece of equipment.”

NOWADAYS UNIMIN WON’T even allow staff of the Minerals Research Laboratory inside the mines or processing facilities. Contractors brought in to do repair work have to sign confidentiality agreements. Whenever possible, vice‑president Richard Zielke recently declared in court papers, the company splits up the work among different contractors so that no individual can learn too much.

Unimin buys equipment and parts from multiple vendors for the same reason. Glover has heard of contractors being blindfolded inside the processing plants until they arrive at the specific area where their jobs are and of an employee who was fired on the spot for bringing someone in without authorization. He says the company doesn’t even allow its employees to socialize with those of their competitors.

It was hard to check out Glover’s stories, because Unimin wouldn’t talk to me. Unlike most big corporations, its website lists no contact for a press spokesperson or public relations representative. Several emails to their general inquiries address went unanswered. When I called the company’s headquarters in Connecticut, the woman who answered the phone seemed mystified by the concept of a journalist wanting to ask questions.

She put me on hold for a few minutes, then came back to tell me the company has no PR department, but that if I faxed (faxed!) her my questions, someone might get back to me. Eventually I got in touch with a Unimin executive who asked me to send her my questions by email. I did so. The response: “Unfortunately, we are not in a position to provide answers at this point in time.”

2. It was never about LLM performance – Justin

The LLM community is obsessed with benchmarking model performance. Mistral released their new “flagship” model this week, and immediately focused the discussion on how it performs on “commonly used benchmarks” relative to other models:

The entire blog post (I’d recommend reading it) is just a read through of how this model performs relative to other models on benchmarks, from math and coding to multilingual capabilities…

…This tendency to fixate on benchmarks is understandable – right now, it’s basically the only semi-objective way to measure how these models stack up against each other. It’s something vendors in other spaces, like data streaming, do too. But it is dangerous because it misses the point of where this whole AI thing is going, and is a textbook product marketing anti-pattern.

In a trend that we’ve seen hundreds of times in developer tooling, the underlying LLM is not going to matter within a few years. Large Language Model performance is already highly commoditized, and will continue to head in that direction. All that will matter is the experience that you build on top of these models, and what that enables for your customers.

Let’s take a look at the ChatGPT interface. Here’s a common prompt I’ve been using for testing, asking the model to summarize the contents of an external link into a tweet thread. Unrelated aside, the responses to this prompt are virtually identical across every major LLM.

Which parts of this interface are the underlying model – GPT-4 in this case – and which are an experience built by OpenAI on top of the underlying model?

The text response, minus any formatting, is what the model generated. But the:

  • Ability of the model to access and scrape content from a web page
  • Context of the prompt, including setting the system as a helpful assistant
  • Formatting the response, like changing the numbers to gray
  • UI for typing the prompt
  • Filepicker for attaching media to the prompt
  • Prompt history
  • Model switcher / picker (this one is meta)
  • Ability to persist and share the model responses

…and more not shown here

are all not GPT-4, they’re features built by OpenAI on top of GPT-4 to create an experience that is helpful and worth paying for. Some of these are harder to build than others – OpenAI’s secret sauce obviously isn’t the little arrow that scrolls down to the bottom of the response. ChatGPT would be nothing without GPT-4 – but the reverse may also be true!

The retort to this line of reasoning is that these chat interfaces are primarily for non-technical users, while the real money for these model providers comes from developer use cases, building LLMs into user-facing applications. I’ve worked closely with one of the major model compute providers, so this is not foreign to me. But experience matters to developers too!

OpenAI has dedicated significant resources to building a seamless developer experience beyond “docs for the model.” Here’s their playground for prompting GPT models – you can adjust parameters like temperature and penalties, plus change the system prompt to be any other style…

…For a closed source model provider like OpenAI, the difference between what is model and what is experience is academic – you’re paying for both. They are one thing. But where this really matters is in open source. Does the convergence of open source performance to closed source performance really matter if the experience of using that open source is bad?…

…The open source discussion has been too anchored on reaching performance parity with OpenAI models. This is a small piece of the puzzle. For developers looking to build applications with these open source models, and especially the pro-sumer chat use case, users need to consider the holistic experience that model providers offer. Integrating LLMs into your app is almost never going to be the “drop in” experience you see on marketing sites – and my concern is that the “open source is approaching parity with OpenAI!” narrative is not actually true in a meaningful way.

Folks working in AI can look to previous examples of this phenomenon in developer tools for guidance: A couple of years ago, I wrote about how underlying performance of production relational databases is becoming commoditized, and vendors are focusing much more on developer experience. It’s going to happen here too, the question is just when.

3. Aravind Srinivas – Building An Answer Engine – Patrick O’Shaughnessy and Aravind Srinivas

Patrick: [00:07:28] It’s really cool to think about the sequencing to get there. We’ve had search engines. Like you said, it’s a hack to get the answers. You’re building what I think of today as an answer engine. I type something in, it’s just giving the answer directly with great citation and all this other stuff we’ll talk about. And the vision you’re articulating is this question engine can anticipate the things that I want to learn about and give them to me beforehand.

And I’d love to build up towards that. So maybe starting with the answer engine, explain to us how it works. Maybe you could do this via the time line of how you’ve built the product or something. But what are the components? What is happening behind the scenes when I type something into Perplexity either a question or a search query or whatever? Walk us through in some detail the actual goings on behind the scenes in terms of how the product works itself?

Aravind: [00:08:13] Yes. So when you type in a question into Perplexity, the first thing that happens is, it first reformulates the question, it tries to understand the question better, expands the question in terms of adding more suffixes or prefixes to it, to make it more well formatted. It speaks to the question engine part. And then after that, it goes and pulls so many links from the web that are relevant to this reformulated question.

There are so many paragraphs in each of those links. It takes only the relevant paragraphs from each of those links. And then an AI model, we typically call it large language model. It’s basically a model that’s been trained to predict the next word on the Internet and fine-tuned for being good at summarization and chats.

That AI model looks at all these chunks of knowledge the bits of study that surface from important or relevant links and takes only those parts that are relevant to answering your query and gives you a very concise four or five sentence answer, but also with references. Every sentence has a reference to which webpage or which chunk of knowledge it took from which webpage and puts it at the top in terms of sources.

That gets you to a nicely formatted rendered answer, sometimes in markdown bullets, or sometimes just generic paragraphs, sometimes it has images in it. But a great answer with references or citation so that if you want to dig deeper, you can go and visit the link. If you don’t want and just read the answer and ask a follow-up, you can engage in a conversation, both modes of usage are encouraged and allowed. So this is what happens on Perplexity today.
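The flow Aravind describes (reformulate the question, pull links, keep only the relevant paragraphs, answer with references) can be sketched in a few lines. This is a toy illustration, not Perplexity's implementation: the `PAGES` data, the URLs and every function name are hypothetical, keyword overlap stands in for real relevance ranking, and the "answer" simply stitches the chunks together where a real system would have a language model write the summary.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    url: str
    text: str

# Toy stand-in for web search results (hypothetical pages and text).
PAGES = {
    "https://example.com/quartz": [
        "Spruce Pine quartz is used to make crucibles.",
        "The town has a long-closed movie theater.",
    ],
    "https://example.com/chips": [
        "Polysilicon is melted in quartz crucibles to grow ingots.",
    ],
}

def reformulate(question: str) -> str:
    """Step 1: normalise the question (a real system would use a model here)."""
    return question.strip().rstrip("?").lower()

def relevant_chunks(query: str) -> list[Chunk]:
    """Steps 2-3: scan the pages, keep paragraphs sharing a word with the query."""
    terms = set(query.split())
    return [
        Chunk(url, para)
        for url, paras in PAGES.items()
        for para in paras
        if terms & set(para.lower().rstrip(".").split())
    ]

def answer_with_citations(question: str) -> str:
    """Step 4: list the sources first, then the answer with a [n] per sentence."""
    chunks = relevant_chunks(reformulate(question))
    sources: list[str] = []
    sentences = []
    for chunk in chunks:
        if chunk.url not in sources:
            sources.append(chunk.url)
        sentences.append(f"{chunk.text} [{sources.index(chunk.url) + 1}]")
    refs = "\n".join(f"[{i}] {url}" for i, url in enumerate(sources, 1))
    return refs + "\n\n" + " ".join(sentences)

print(answer_with_citations("What are crucibles?"))
```

Each sentence carries a marker pointing back to the page it came from, which is what lets a reader click through and check the answer.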

Patrick: [00:09:51] What percent of users end up clicking beneath the summarized answer into a source webpage?

Aravind: [00:10:01] At least 10%.

Patrick: [00:10:02] So 90% of the time, they’re just satisfied with what you give them?

Aravind: [00:10:06] It depends on how you look at it. If you wanted to be 100% of the time, people always click on a link, that’s the traditional Google. And you want to be 100% of the time where people never click on links, that’s ChatGPT. We think the sweet spot is somewhere in the middle. People should click on a link sometimes to go do their work there. Let’s say, you’re just booking a ticket, you might actually want to go away to Expedia or something.

Let’s say you’re deciding where to go first. You don’t need to go away and read all these SEO blogs and get confused on what you want to do. You first make your decision independently with this research body that’s helping you decide. And once you finished your research and you have decided, then that’s when you actually have to go out and do your actual action of booking your ticket. That way, I believe there is a nice sweet spot of one product providing you both the navigational search experience as well as the answer engine experience together. And that’s what we strive to be doing…

Patrick: [00:13:54] Can you explain from an insider’s perspective and someone building an application on top of these incredible new technologies, what do you think the future might look like or even what you think the ideal future would be for how many different LLM providers there are, how specialized they get? Is scale the primary answer, so there’s only going to be a few of them? How do you think about all this and where you think it might go?

Aravind: [00:14:16] It really depends on who you’re building for. If you’re building for consumers, you do want to build a scalable infrastructure because you do want to ask many consumers to use your product. If you’re building for the enterprise, you still want a scalable infrastructure.

Now it really depends, are you building for the people within that company who are using your product. Let’s say, you’re building an internal search engine, you only need to scale to the size of the largest organization, which is like maybe 100,000 people. And not all of them will be using your thing at one moment. You’re decentralizing it, you’re going to keep different servers for different companies and you can elastically decide what’s the level of throughput you need to offer.

But then if you’re solving another enterprise’s problem, where that enterprise is serving consumers and you’re helping them do that, you need to build scalable infrastructure indirectly at least. For example, OpenAI. Their APIs are used by us, other people to serve a lot of consumers. So unless they solve that problem themselves, they’re unable to help other people solve their problem. Same thing with AWS.

So that’s one advantage you have of actually having a first-party product that your infrastructure is helping you serve. And by doing that, by forcing yourself to solve that hard problem, whatever you build can be used by others as well. Amazon built AWS first for Amazon. And because Amazon.com requires very robust infrastructure, that can be used by so many other people and so many other companies emerged by building on top of AWS.

Same thing happened with OpenAI. They needed robust infrastructure to serve the GPT-3 developer API and ChatGPT as a product. But once they got it all right, then they can now support other companies that are building on top of them. So it really depends on what’s your end goal and who you’re trying to serve and what’s the scale of our ambition…

Patrick: [00:19:02] And when I think about the history of the product, which I was a pretty early user of, the first thing that pops to my mind is that it solves the hallucination problem, which has become less of a problem. But early on, everyone just didn’t know how to trust these things and you solved that. You gave citations, you can click through the underlying webpages, et cetera.

I’d love you to walk through what you view the major time line product milestones have been of Perplexity dating back to its start. The one I just gave could be one example. There was this possibility, but there was a problem and you solved it, at least that was my perception as a user. What have been the major milestones as you think back on the product and how it’s gotten better?

Aravind: [00:19:41] I would say the first major thing we did is really making the product a lot faster. When we first launched, the latency for every query was seven seconds, then we actually had to speed up the demo video to put it on Twitter so that it doesn’t look embarrassing.

And one of our early friendly investors, Daniel Gross who co-invests a lot with Nat Friedman, he was one of our first testers before we even released the product. And he said, you guys should call it a submit button for a query. It’s almost like you’re submitting a job and waiting on the cluster to get back. It’s that slow.

And now we are widely regarded as the fastest chatbot out there. Some people even come and ask me, why are you only as fast as ChatGPT? Why are you not faster? And little did they realize that ChatGPT doesn’t even use the web by default. It only uses it on the browsing mode on Bing.

So for us to be as fast as ChatGPT already tells you that in spite of doing more work to go pull up links from the web, read the chunks, pick the relevant ones and use that to give you the answer with sources and a lot more work on the rendering, despite doing all the additional work, if you’re managing an end-to-end latency as good as ChatGPT that shows we have like even a superior back end to them.

So I’m most proud about the speed at which we can do things today compared to when we launched, the accuracy has been constantly going up, primarily few things. One is we keep expanding our index and like keep improving the quality of the index. From the beginning, we knew all the mistakes that previous Google competitors did, which is obsessed about the size of your index and focus less on the quality.

So we decided from the beginning we would not obsess about the size. Size doesn’t matter in an index. Actually, what matters is the quality of your index. What kind of domains are important for AI chatbots and question-answering and knowledge workers. That is what we care about. So that decision ended up being right.

The other thing that has helped us improve the accuracy was training these models to be focused on hallucinations. When you don’t have enough information in the search snippets, try to just say I don’t know, instead of making up things. LLMs are conditioned to always be helpful, will always try to serve the user’s query despite what it has access to, may not be even sufficient to answer the query. So that part took some reprogramming, rewiring. You’ve got to go and change the weights. You can’t just solve this with prompt engineering. So we have spent a lot of work on that.

The other thing I’m really proud about is getting our own inference infrastructure. So when you have to move outside the OpenAI models to serve your product, everybody thinks, “Oh, you just train a model to be as good as GPT and you’re done.” But reality is OpenAI’s moat is not just in the fact that they have trained the best models, but also that they have the most cost-efficient, scalable infrastructure for serving this on a large-scale consumer product like ChatGPT. That is itself a separate layer of moat. You can build that moat, you can build.

And so we are very proud of our inference team, how fast, high throughput, low latency infrastructure we built for serving our own LLMs. We took advantage of the open source revolution, Llama and Mistral and took all these models, trained them to be very good at being great answer bots and served them ourselves on GPU so that we get better margins on our product. So all these three layers, both in terms of speed through actual product back-end orchestration, accuracy of the AI models and serving our own AI models, we’ve done a lot of work on all these things…

Patrick: [00:28:50] Can you expand on the index? You’ve referenced that a few times. For those that haven’t built one or haven’t thought about this, just explain that whole concept and the decisions that you’ve made. You already mentioned quality versus size. But just explain what it means to build an index, why it’s so important, et cetera.

Aravind: [00:29:07] Yes. So what does an index mean? It’s basically a copy of the web. The web has so many links and you want a cache, you want a copy of all those links in a database, so a URL and the contents in that URL. Now the challenge here is new links are being created every day on the web and existing links keep getting updated as well. New sites keep getting updated. So you’ve got to periodically refresh them. The URL needs to be updated in the cache with a different version of it.

Similarly, you’ve got to keep adding new URLs to your index, which means you’ve got to build a crawler. And then how you store a URL and the contents in that URL also matters. Not every page is native HTML anymore. The web has upgraded a lot, with a lot of JavaScript rendering, and every domain renders its JavaScript in its own custom way. So you’ve got to build parsers. So you’ve got to build a crawler, indexer, and parser, and that together makes for a great index.
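To make the "copy of the web" idea concrete, here is a minimal sketch of an index as a URL-to-contents cache that is periodically refreshed. It is an illustration only, not Perplexity's implementation: `TinyIndex`, `IndexEntry`, `max_age_seconds`, and the `fetch` callable (standing in for the crawler and parser) are all hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class IndexEntry:
    url: str
    content: str
    fetched_at: float  # seconds since epoch when this copy was taken


class TinyIndex:
    """A toy URL -> content cache with periodic refresh (illustrative only)."""

    def __init__(self, fetch: Callable[[str], str], max_age_seconds: float):
        self._fetch = fetch              # hypothetical crawl-and-parse step
        self._max_age = max_age_seconds  # how stale a cached copy may get
        self._entries: Dict[str, IndexEntry] = {}

    def add(self, url: str, now: Optional[float] = None) -> None:
        """Crawl a new URL and store a copy of its contents."""
        now = time.time() if now is None else now
        self._entries[url] = IndexEntry(url, self._fetch(url), now)

    def refresh_stale(self, now: Optional[float] = None) -> List[str]:
        """Re-fetch every entry older than max_age; return the refreshed URLs."""
        now = time.time() if now is None else now
        refreshed = []
        for url, entry in list(self._entries.items()):
            if now - entry.fetched_at > self._max_age:
                self._entries[url] = IndexEntry(url, self._fetch(url), now)
                refreshed.append(url)
        return refreshed

    def get(self, url: str) -> Optional[str]:
        """Return the cached contents for a URL, or None if it is not indexed."""
        entry = self._entries.get(url)
        return entry.content if entry else None
```

A real index would also need link discovery, JavaScript rendering, and per-domain parsing, which this sketch leaves inside the assumed `fetch` function.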

Now the next step is retrieval, which is, now that you have that index, every time you get a query, which links do you use? And which paragraphs in those links do you use? Now that is the ranking problem. How do you figure out relevance and ranking? And once you retrieve those chunks, like the top few chunks relevant to a query that the user is asking, that’s when the AI model comes in. So this is the retrieve part. Now the generate part. That’s why it’s called retrieve and generate.

So once you retrieve the relevant chunks from the huge index that you have, the AI model will come and read those chunks and then give you the answer. Doing this ensures that you don’t have to keep training the AI model to be up to date. What you want the AI model to do is to be intelligent, to be a good reasoning model.
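The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy under stated assumptions: ranking here is naive term overlap standing in for a real relevance model, and `generate` is a hypothetical callable standing in for the LLM. None of the function names come from the interview.

```python
import re
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, chunks: List[str], top_k: int = 2) -> List[str]:
    """Rank chunks by how many distinct query terms they contain."""
    query_terms = set(tokenize(query))
    scored: List[Tuple[int, int, str]] = []
    for position, chunk in enumerate(chunks):
        overlap = len(query_terms & set(tokenize(chunk)))
        if overlap > 0:
            scored.append((overlap, position, chunk))
    # Highest overlap first; earlier chunks win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:top_k]]


def build_prompt(query: str, retrieved: List[str]) -> str:
    """Put the retrieved chunks in front of the question for the model to read."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return f"Sources:\n{context}\nQuestion: {query}"


def answer(query: str, chunks: List[str], generate: Callable[[str], str]) -> str:
    """Retrieve first, then let the (assumed) model generate from the chunks."""
    retrieved = retrieve(query, chunks)
    if not retrieved:
        # Nothing relevant was found, so do not ask the model to guess.
        return "I don't know."
    return generate(build_prompt(query, retrieved))
```

Because the facts arrive through the prompt, the model itself does not need retraining when the index is refreshed, which is the point Aravind makes about keeping the model a reasoner rather than a memorizer.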

Think about this as when you were a student. I’m sure you would have written an open book exam, an open notes exam, in school or high school or college. What do those exams test you for? They don’t test you for rote learning. So it doesn’t give an advantage to the person who has the best memory power. It gives an advantage to the person who has read the concepts and can immediately query the right part of the notes, but the questions require you to think on the fly as well.

That’s how we want to design systems. It’s a very different philosophy from OpenAI, where OpenAI wants this one model that’s so intelligent, so smart, you can just ask it anything. It’s going to tell you. We would rather build a small, efficient model that’s smart, capable, and can reason on the fly over the facts that it’s given. It should disambiguate different individuals with different names, say when there is not sufficient information, and not get confused about dates.

When you’re asking something about the future, it should say that it has not yet happened. It should handle all of these sorts of corner cases with good reasoning capabilities, yet have access to all of the world’s knowledge in an instant through a great index. And if you can do both of these together, orchestrated end-to-end with great latency and user experience, you’re creating something extremely valuable. So that’s what we want to build…

Patrick: [00:37:26] Do you think that the transformer architecture is here to stay and will remain the dominant tool or architecture for a long time?

Aravind: [00:37:33] This is a question that everybody has asked in the last six or seven years since the first transformer came out. Honestly, nothing has changed. The only thing that has changed is the transformer became a mixture of experts model, where there are multiple models and not just a single model. But the core self-attention model architecture has not changed. And people say there are shortcomings, that the quadratic attention complexity is there. But any solution to that incurs costs somewhere else too.

Most people are not aware that the majority of the computation in a large transformer like GPT-3 or 4 is not even spent on the attention layer. It’s actually spent on the matrix multiplies. So if you’re trying to focus more on the quadratic part, you’re incurring costs in the matrix multiplies, and that’s actually the bottleneck in the larger scaling.
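A rough back-of-the-envelope count illustrates the claim above. The sketch below uses the common approximation of 2 FLOPs per multiply-add and ignores softmax, layer norm, and biases. Reading "the quadratic part" as the attention-score computation and "the matrix multiplies" as the dense weight multiplications is an interpretation, and the function name and the 4x MLP width are assumptions, not figures from the interview.

```python
def layer_flops(seq_len: int, d_model: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward-pass FLOPs for one transformer layer on one sequence."""
    n, d = seq_len, d_model
    # Q, K, V and output projections: four (n x d) @ (d x d) products.
    projections = 8 * n * d * d
    # Two dense MLP layers: d -> mlp_ratio * d -> d.
    mlp = 4 * mlp_ratio * n * d * d
    # The quadratic part: Q @ K^T and attention weights @ V.
    attention_scores = 4 * n * n * d
    total = projections + mlp + attention_scores
    return {
        "dense_matmuls": projections + mlp,
        "attention_scores": attention_scores,
        "quadratic_fraction": attention_scores / total,
    }


# GPT-3's published configuration: d_model = 12288, context length = 2048.
gpt3_like = layer_flops(seq_len=2048, d_model=12288)
print(f"quadratic share of layer FLOPs: {gpt3_like['quadratic_fraction']:.1%}")
```

Under these assumptions the quadratic share works out to n / (6d + n), which stays under 3% for the GPT-3 configuration and only grows once the sequence length becomes comparable to several times the model width.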

So honestly, it’s very hard to make an innovation on the transformer that can have a material impact at the level of GPT-4, given the complexity and cost of training those models. So I would bet more on innovations in auxiliary layers, like retrieval-augmented generation. Why do you want to train a really large model when you don’t have to memorize all the facts from the Internet, when you literally have to just be a good reasoning model?

Nobody is going to value Patrick for knowing all facts. They’re going to value you for being an intelligent person, for fluid intelligence. If I give you something very new that nobody else has experience in, are you well positioned to learn that skill fast and start doing it really well? When you hire a new employee, what do you care about? Do you care about how much they know about something? Or do you care about whether you can give them any task and they would still get up to speed and do it? Which employee would you value more?

So that’s the sort of intelligence that we should bake into these models, and that requires you to think more about the data. What are these models training on? Can we make them train on something other than just memorizing all the words on the Internet? Can we make reasoning emerge in these models in a different way? And that might not need innovation on the transformer; that may need innovation more on what data you’re throwing at these models.

Similarly, another layer of innovation that’s waiting to happen is in the architecture, like sparse versus dense models. Clearly, mixture of experts is working: GPT-4 is a mixture of experts, Mixtral is a mixture of experts, Gemini 1.5 is a mixture of experts. But even there, it’s not one model for coding, one model for reasoning and math, and one model for history, where depending on your input, it’s getting routed to the right model. It’s not that sparse.

Every individual token is routed to a different model, but it’s happening at every layer. So you’re still spending a lot of compute. How can we create something that’s actually like 100 humans in one company? The company itself is, in aggregate, so much smarter. We’ve not created the equivalent at the model layer. More experimentation on sparsity, and more experimentation on how we can make reasoning emerge in a different way, is likely to have a lot more impact than thinking about what the next transformer is.
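The per-token, per-layer routing described above can be sketched as follows. This is one common top-k gating formulation written in plain Python for illustration. It is not a description of how GPT-4, Mixtral, or Gemini 1.5 are actually implemented, and the function names and the `router_weights` layout are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a short list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(
    token_vec: List[float],
    router_weights: List[List[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Pick top_k experts for one token at one layer.

    router_weights[e] is the (hypothetical) learned gating vector of expert e.
    Returns (expert_index, mixing_weight) pairs, best expert first.
    """
    logits = [sum(t * w for t, w in zip(token_vec, gate)) for gate in router_weights]
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    weights = softmax([logits[e] for e in chosen])
    return list(zip(chosen, weights))


def route_sequence(
    tokens: List[List[float]],
    routers_per_layer: List[List[List[float]]],
    top_k: int = 2,
) -> List[List[List[Tuple[int, float]]]]:
    """Routing decisions indexed as [layer][token]: every token is routed
    again at every layer, so no single expert handles a whole input."""
    return [
        [route_token(tok, router, top_k) for tok in tokens]
        for router in routers_per_layer
    ]
```

Because each token still passes through `top_k` experts at every layer, compute per token stays substantial, which is the sense in which this kind of routing is "not that sparse".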

4. Training great LLMs entirely from ground up in the wilderness as a startup – Yi Tay

People always assume it’s simply a question/debate of accelerator choice (TPUs vs GPUs etc) and all GPU clusters are created equal. For us, this soon proved to be false. As we sampled across different service providers, we find that the variance of hardware quality differs vastly even for the same hardware, i.e., GPUs (H100s). Note that here, hardware refers to overall cluster quality and not necessarily the chips or accelerators per se. Just like a lottery. Basically:

Not all hardware is created equal. The variance of cluster quality across hardware providers is so high that it is literally a lottery pertaining to how much pain one would have to go through to train good models. In short, a hardware lottery in the era of LLMs.

More specifically, we’ve leased a few clusters from several compute providers, each with a range of hundreds to thousands of chips. We’ve seen clusters that range from passable (just annoying problems that are solvable with some minor SWE hours) to totally unusable clusters that fail every few hours due to a myriad of reasons. Specifically, some clusters have nodes that fail every N hours (where N is unreasonably small), with issues ranging from cabling issues to GPU hardware errors etc. Even more surprisingly, every cluster across the same provider could also be vastly different in terms of how robust it was…

…Did I mention you’ll also get a different Model FLOPs Utilisation (MFU) for different clusters!? This was a non-negligible amount of compute wasted if one is unlucky enough to find a provider with badly cabled nodes or some other issues. Systems with very sub-optimal file systems would have the MFU of training runs tank the moment a teammate starts transferring large amounts of data across clusters.
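For readers unfamiliar with MFU, a minimal sketch of how it is commonly estimated follows. It uses the widely used approximation of about 6 FLOPs per parameter per token for a forward and backward pass, ignoring attention. The function name, parameter names, and all the numbers are hypothetical, not figures from Reka.

```python
def model_flops_utilisation(
    n_params: float,
    tokens_per_second: float,
    n_chips: int,
    peak_flops_per_chip: float,
) -> float:
    """Estimate MFU: achieved training FLOPs/s divided by theoretical peak FLOPs/s."""
    # Rule of thumb: ~6 FLOPs per parameter per token (forward + backward pass).
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical comparison: same model and chip count, different observed throughput.
cluster_a = model_flops_utilisation(1e9, 1.0e5, n_chips=1, peak_flops_per_chip=1e15)
cluster_b = model_flops_utilisation(1e9, 0.6e5, n_chips=1, peak_flops_per_chip=1e15)
print(f"cluster A MFU: {cluster_a:.0%}, cluster B MFU: {cluster_b:.0%}")
```

Since the model and the nominal hardware are the same in both calls, any gap in MFU comes entirely from throughput, which is where badly cabled nodes or slow file systems would show up.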

Every service provider also had different levels of support. These range from being polite to nonchalant, “chatgpt-style” canned responses to blaming the user for every single thing that goes wrong.

Overall, every single cluster we tried feels like they have their own vibe, struggles and failure modes. It was also almost as though every single cluster needed their own hot-fixes for their own set of issues – some more tolerable than others. That said, we’ve learned that fail safes are important, and finding fast hot fixes for any clusters could be key…

…We’re training our models on GPUs for the most part at Reka. Personally, I’ve used TPUs all my life when it comes to large language model training at Google pre-Reka life. CUDA and nccl were the most alien thing to me ever. (I only learned it’s pronounced “Nickel” from one of my coworkers who used to work at Nvidia lol)

I was completely taken aback by the failure rate of GPUs as opposed to my experiences on TPUs at Google. In fact, I don’t actually recall TPUs failing much even for large runs, though I was not sure if I was protected from knowing this just by the sheer robustness of the outrageously good infra and having a dedicated hardware team. In fact, the UL2 20B model (at Google) was trained by leaving the job running accidentally for a month. It never failed. If this were in GPU land, it would have failed within the first few days for sure.

That said, I think this could be more about the competency of the hardware team that manages your accelerators rather than the underlying chip. The presence of having good hardware support (from your compute provider) is important. And so much hinges on them being actually competent, reinforcing the notion of the “hardware lottery”…

…It is no secret that my favourite codebase of all time is T5X and Mesh Tensorflow (named tensors ftw) but these options quickly became not viable as 1) they don’t get as much support outside Google, 2) they are kind of deprecated and 3) they are not friendly to folks on our team that are not xooglers.

We ended up going for something vanilla, seemingly stable and more popular (i.e., pytorch) that is more accessible to most people on the team (except me lol). In my first few months, I was tripping all over pip, git, docker and all these wild life stuff. Then again, I am not 100% sure about how stable or user friendly it would be to use a google codebase externally (it would have been pretty nasty I guess).

To be very frank, I would have to say the quality of codebases externally significantly lags behind those I’ve been used to at Google. Primarily because codebases within Google tend to be written by ML rockstars themselves (e.g., Noam Shazeer, Barret Zoph, Adam Roberts, Hyung Won Chung et al.) and just feel better (e.g., superior vibes) compared to those I’ve tried externally. In particular, I found myself super annoyed with the code quality when dabbling with stuff built by other companies (some way worse than others 🤗).

5. How The Interstate Highway System Changed American Industry – Lawrence Hamtil

Signed into law in 1956 by then President Dwight Eisenhower, the Federal Highway Act created the Interstate Highway System, which would become the largest and costliest public works project in history.  Measuring almost 48,000 miles in total distance, the Interstate Highway System was completed only in 1992, more than three decades after work began, and for a total cost in today’s dollars of more than $500 billion…

…Among the beneficiaries of this huge outlay were the quarry owners and aggregate miners, who provided the gravel and rock on which the interstates were laid, the heavy machinery manufacturers who provided the graders, tractors, and steamrollers that turned those rocks into roads, and the oil and gas producers and refiners who made the gasoline and diesel that fueled the project…

…As families began to set out exploring the country on the new interstate system, restaurateurs such as Ray Kroc and Howard Johnson recognized the need to provide traveling families with predictable, familiar service.  The idea of the chain restaurant was born as interstate exit ramps guided hungry motorists to McDonald’s and Howard Johnson’s.  Families would also need places to stay on longer journeys, so hotels followed restaurants in the chain model as franchises like Holiday Inn became a staple of interstate exits; early ads for the hotel underlined the value of the familiar by stating, “The best surprise is no surprise.”

The logistical flexibility provided by the interstate system also gave rise to a whole new model of retailing:  big box stores began to set up in small towns offering rich variety and low prices to consumers previously left unserved by larger retailers.  Walmart’s 1975 annual report detailed just such a model…

…Whereas not quite a century before the railroads had aided in the rise of Sears, Roebuck, and Co. as the first retailer with national reach, the interstate in the 1960s and 1970s would provide the backbone of Walmart’s logistical operations, with large distribution centers situated at critical points throughout the interstate network to facilitate inventory replenishment, as Professor Jesse LeCavalier has noted on his blog. 

Disclaimer: None of the information or analysis presented is intended to form the basis for any offer or recommendation. We currently have a vested interest in Alphabet (parent of Google), Amazon, Apple, and Microsoft. Holdings are subject to change at any time.

Ser Jing & Jeremy