The Role of Data

Our knowledge has proven to be less important than we hoped.
An Ocean of Data
Oceans of data are needed for machine learning.
The examples from the last section highlight a critical point about automation that is worth repeating: it requires boatloads of data. A company like Google can use ML successfully because it has access to massive datasets about what its users are looking for when they search. If you don't have data, you can't train an AI system.
Data is behind the success of many other large companies. Here are just a few examples:
  • Netflix uses a massive trove of user data and ML to determine what movies and TV shows to recommend to its users.
  • Amazon utilizes its large datasets about buying behavior to figure out how to entice people to add more items to their shopping carts.
  • Facebook and Instagram deliver highly targeted ads based on each user's individual "liking" behavior.
However, there are some surprising truths about how data is used in the AI world that are worth considering.
For one, it's no longer just a matter of having clean data. OpenAI's GPT systems, although still flawed, have brought high-level AI tech to the masses largely by ingesting a huge swath of the public internet. The generational jump from GPT-2 to GPT-3, for example, came mostly from scale: a much bigger model trained on far more data. The more data, the better the results, apparently.

The Bitter Lesson

There's a famous essay by Richard Sutton called The Bitter Lesson that lays this out. It's controversial in the AI world, but the gist of what he wrote is this: every time we just scale up general-purpose methods with more computing power (and, in practice, more data), we get better results than when we try to build our own knowledge into these systems. It's a "bitter lesson" because it's a hit to our ego when a computer just gobbles up a bunch of noisy data and outperforms AI that humans have painstakingly tried to infuse with our collective knowledge.
That's been true of a wide variety of systems, from Deep Blue to ChatGPT. Throw enough processors and data at a problem and, more often than not, you get an AI system that outperforms the ones we carefully design by hand.
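To make the "noisy data at scale" idea a little more concrete, here is a toy sketch in Python. It is not from Sutton's essay and it is nothing like a real AI system: it just fits a straight line to noisy points and shows that the estimate tends to improve as the number of samples grows. The function names, the true slope of 3.0, and the noise level are all assumptions made up for this illustration.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mean_slope_error(n_samples, true_slope=3.0, noise=1.0, trials=200, seed=0):
    """Average absolute error of the fitted slope over many noisy datasets."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
        # Each observation is the true line plus random noise.
        ys = [true_slope * x + rng.gauss(0.0, noise) for x in xs]
        total += abs(fit_slope(xs, ys) - true_slope)
    return total / trials

# Same simple method, same noise level, only the amount of data changes.
errors = {n: mean_slope_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"{n:>5} noisy samples -> average slope error {err:.3f}")
```

Nothing about the method gets smarter between runs. The only thing that changes is how much data it sees, and the average error shrinks anyway.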

Proprietary Datasets

Why am I telling you all of this? Because there's a race on now in the world of AI to find and leverage specialized datasets. People in the field originally thought that data would be the prime driver of monopolization in AI, but that might not be the case. If all you need is a ton of data, then just hoover up the entire internet! Nobody can own that, and it commoditizes anything that leans on that dataset.
In some sense, it will create a situation where everyone is using pretty much the same dataset, with maybe a few minor differences. That's good news for consumer-facing AI, because it means it's very unlikely anyone will own it all (hopefully).
That leaves only one type of data with real value: highly specialized data that can be used for fine-tuning. Anybody who wants to build a product that depends only on the entire internet's data will, for the reasons outlined above, run into problems, because anyone else can build the same thing. But a company (or an individual, in some cases) can generate a competitive edge for themselves by training their AI on proprietary datasets.
Once they have that data, they can use it to fine-tune a model to do specific tasks, sometimes at levels that exceed what a human is capable of. Even if the model is only average compared to humans, the sheer volume of outputs it can generate means these algorithms are extremely valuable.
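If "fine-tuning" sounds abstract, the following toy sketch in Python may help. It is a deliberately tiny stand-in, not how large models are actually fine-tuned: the "model" is a straight line with two parameters, the "generic" data plays the role of the open internet, and the small "specialized" dataset plays the role of proprietary data. All names and numbers here are assumptions invented for the illustration.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on a list of (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(w, b, data, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_data(n, slope, intercept, noise=0.1):
    """Noisy samples from the line y = slope*x + intercept."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, slope * x + intercept + rng.gauss(0.0, noise)) for x in xs]

generic = make_data(2000, 2.0, 1.0)    # large, widely available dataset
specialized = make_data(20, 2.0, 3.0)  # small dataset only "we" have
holdout = make_data(200, 2.0, 3.0)     # unseen examples of the specialized task

# "Pretraining": learn from scratch on the big generic dataset.
base_w, base_b = train(0.0, 0.0, generic, steps=200)
# "Fine-tuning": start from the pretrained parameters, take a few more
# steps on the small specialized dataset.
tuned_w, tuned_b = train(base_w, base_b, specialized, steps=50)

base_error = mse(base_w, base_b, holdout)
tuned_error = mse(tuned_w, tuned_b, holdout)
print(f"generic model on specialized task:    {base_error:.3f}")
print(f"fine-tuned model on specialized task: {tuned_error:.3f}")
```

The point of the sketch is the starting position: fine-tuning reuses what the generic model already learned and only needs a handful of specialized examples to adapt it, which is why a small proprietary dataset can matter so much.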
The companies I mentioned before will still have quite a competitive advantage. Consider that Google still has access to its own search data, a massive proprietary dataset that is worth billions of dollars. The same goes for Netflix, Amazon, Facebook, and so on.
As you move deeper into the world of AI, ask yourself: what kind of data do I have access to that nobody else does? Much of the economic opportunity moving forward (at least in industries where this tech can play a role) will be in these datasets. You can either use your data for your own purposes (such as pumping out specific types of content) or sell it to somebody else for their ML algorithms to gobble up.

Summary

  • Data is the core component of all AI systems.
  • It appears that sheer volume of data and computational power are enough to beat hand-crafted approaches in many AI applications.
  • The best way to create an edge with AI is to fine-tune it on a specialized, proprietary dataset that nobody else has.