There are a lot of sensationalist headlines in the media surrounding AI, but there is one topic that deserves to be shouted about more loudly. AI needs training. And that training needs data. Good data. And yet, while we continue to see new AI tools released almost daily, there's very little fuel left for the fire. The headline numbers speak for themselves: as access to good-quality data is increasingly restricted from LLMs, we're heading very quickly towards a bottleneck where fancy-looking AI tools simply don't deliver on what they promise.
D-Day Is Closer Than You Think
AI models are voracious. The exponential growth of the industry is far outpacing anything else we have ever seen in the Internet age. The amount of useful original content on the internet can’t keep up with the appetite that these large-scale foundational models have. It’s a bit like feeding a bodybuilder a Happy Meal every day and expecting them to feel satiated. We’re going to hit a wall soon.
I think you could argue we've already reached that point: if you look at most of the work labs are now doing with AI, the focus is no longer on quantity but on quality, shifting from simply accumulating massive datasets to extracting more value from the data we already have. Take Microsoft as one example. They have built their Phi series of models around an ethos of "We don't care about the scale of the data. We care about the quality of the data." They have done a lot of research on training models on what they call textbook-quality data, and the results have been phenomenal: models perhaps one-hundredth the scale of state-of-the-art frontier models, achieved just by getting deep into data quality.
As a side note, there is, of course, still untapped potential in existing data sources. There is probably a lot of data out there in the world that isn't particularly machine-readable at the moment, whether it simply needs to be processed and cleaned, or codified into a usable format.
Synthetic Data Is Perfect in the Lab, but Flawed in the Wild
As we exhaust high-quality openly available data, many are turning to synthetic data as a potential solution. Synthetic data is extremely powerful: it allows you to generate data that perfectly suits whatever your current objective is.
However, synthetic data requires careful curation and quality control. The main consideration, at least in my own experience, is that while it can be generated relatively easily, there is still a huge amount of post-processing to do, because you are trusting the outputs of an LLM. You need filtering and quality-evaluation mechanisms to ensure the data you generate synthetically carries the signal you need to train your models and also reflects real-world applications.
This means that synthetic data can be really expensive to produce. You may also need a good amount of seed data before you can start generating synthetic data of your own.
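To make that concrete, here's a minimal sketch of the kind of post-processing I mean, in Python. Everything in it is illustrative: the record format, the thresholds, and especially the judge_quality hook, which in a real pipeline would be an LLM-as-judge or a task-specific evaluator rather than the crude repetition heuristic used here.

```python
import hashlib

def judge_quality(text: str) -> float:
    """Stand-in quality score. In a real pipeline this hook would call an
    LLM-as-judge or a task-specific classifier; here it just penalises
    heavy repetition, a common failure mode of generated text."""
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

def filter_synthetic(records, min_chars=200, min_score=0.5):
    """Keep only synthetic records that are long enough, are not exact
    duplicates, and pass the (stand-in) quality judge."""
    seen_hashes = set()
    for rec in records:
        text = rec["text"].strip()
        if len(text) < min_chars:
            continue  # too short to carry useful training signal
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier sample
        seen_hashes.add(digest)
        if judge_quality(text) < min_score:
            continue  # fails the quality check
        yield rec

# Toy usage: only the substantive, non-repetitive record survives.
synthetic_batch = [
    {"text": "Too short."},
    {"text": "The gradient of the loss with respect to each weight tells the "
             "optimiser how to adjust that weight to reduce the error on the "
             "current batch of training examples."},
    {"text": "word " * 100},
]
kept = list(filter_synthetic(synthetic_batch, min_chars=50))
print(f"kept {len(kept)} of {len(synthetic_batch)} synthetic samples")  # kept 1 of 3
```

Even a toy filter like this makes the point: the expensive part of synthetic data isn't generating it, it's deciding what to keep.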
Data as Money
As public data sources become exhausted, we're also seeing a trend toward the privatization of valuable training data. Companies like Reddit are beginning to charge for data access (although the annual $60 million price tag seems ridiculously cheap), and I think there is definitely scope for companies with proprietary datasets to sell to foundational model companies, especially where the data is dynamic and continues to be updated, very different from something like encyclopedic knowledge.
But there's more to consider beyond how to value data. There are ethics. Take healthcare as just one example: while some may argue for the open sharing of valuable data for the benefit of humanity, the reality is that most companies view their data as a valuable asset. Even Google's AlphaFold, while appearing to be a torch-bearer for the future of humanity, is still a financially motivated project. In my opinion, a multi-billion-dollar return will almost certainly outweigh any other considerations.
The Big Clean-up
Finally, there's a major shift in focus towards extracting more value from existing datasets, and it is opening up new opportunities in data processing and analysis. I feel like a lot of the innovation is going to be in ways of cleaning data and getting more signal out of what we already have, rather than just piling more data on top.
I should note that as soon as you start cleaning, you start introducing the opinions of whoever is doing the cleaning about what is and isn't useful. The challenge lies in balancing the need for data cleaning with the risk of introducing bias.
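One way to keep those opinions honest is to make every cleaning decision an explicit, named rule rather than burying it in an opaque pipeline. The sketch below is purely illustrative; the rules and thresholds are assumptions, not recommendations.

```python
# Each rule is an explicit, named judgement call by the curator.
# Keeping them inspectable makes the bias those judgements introduce
# easier to discuss and audit.
CLEANING_RULES = [
    ("too_short",      lambda t: len(t) < 100),
    ("mostly_symbols", lambda t: sum(c.isalpha() for c in t) / max(len(t), 1) < 0.6),
    ("boilerplate",    lambda t: "click here to subscribe" in t.lower()),
]

def clean(documents):
    """Apply each rule in turn and record why every document was dropped,
    so filtering decisions can be reviewed rather than silently applied."""
    kept, dropped = [], {}
    for doc in documents:
        reason = next((name for name, hits in CLEANING_RULES if hits(doc)), None)
        if reason is None:
            kept.append(doc)
        else:
            dropped[reason] = dropped.get(reason, 0) + 1
    return kept, dropped

docs = [
    "Click here to subscribe to our newsletter and never miss another update "
    "from our team covering everything you could possibly want to read about.",
    "#" * 150,
    "A substantive paragraph with actual informational content in it. " * 3,
]
kept, dropped = clean(docs)
print(len(kept), dropped)  # 1 {'boilerplate': 1, 'mostly_symbols': 1}
```

The report of what was dropped, and why, is the important part: it turns the curator's taste into something a reviewer can argue with.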
It's also crucial to be aware of potential biases in the data itself. There are a lot of keyboard warriors online, and at that volume an extreme view can start to look commonplace. Ask real people in real life whether they hold that opinion, though, and the vast majority don't. This really highlights how important it is to carefully curate our data and consider real-world perspectives when working with online data sources. We need the right mix of quality and representation.
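A small, purely illustrative example of one such curation step: capping how much any single author can contribute to the training mix, so a vocal minority can't dominate by sheer volume. The field names and the cap are assumptions for the sake of the sketch.

```python
from collections import defaultdict

def cap_per_author(posts, max_per_author=5):
    """Cap how many posts any single author contributes, so high-volume
    posters cannot dominate the mix purely by output rate.
    Assumes each post is a dict with hypothetical 'author' and 'text' keys."""
    counts = defaultdict(int)
    capped = []
    for post in posts:
        if counts[post["author"]] < max_per_author:
            counts[post["author"]] += 1
            capped.append(post)
    return capped

# Toy example: one prolific poster, one occasional one.
posts = [{"author": "loud_user", "text": f"hot take #{i}"} for i in range(50)]
posts.append({"author": "quiet_user", "text": "a measured opinion"})
print(len(cap_per_author(posts)))  # 6: five from loud_user, one from quiet_user
```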
I think we’re about to see an evolution from who can build the next big thing to who can most effectively leverage existing data sources. This will require moving from simply accumulating more data to developing sophisticated techniques for data cleaning, analysis, and synthetic data generation.
The AI companies that solve the data problem first will be the ones that dominate the market. That's why all players in the AI space need to start treating data with the same reverence they give to model architecture.