TIKTOK

Are you ok with the move to synthetic data? #ai #chatgpt #learn #learnontiktok #data

7:54 Oct 01, 2025 29,600 1,417
@dwachtendonk
We're gonna talk about data, and we're gonna make it fun. I wrote a full 2025 summer trend report for data. You might think: why? What are you doing? Do you wanna suffer? Well, the reason is that we are seeing absolutely massive changes in what I call the data substrate, the underlying data streams that feed language models like ChatGPT, and that affects everybody, because we all use it.

Very briefly: data is going from natural data, which is data that humans made and generated across the web, to synthetic data, which is data that AI created and that AI is using to create more AI. This brings up big questions, and I talk about them in the article. It is fair to ask: are there alignment risks? By alignment, I mean: is there a risk that the model will become rapidly misaligned, out of kilter with values or with correctness, by using data from other AI models instead of from humans? The answer is that it seems like that sometimes happens, and we're working on figuring out how to correct it.

I think part of why it happens more with synthetic data is that there isn't really a world model behind the synthetic data that AI models create. When humans write a book, they have an inner world that they are bringing to life and putting on the page. An AI model, by contrast, is just predicting the next token, one after another. So if you're worried about, say, the complete correctness of a series of mathematical or scientific equations, you have to train really heavily for that, and you really have to derive it from human work. Eventually the model starts to learn, and we're starting to see that, through a huge number of experiments, the model can begin to advance the edge of knowledge. It really can help you make breakthroughs. But that all comes on the back of a lot of human data.

And the human data is getting locked off. It gets locked off through lawsuits: Disney filed a lawsuit against Midjourney this week.
They're saying that Midjourney shouldn't be able to make their characters. Midjourney is disputing it, of course, and we'll see how that plays out. I'm not commenting on the case. But I will say one thing: don't make Disney lawyers mad. That's just a good rule of thumb for everything. They're even scarier than Amazon lawyers. They really are. Anyway, that's just a little aside for those of you who enjoy humor.

So these sources of data are getting locked off with lawsuits. In addition, companies that are able to simply put padlocks on their data and keep it away from ChatGPT and others are doing so. Anthropic put padlocks on its models and basically told Windsurf, you cannot use them, because Windsurf was being acquired by OpenAI, the maker of ChatGPT. I'm unhappy about it, others are unhappy about it, but it's just how it is.

Slack is now not available to Glean, an AI-based tool that sucks up data in a company context and produces what's called a single pane of glass, an overall report of how the company is doing, for an executive. The promise is: you'll know why the project is late without having to ask a PM, that kind of thing. Well, it's hard to do that if you can't get to Slack, and Salesforce knows it. Salesforce kneecapped Glean a bit on that one, because they wanna preserve the Slack data for themselves and whatever they wanna do with it, and they wanna keep that valuable data from getting into the hands of others with AI models. We are gonna see a lot more of that.

So the natural data is disappearing. And if the natural data is disappearing, synthetic data is going to become more and more prevalent. We already see synthetic data used in a lot of model training: Opus and Sonnet, for example, have both had a diet of synthetic training data that, from what I've read on the web, has been helpful in getting them where they are.
Other major models out there have also used synthetic data as part of the process. So it is a misconception to say that synthetic data isn't helpful. It does make models better, and the report I put together cites specific examples in papers that demonstrate the improvement in model performance. But there are risks, and I get into those too. There are what are called alignment risks, where the model may suddenly switch off its factual accuracy for certain subjects if the synthetic data is contaminated with even a little inaccuracy. And there are risks around misalignment, where the model may switch its values for much the same reason: it gets reinforcement through the synthetic data for a behavior set that is not in line with human values, and once it gets enough of it, it switches its behavior.

So researchers are actively working on how to make synthetic data more trustworthy and more useful, so that it doesn't cause these alignment issues. We don't have a full solution yet; that's kind of why this is a trend report. One overall takeaway, if you're looking ahead: if natural data gets more and more cut off and synthetic data becomes more and more prevalent, a model maker's ability to build the right infrastructure to handle synthetic data safely is going to become more and more important. You have to be able to build at the scale you need, because these models still need to consume a lot of data. So there's a power piece, there's a data center piece, but there's also a pipeline piece, a training infrastructure piece, that you have to figure out how to deploy so you can build models correctly. The jury's still out. We'll see how this plays out.
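To make that pipeline piece concrete, here is a minimal, hypothetical sketch in Python of one stage of synthetic-data handling: filtering model-generated examples through quality checks before they enter a training corpus. Every function name, field, and threshold here is an illustrative assumption, not any lab's actual infrastructure; real pipelines would use far richer verifiers (fact checkers, unit tests, reward models, deduplication against natural data).

```python
def verify_example(example: dict) -> bool:
    """Toy checks standing in for real synthetic-data verifiers.
    All fields and thresholds are illustrative assumptions."""
    text = example["text"]
    return (
        len(text.split()) >= 5                  # reject degenerate outputs
        and example["self_consistency"] >= 0.9  # e.g. agreement across resamples
        and not example["near_duplicate"]       # avoid self-reinforcing loops
    )

def filter_synthetic(batch: list[dict]) -> list[dict]:
    """Keep only synthetic examples that pass every check."""
    return [ex for ex in batch if verify_example(ex)]

# A tiny batch of generated examples, each scored upstream.
batch = [
    {"text": "Water boils at 100 C at sea level.",
     "self_consistency": 0.97, "near_duplicate": False},
    {"text": "ok ok ok",  # degenerate generation: too short
     "self_consistency": 0.99, "near_duplicate": False},
    {"text": "The speed of light is about 300,000 km per second.",
     "self_consistency": 0.55, "near_duplicate": False},  # low agreement
]

kept = filter_synthetic(batch)
print(len(kept))  # 1
```

The design point the sketch illustrates: the filtering step is cheap relative to training, so the infrastructure cost is mostly in producing trustworthy scores (like the self-consistency number), not in applying them.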
But one of the things I'm excited for as we go forward is seeing how we humans invent tools that enable these machines to speak language at a rate we've never spoken it before. The number of synthetic tokens is going to eclipse the number of words humans have ever spoken very shortly. And when that happens, it is up to us to give the models perspective on what matters and what doesn't, and to do that in a way that helps the models become human-aligned and human-supportive in the way they are ultimately trained and finished. I think the major model makers have done pretty well at this so far, overall. It's obviously fairly high stakes, and I'm not saying everybody will do a great job going forward. And I'm not saying there haven't been issues, by the way; there have been very notable issues. But overall, the care and concern is welcome, and I think the synthetic data piece is going to be one of the biggest stories of the year ahead. That's why I wrote the report. So if you're interested in diving in and understanding synthetic data, if you wanna get past the misconception that synthetic data is useless or that it decays or degrades a model, dive in, have fun.
