Welcome to the seventeenth edition of Black Box. Last time, I predicted every platform would become a recruiting platform. This time, I try to end the Reddit blackout. (Edit: This post has been cross-published by my friends at AMID AI. Go check them out!)
That a generative model’s performance deteriorates as it is trained on more generated data is something of an AI folk theorem. I first came across it in the now-canonical New Yorker piece, but those who aren’t familiar may find my recent encounter more entertaining. I was spending the day at an AI gaming startup. “You know,” one of the co-founders said casually while unwrapping her lunch, “we’re headed towards a hot girl singularity.” I had to ask.
It turns out the team was having trouble generating female characters with realistic proportions despite using several negative prompts. Their leading hypothesis was the internet is so full of hot girl pictures that the model had no idea what realistic means. “And it gets worse,” she continued. “All of the hot girls being generated right now means future models will be trained on even more hot girls. A couple models later, the internet will be nothing but pictures of hot girls.”
As silly as this sounds, the intuition is right! Researchers just demonstrated a generalization of this process, which they term model collapse. Their work has lots of interesting implications (I recommend Rob Horning’s take) and I immediately thought of the ongoing Reddit blackout. Like many, I don’t buy that Reddit is charging for API access solely to capture some of the value its content has created as training data. But it’s also not an invalid point. Only last week, I wrote about how LLMs are making every platform into a source of hiring signal. This is an understatement; LLMs actually make platforms more valuable for everything.
If we take this motive at face value, then the question is how platforms can monetize their content without gating access. Both users and the platforms themselves should want the platforms to be as open as possible: users need services that use APIs and platforms need users for content. The problem is that content is basically a public good and LLMs are free riders. Privatizing is one solution, but it’s like treating a fever by starving the patient to death. Instead, Reddit should take inspiration from hot girls.
In their paper, the researchers explicitly call out model collapse as separate from data poisoning. In particular, model collapse is an inherent property of models recursively trained on generated data, whereas data poisoning is the result of manipulated training data. This means platforms don’t need to introduce possibly unnatural engineered content to shake off free-riding models — models trained on too much generated data will poison themselves. All the platforms have to do is add naturalistic generated content to accelerate model collapse. They can then sell the human version to LLM companies, for whom provenance matters, while keeping the full version open to all.
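To make the mechanism concrete, here is a toy sketch of recursive training. It is not the researchers' experiment, just my own minimal illustration under strong assumptions: the "model" is a one-dimensional Gaussian fit, "human content" is drawn from a standard normal, and each generation is refit on a mix of fresh human data and samples from the previous generation's model. The function name `simulate` and the `human_fraction` parameter are hypothetical, chosen only for this example.

```python
import random
import statistics


def simulate(generations=200, n=20, human_fraction=0.0, seed=0):
    """Refit a Gaussian each generation on a mix of human and generated data.

    Returns the fitted standard deviation after each generation.
    Assumption: human content is N(0, 1) and the model is a Gaussian fit.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the human distribution
    n_human = round(n * human_fraction)
    history = []
    for _ in range(generations):
        # Fresh human content always comes from the true distribution.
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # Generated content comes from the previous generation's model.
        generated = [rng.gauss(mu, sigma) for _ in range(n - n_human)]
        data = human + generated
        # "Training" here is just a maximum-likelihood Gaussian fit.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    free_rider = simulate(human_fraction=0.0)  # trains only on generated data
    paying = simulate(human_fraction=1.0)      # trains only on human data
    print("free rider, fitted std after last generation:", free_rider[-1])
    print("paying customer, fitted std after last generation:", paying[-1])
```

In this toy setup, the model fed only its predecessors' output loses the spread of the original distribution over successive generations, while the one with access to human data keeps it. Real LLMs and real platform content are far messier, so treat this as intuition rather than evidence.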
This is an elegant solution on multiple fronts. Ethically, accelerating model collapse is strictly defensive; an LLM is impacted only if it avoids paying for original content. Economically, it effectively turns platform content into a quasi-public good by raising excludability. Functionally, it enables platforms to have their cake and eat it too. And mechanically, it harnesses an existing phenomenon instead of creating a new system. Platforms simply encourage successive models to exaggerate truly probable events into singularities (or more precisely, attractors).
It also seems to be a glaring breach of user trust. Users want to engage with other real users, which is why platforms spend considerable effort fighting fake accounts and bots. But I think it’s possible to extend generated content without losing users. The easiest way is to fight it less, or rather, throttle the effort so the ratio is just human enough for users. Another way is setting up “ghost towns” that mimic real user content without real users. On the other hand, platforms can incentivize users to contribute more original content by having them pay a nominal amount to verify their posts in exchange for a share of model profits.
So Mr. Huffman, tear down this paywall! Everything on Reddit has already been scraped, so only future content matters anyway. End the blackout by opening up your API again. Focus on including generated content in ways that are palatable to redditors. Believe in the power of hot girls and model collapse, and the monetization shall follow. ∎
Am I crazy or could this work? Let me know at @jwang_18 or reach out on LinkedIn.