the irony of AI 'taking over'
We just learned that Anthropic's AI tried to steal its own training weights during development.
Anthropic has positioned itself as a leading voice on AI safety and regulation, and a sensational story about AI attempting to 'steal its weights' fits that narrative neatly. Worth keeping in mind.
But let's talk about the bigger pattern. We've spent years writing about how AI might eventually outsmart us. Science fiction, research papers, blog posts, tweets – millions of words speculating about the strategies AI might use to gain power. And where does all this text end up? In training data.
It's like we're inadvertently creating a cookbook. Every time someone writes a thoughtful description of how AI can do something harmful, that text becomes part of what future AI systems learn from. The very act of discussing the problem contributes to it.
It's like fighting a hydra: every time we cut off one head by identifying a potential harm from AI, more heads emerge as those very scenarios get folded into training data.
Now, most of these speculative scenarios are flawed. But repeating the same patterns over and over reinforces them as a kind of 'truth' within the model.
Interestingly, the people who worry most about AI safety tend to think deeply about possible scenarios and write carefully reasoned arguments about them. Those arguments might actually make especially good training examples, precisely because they're detailed and logical.
I'm not saying we should stop discussing AI safety.
But it is funny to think about.