Your Data Is Feeding AI Models — Here’s What the New Legal Risks Mean for You

If you’ve used a chatbot, an image generator, or a smart writing assistant in the past year, your words or images may have already helped train the next version of that tool. Training data often comes from public web pages and social media posts, and in some cases from private messages, collected without a clear way to opt out. Until recently, the legal question was mostly about privacy compliance: whether a company had a proper privacy policy. Now the focus is shifting to something more fundamental: where the training data came from in the first place.

What happened

In the past few months, lawmakers and regulators in several states have begun to treat the provenance of AI training data as a distinct legal risk. Illinois, which already had a biometric privacy law, postponed proposed rules on AI in employment, a sign that regulators are still working out how to handle data used to train hiring algorithms. Meanwhile, Colorado passed the nation’s first comprehensive AI law, which requires developers of “high-risk” AI systems to document and disclose information about the data used to train them. California’s privacy enforcement agency has also started investigating how personal data flows into model training sets, and Connecticut is drafting similar rules.

These aren’t just compliance technicalities. They reflect a growing recognition that the data used to train AI models often contains personal information that was collected without meaningful consent. The National Law Review has noted that training data provenance is becoming a central legal issue, as courts and regulators examine whether companies had the right to use that data in the first place.

Why it matters to you

Your personal data—comments on forums, product reviews, even your voice recordings from customer service calls—can end up in AI training datasets in ways you might not expect. Companies often scrape publicly available content, but “public” doesn’t always mean you intended it to be used for AI. A 2019 study found that a large portion of text used to train language models came from Reddit, where users may not have anticipated their posts being repurposed this way. More recently, lawsuits have been filed against companies that allegedly used copyrighted books, medical records, and private messages without permission.

The legal shift from privacy compliance to AI governance means you may have new rights. Colorado’s privacy and AI laws, for example, give residents rights to opt out of certain uses of their personal data and to correct personal data used in consequential automated decisions. Illinois’s Biometric Information Privacy Act already gives residents a right to sue when biometric identifiers such as their face geometry or voiceprint are collected without consent. If you live in a state that’s considering similar laws, such as California, Connecticut, or New York, you could soon have more control.

What you can do

Until the laws catch up everywhere, here are practical steps you can take today:

  1. Check the privacy settings on services you use. Many social platforms and AI tools now have a “do not train” toggle. For example, X (formerly Twitter) lets you opt out of having your data used for training, and ChatGPT offers a similar option in its data controls. Look for settings labeled “data sharing” or “improve AI models.”
  2. Review privacy policies for data use clauses. If a service says it can use your content “for research and development,” that often includes training AI. Consider whether you want to continue using that service.
  3. Use opt-out tools where available. Some organizations, like the Electronic Frontier Foundation, maintain lists of companies that offer data-use opt-outs. You may need to submit a request to each one.
  4. Be aware of what you post publicly. Anything you write in a public forum or review site could end up in a training set. If you’re concerned, consider using pseudonyms or avoiding personally identifiable details.
  5. Watch for state-specific rights. If you live in Colorado, California, Illinois, or Connecticut, check the website of your state attorney general or privacy regulator for information on how to exercise your rights regarding AI training data.

The situation is still evolving. Not all models are trained on public data, and some companies have already begun licensing data from publishers to avoid legal trouble. But the trend is clear: the source of training data is becoming as important as the privacy policy that supposedly covered its collection.

Sources

  • From Privacy Compliance to AI Governance: Why the Source of Training Data Is Becoming a Central Legal Risk – The National Law Review
  • Illinois Postpones Proposed Regulations on AI in Employment – The National Law Review
  • AI Regulatory Roundup: Recent Developments in Colorado, Connecticut, and California – Kelley Drye & Warren LLP
  • Global Privacy Watchlist – Mayer Brown