The Washington Post recently posted a story about sites that have been used to train ChatGPT and other AI language models for chatbots. The story includes a tool that lets you look up your site to see if it’s been used. And I’m sorry to report that Silicon Florist has. You’re welcome…?
A web crawl may sound like a copy of the entire internet, but it’s just a snapshot, capturing content from a sampling of webpages at a particular moment in time. C4 began as a scrape performed in April 2019 by the nonprofit CommonCrawl, a popular resource for AI models. CommonCrawl told The Post that it tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.
To see if your site is in the Google C4 dataset, visit “Inside the secret list of websites that make AI like ChatGPT sound smart” from The Washington Post.