What scraping means, how AI companies collect artwork, what Glaze and Nightshade actually do, and which layered defenses artists can realistically use to protect their work online.
The Problem: Your Artwork Can Be Collected Without Your Permission
Many artists have heard the word “scraping” and felt immediate alarm. Some people write it as “scrapping,” but the correct word is scraping. Scrapping means throwing something away. Scraping means collecting data from websites automatically.
When artists say their work is being scraped by AI companies, they mean this: automated software visits websites, social platforms, image galleries, blogs, portfolio pages, product listings, and public archives; collects images and nearby text; stores or indexes that material; and may use it to train or fine-tune artificial intelligence systems.
For visual artists, this is not a small technical issue. It reaches directly into authorship, livelihood, consent, credit, and artistic identity. A painting, illustration, digital artwork, character design, or photographic style may become part of a training dataset without the artist ever being asked. The artist may later see AI-generated images that imitate their visual language, weaken their market, or confuse the public about what is authentic human-made work.
This is why artists are searching for tools like Glaze, Nightshade, Kudurru, Have I Been Trained, robots.txt controls, AI crawler blocking, and other defenses. The hard truth is that no single tool can completely stop scraping. The better answer is a layered protection system.


What Does Scraping Mean?
Scraping is the automated extraction of information from websites. A scraper is usually a bot, crawler, or script. It does not browse the internet like a human. It requests web pages, reads the code, follows links, finds images, downloads files, and records useful information such as titles, captions, alt text, page text, tags, URLs, and metadata.
In the AI training context, scraping is used to build large datasets. For image-generation systems, the most valuable material is often an image-text pair: the image itself plus text that describes it. That text may come from alt text, captions, filenames, page descriptions, surrounding article text, product titles, hashtags, or user descriptions.
A famous example is LAION-5B, an image-text dataset containing billions of pairs. It was built by parsing Common Crawl data, locating image tags and their alt text, downloading the images, and filtering the resulting image-text pairs with machine-learning tools. [1]
This matters because AI systems do not only need images. They need images attached to words. If a dataset contains thousands of images labeled with artist names, movement names, style words, genre terms, and descriptive phrases, the AI system can learn associations between those words and visual patterns.
How AI Scrapes Artwork From the Internet
The basic pipeline usually looks like this:
First, a crawler starts with a list of websites, archives, search indexes, sitemaps, or URLs. It requests pages from those sites.
Second, it reads the page code and extracts image links. That may include portfolio images, product images, blog images, gallery images, thumbnails, social previews, and embedded media.
Third, it collects descriptive text. This may include alt text, captions, titles, tags, surrounding paragraphs, artist names, product names, and page metadata.
Fourth, the system downloads or indexes the images and creates records that pair images with text.
Fifth, the dataset is cleaned. Duplicates may be removed. Very small or low-quality files may be filtered out. Some systems use tools such as CLIP to measure whether the image and text seem to match.
Finally, the dataset is used to train or fine-tune a model. In image-generation systems, the model learns relationships between words and visual structures: colors, compositions, textures, objects, lighting, pose, medium, genre, and style.
This does not always mean the finished AI model stores a clean copy of every original image. Training is more complicated than copying and pasting images into a searchable folder. But the original works are still copied, processed, and used to shape a system that may later produce outputs influenced by those works. For artists, the consent issue remains.
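The extraction and cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration, not any company's actual pipeline: it parses a hard-coded HTML string instead of fetching pages, and the names `ImageTextExtractor`, `extract_pairs`, `SAMPLE_HTML`, and the `portfolio.example` domain are all hypothetical. It shows why alt text matters: an image with no descriptive text yields no image-text pair.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageTextExtractor(HTMLParser):
    """Collects (image URL, alt text) pairs from <img> tags in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        src = attr.get("src")
        alt = (attr.get("alt") or "").strip()
        # An image without descriptive text is not useful as an image-text pair.
        if src and alt:
            self.pairs.append((urljoin(self.base_url, src), alt))


def extract_pairs(html, base_url):
    """Return image-text pairs with duplicate image URLs removed."""
    parser = ImageTextExtractor(base_url)
    parser.feed(html)
    seen = set()
    unique = []
    for url, alt in parser.pairs:
        if url not in seen:  # crude stand-in for the deduplication step
            seen.add(url)
            unique.append((url, alt))
    return unique


# Hypothetical page content, used here instead of a real network request.
SAMPLE_HTML = """
<html><body>
  <img src="/art/heron.jpg" alt="Watercolor heron at dusk by Example Artist">
  <img src="/art/heron.jpg" alt="Duplicate of the same file">
  <img src="/ui/spacer.gif" alt="">
</body></html>
"""

pairs = extract_pairs(SAMPLE_HTML, "https://portfolio.example")
print(pairs)
```

Real dataset builders add further filtering, such as the CLIP-based matching mentioned above, which this sketch leaves out.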


Why AI Companies and Model Builders Scrape Artwork
AI systems require enormous amounts of training data. A model trained on a small, narrow dataset will have limited visual range. A model trained on billions of image-text pairs can learn more subjects, styles, media, lighting conditions, compositions, and cultural references.
From the AI developer’s point of view, scraping is fast, scalable, and cheap compared with licensing every image individually. From the artist’s point of view, that is exactly the problem. The public internet has often been treated as if availability equals permission. Artists know that is false. Public viewing is not the same as consent to train a commercial system.
This tension is now one of the central conflicts of the AI era: AI companies want vast datasets; artists want consent, attribution, compensation, and control.
Can Scraping Really Be Stopped?
Not completely.
A public image on the open internet can usually be downloaded by anyone who can view it. Even if right-click is disabled, a browser still needs to load the image for a human viewer to see it. Anything visible can usually be copied, screenshotted, cached, rehosted, or collected.
However, “cannot be stopped completely” does not mean “do nothing.” Artists can reduce risk, create friction, document ownership, signal legal restrictions, block cooperative crawlers, interfere with training usefulness, and make unauthorized scraping more expensive or less reliable.
The correct goal is not perfect protection. The correct goal is layered defense.


Glaze: What It Is and What It Does
Glaze is one of the most important artist-facing tools created in response to AI style mimicry. It was developed by researchers at the University of Chicago’s SAND Lab. Glaze applies what the researchers call “style cloaks” to an image before the artist posts it online. These cloaks are subtle pixel-level changes that are intended to look minor to human viewers but confuse AI models that try to learn the artist’s style. [2]
Glaze does not stop a scraper from downloading the image. That distinction is important. Glaze is not a lock on the file. It is a defensive alteration of the image so that, if the scraped image is used for training or fine-tuning, the AI model receives a distorted signal about the artist’s style.
In plain language: Glaze tries to make the artwork less useful as training data for copying your style.
The Glaze team has also released WebGlaze, a browser-based option for artists who cannot run the desktop software locally. The Glaze project says the desktop app runs locally and does not send artwork back to the team, while WebGlaze processes images through University of Chicago servers and deletes both original and processed images after completion. [3]
What People Are Saying About Glaze
Glaze has received serious attention from major technology, art, and media publications. The Glaze project’s own media page lists coverage from The New York Times, MIT Technology Review, TechCrunch, BBC, CNN, NBC News, Scientific American, Business Insider, The Register, Artnet, and others. It also notes awards and recognition, including a USENIX Internet Defense Prize and a distinguished paper award. [4]
The positive view is that Glaze gives artists a practical tool where they previously had almost none. It allows artists to keep posting online without simply surrendering their style to automated collection. Many artists see this as a necessary act of self-defense.
The critical view is that Glaze is not permanent, not foolproof, and not equally effective across every style or attack method. The Glaze team itself acknowledges that it is not a panacea and that future algorithms may weaken current protections. It also notes that Glaze is less effective when the target style is already deeply represented in base models, such as broad historical styles or popular visual genres. [5]
A major 2024 attack paper argued that adversarial image protection tools, including Glaze, Mist, and Anti-DreamBooth, can be weakened or bypassed by relatively accessible preprocessing techniques such as noisy upscaling. [6] The Glaze team responded by releasing updates intended to improve robustness and argued that imperfect protection can still be valuable, comparing it to antivirus software, firewalls, and spam filters: useful, but never final. [7]
That is the most balanced reading of Glaze. It is not magic. It is not useless. It is a moving defense in a moving conflict.


Nightshade: The More Aggressive Companion Tool
Nightshade comes from the same broader University of Chicago research ecosystem. While Glaze is defensive, Nightshade is more aggressive. It is designed to “poison” training data by making an image appear normal to humans but misleading to AI training systems. The goal is to cause models trained on scraped protected images to learn wrong associations.
For example, a poisoned image of one subject may push the model toward a different concept during training. The Nightshade research argues that targeted poisoning can affect model behavior with fewer images than many people assumed. [8]
Nightshade became extremely popular among artists after release. Artnet reported that it was downloaded more than 250,000 times in less than a week. [9]
Nightshade is best understood as a deterrent and collective defense. It only matters if protected images are scraped and used in future training. It does not remove artwork already inside existing models. It also raises more ethical and strategic questions because it is intentionally adversarial. Still, for artists whose work is being used without consent, Nightshade represents a form of resistance.
Kudurru: Blocking Scrapers at the Website Level
Kudurru, developed by Spawning, takes a different approach. Instead of altering the image itself, Kudurru is designed as a web scraping defense network. It identifies scraping behavior and blocks scraper access across participating domains. It also offers the option to send bad or substitute data to scrapers. [10]
This is useful because image-level tools and website-level tools solve different problems. Glaze changes the image. Kudurru tries to stop or disrupt the scraper before it collects the image.
The weakness is coverage. A scraper can only be blocked if it is detected, and Kudurru is strongest when many sites participate. It also helps most when the artist controls the website where the art is hosted. It cannot fully protect images already uploaded to social platforms the artist does not control.
For artists using WordPress, Kudurru is worth watching closely because Spawning announced it would begin with an easy-to-install WordPress plugin. That makes it especially relevant for independent artists, portfolio sites, art blogs, and ecommerce stores.
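To make "identifying scraping behavior" more concrete, here is one very simple signal a site could watch for: a single client requesting far more images in a short window than a human visitor plausibly would. This is only an illustrative sketch and is not a description of how Kudurru works internally. The class name `RateFlagger`, its thresholds, and the client identifiers are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flags clients that exceed a request budget within a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def record(self, client_id, timestamp):
        """Record one request; return True if the client is over budget."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Tiny thresholds so the behavior is visible in a short example.
flagger = RateFlagger(max_requests=3, window_seconds=1.0)
burst = [flagger.record("bot", t) for t in (0.0, 0.1, 0.2, 0.3)]
slow = [flagger.record("visitor", t) for t in (0.0, 5.0, 10.0)]
print(burst, slow)
```

Rate alone is a weak signal. Scrapers can slow down or spread requests across many addresses, which is part of why a shared network of participating sites has more to work with than any single site.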


Have I Been Trained and Do-Not-Train Registries
Have I Been Trained was created to help artists search whether their images appeared in certain AI training datasets and to opt out through Spawning’s do-not-train registry. At the time of this research, the public Have I Been Trained site appeared to be under maintenance, but it still referenced the Do-Not-Train registry for AI trainers seeking to respect opt-outs. [11]
This kind of tool is useful for awareness and signaling. It can help artists see whether their work has appeared in known datasets and express a preference against future use.
The limitation is enforcement. An opt-out registry only works when model builders respect it. It may not remove art already used in prior models. It may not affect companies that do not check the registry. It should be used, but not treated as sufficient protection by itself.
Robots.txt, AI Crawler Blocking, and Cloudflare
If an artist controls their own website, they should use website-level defenses.
A robots.txt file is a plain text file at the root of a website that tells crawlers which paths they may and may not access. Traditional search engines have honored robots.txt for decades. AI companies and crawl operators now publish crawler tokens such as GPTBot, Google-Extended, CCBot, ClaudeBot, Applebot-Extended, and others.
OpenAI’s documentation says website owners can allow OpenAI search crawling while disallowing GPTBot for training use. Google’s documentation says Google-Extended can be used to control whether content Google crawls may be used for training future Gemini models and related AI uses, without affecting Google Search inclusion. [12]
However, robots.txt is not a wall. It is a request and a rights signal. Cooperative crawlers may respect it. Bad actors may ignore it.
Cloudflare’s AI crawler tools are stronger because they can create and manage robots.txt instructions and also enforce blocking through AI Crawl Control. Cloudflare itself warns that robots.txt compliance is voluntary and that crawlers may disregard the directives. [13]
For artists with their own domain, Cloudflare-style active blocking is one of the strongest practical website-level defenses currently available.
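The difference between requesting and enforcing can be shown with a small server-side sketch. This is not Cloudflare's implementation or API. It is a hypothetical Python WSGI wrapper, with assumed names `block_ai_crawlers`, `BLOCKED_TOKENS`, and `gallery_app`, that refuses requests whose User-Agent header contains a known crawler token. Google-Extended and Applebot-Extended are left out on the understanding that they are robots.txt control tokens rather than strings sent in request headers.

```python
# Lowercase substrings to look for in the User-Agent header (illustrative list).
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider", "meta-externalagent")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so that matching crawlers receive 403 instead of content."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Automated AI training crawlers are not permitted.\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)

    return middleware


def gallery_app(environ, start_response):
    """Stand-in for the real website."""
    body = b"gallery page\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


protected_app = block_ai_crawlers(gallery_app)
```

A crawler that lies about its User-Agent passes straight through a check like this, which is why managed services combine it with other signals. Still, it turns a polite request into an actual refusal for bots that identify themselves honestly.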
A starter robots.txt block might look like this:
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /
This list should be updated regularly. AI crawler names change. New bots appear. Some companies separate search crawlers from training crawlers. Some do not.
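Because a typo in robots.txt fails silently, it can help to test the rules before publishing them. The sketch below uses Python's standard `urllib.robotparser` module to check how a compliant crawler would read the starter block above. The helper name `is_allowed` and the `portfolio.example` URL are assumptions for illustration, and the result only describes crawlers that choose to follow the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: meta-externalagent
Disallow: /
"""

AI_TOKENS = (
    "GPTBot", "Google-Extended", "CCBot", "ClaudeBot",
    "Applebot-Extended", "Bytespider", "meta-externalagent",
)


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "https://portfolio.example/gallery/heron.jpg"  # hypothetical URL
ai_access = {token: is_allowed(ROBOTS_TXT, token, PAGE) for token in AI_TOKENS}
search_access = is_allowed(ROBOTS_TXT, "Googlebot", PAGE)

print(ai_access)      # every listed AI token should map to False
print(search_access)  # an ordinary search crawler is not affected by these rules
```

Since the file contains no rule for other crawlers, ordinary search indexing remains allowed, which is usually what an artist who still wants to be found will want.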

Cara and “No AI” Portfolio Platforms
Cara is an artist-focused social and portfolio platform created in response to artist concerns about AI scraping and AI-generated content. It has integrated Cara Glaze, allowing users to apply Glaze through the platform. Cara describes Glaze as a tool designed to protect artists from style mimicry by generative AI models and recommends artists Glaze work before posting online. [14]
Cara’s advantage is cultural and practical. It is built around artist concerns, discourages AI-generated uploads, and integrates a major protection tool.
Its weakness is that no social platform can guarantee complete protection from scraping. Some artists have also criticized Cara for being closed and proprietary rather than an open, self-hosted solution. [15] For many artists, Cara is useful as part of a visibility strategy, but it should not replace an owned website and independent archive.
Watermarks, Low-Resolution Images, Metadata, and Copyright Notices
Traditional protections still matter, but they should be understood correctly.
A visible watermark can discourage casual theft and help viewers identify the source, but it may be cropped, blurred, cloned out, or ignored during AI training.
Low-resolution uploads reduce the commercial usefulness of copied images, but they do not stop scraping. Many AI systems can still train on lower-resolution images.
Metadata and copyright notices help document ownership, but metadata can be stripped.
A clear “No AI Training” notice can support your rights position, but it does not physically stop a scraper.
Copyright registration and organized provenance records do not prevent scraping, but they strengthen the artist’s ability to enforce rights later.
These methods are not obsolete. They are just incomplete.

Review of the Main Options
Best image-level defense: Glaze.
Glaze is the best first-line tool for artists who want to keep sharing work online while reducing the risk of style mimicry. It is free, artist-focused, widely discussed, and actively updated. Its main weakness is that it is not permanent or invincible.
Best aggressive deterrent: Nightshade.
Nightshade is the strongest deterrent-style tool for artists who want to make unauthorized training riskier for model builders. It should be used carefully and selectively, especially on work that strongly represents your signature style.
Best website-level defense: Cloudflare AI crawler blocking plus robots.txt.
For artists who control their own website, active crawler blocking is stronger than simply hoping platforms protect them. Robots.txt should still be used because it creates a clear machine-readable rights signal.
Best WordPress/site-network option: Kudurru.
Kudurru is promising because it focuses on detecting and blocking scrapers rather than only modifying images. Its effectiveness depends on adoption and implementation.
Best audit/opt-out option: Have I Been Trained and Do-Not-Train registries.
These tools are useful for awareness and rights signaling, but they depend on AI companies choosing to respect them.
Best artist-community platform option: Cara with Cara Glaze.
Cara is useful for artists who want an anti-AI portfolio environment and built-in Glaze access, but it should be treated as one publishing channel, not total protection.
Best basic protection stack: watermark, lower resolution, copyright notice, metadata, provenance records.
These do not stop AI scraping, but they support attribution, discourage casual misuse, and help with enforcement.

My Recommendation: Use a Layered Artist Protection System
The best solution is not one tool. The best solution is a workflow.
Before posting artwork online, keep a high-resolution master file offline. Export a public-facing version at a reasonable display size. Add a subtle but visible signature or watermark. Preserve metadata where possible. Then run the public-facing version through Glaze. For highly distinctive works that define your signature style, consider using Nightshade as well.
Post the protected image first on your own website, not only on social media. Your own site gives you more control. Add a clear copyright and no-AI-training notice to your website terms. Use robots.txt to disallow known AI training crawlers. If possible, use Cloudflare AI Crawl Control or equivalent bot protection to enforce blocking rather than merely request it.
If you use WordPress, watch Kudurru and similar scraper-defense tools. If you use social media, assume those images are public and vulnerable. Post lower-resolution, Glazed, watermarked previews and direct people back to your own site for full collections, purchases, licensing, and official information.
Use Have I Been Trained or similar tools when available to check known datasets and submit opt-outs. Register high-value works or major collections with the copyright office when appropriate. Keep dated source files, layered files, sketches, prompts, exports, invoices, product listings, and publication records.
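One low-effort way to keep dated records is a hash manifest: a list of every file in an archive folder with its SHA-256 fingerprint, size, and modification time. The sketch below is one possible approach using only Python's standard library. The function names `sha256_of` and `build_manifest` and the sample file name are hypothetical, and the temporary folder stands in for a real archive of master files.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large master files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Describe every file under a folder: relative path, hash, size, mtime."""
    folder = Path(folder)
    entries = []
    for path in sorted(p for p in folder.rglob("*") if p.is_file()):
        stat = path.stat()
        entries.append({
            "file": path.relative_to(folder).as_posix(),
            "sha256": sha256_of(path),
            "bytes": stat.st_size,
            "modified_utc": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
        })
    return {
        "generated_utc": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }


# Demonstration with a throwaway folder and a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "heron_master.png").write_bytes(b"example bytes")
    manifest = build_manifest(tmp)

print(json.dumps(manifest, indent=2))
```

A manifest by itself does not prove when a work was made, since file timestamps can be changed. It is most useful alongside independent records such as registrations, invoices, and publication dates, where matching hashes show that the file you hold is the same file those records refer to.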
The strongest artist defense is not secrecy. Artists need visibility. The strongest defense is controlled visibility: show the work, but do not hand over the cleanest, highest-resolution, easiest-to-train version without friction.

Final Verdict
Can AI scraping be stopped completely? No.
Can artists reduce risk? Yes.
Is Glaze worth using? Yes, especially for artists posting original visual work online. It is the best current image-level defense against style mimicry, but it should not be treated as a permanent shield.
Is Nightshade worth using? Yes, selectively, especially as a deterrent against unauthorized future training.
What is the best overall option? A layered system: Glaze for image-level protection, Cloudflare or equivalent controls for website-level blocking, robots.txt for rights signaling, Kudurru or similar tools for scraper defense, lower-resolution watermarked uploads for public display, opt-out registries for documentation, and copyright/provenance records for enforcement.
Artists should not have to defend their work from being harvested without consent. But until laws, platforms, and AI companies catch up, artists need practical defenses. The goal is not paranoia. The goal is sovereignty: control over where your work appears, how it is used, and whether your creative identity becomes raw material for someone else’s machine.





