The Great AI Data Scam: Billions For GPUs, Nothing For Art


Wide view of a huge data centre built on top of piles of books, paintings and photographs, with small silhouetted artists underneath and business people celebrating in a glass room above.

Lede

AI companies can somehow track every click you make, but when it comes to tracking which artists they scraped, the database suddenly vanishes into the cloud.

What does not make sense

  • Training data is treated like free air, even when it is built from copyrighted books, photos, comics and fan art that someone spent years creating.
  • There is cash for new chips, shiny HQs and stadium sized data centres, but not even a token royalty per painting or paragraph.
  • Users get metered to the fraction of a token, yet we are told it is too technically difficult to count which artists were scraped and how often their style is mimicked.
  • Executives shout “publicly available” as if that magically means “ethically licenced”.
  • Lawsuits pile up from artists and authors, and the response is not a fair royalty system, just better PR and new legal disclaimers.
  • Regulators can force AI firms to publish summaries of their training data under the EU AI Act, yet nobody in power is in a hurry to build the equivalent of a performing rights society for visual art and writing.

Sense check / The numbers

  1. The LAION 5B dataset holds about 5.85 billion image–text pairs, scraped from web pages and filtered for training models like Stable Diffusion, using data that includes watermarked images. [OpenReview, LAION]
  2. Common Crawl, one of the main sources behind these giant datasets, continuously scrapes billions of web pages and releases petabytes of raw data for anyone, including AI giants, to train on. [Common Crawl]
  3. Data centre equipment and infrastructure spending reached about 290 billion dollars in 2024, with forecasts of a 1 trillion dollar market by 2030, heavily driven by AI workloads. [IoT Analytics]
  4. One analysis projects around 5.2 trillion dollars in capital expenditure will be needed to scale AI ready data centres in the coming years, a global build out of steel and silicon on the back of human generated data. [McKinsey]
  5. Global annual AI spending is expected to reach roughly 375 billion dollars by the end of 2025 and to climb past 3 trillion dollars per year by 2030, while individual artists still fight in court for basic recognition and payment. [Forbes]
  6. Visual artists have active lawsuits against Stability AI, Midjourney and others for allegedly copying and storing their works on company servers without permission, and judges have allowed key copyright claims to proceed. [Reuters, The Verge, The Guardian]
  7. From around 2026, the EU AI Act will require general purpose AI providers to respect copyright opt outs, publish summaries of training data and label AI generated content, a small start that still does not guarantee direct royalties. [EU AI Act]

The sketch

Scene 1: The Free Buffet
Panel description: A glossy AI conference stage. A slide reads “TRAINING DATA COST: 0”. In the foreground, a crowd of ghostly silhouettes labelled “writers”, “artists”, “photographers” watch through a glass wall
Dialogue:
Exec: “We trained on everything humans ever made. The data was just there.”
Artist: “So is the fire alarm. You still have to pay the crew.”
Scene 2: The Royalty Dashboard
Panel description: An exhausted illustrator at a laptop. On screen, two browser tabs: “GPU Budget” showing “2 000 000 000 USD”, and “Creator Royalty Portal” showing “404 – Not Found”.
Dialogue:
Illustrator: “Do I at least get 0.0001 for each time my style gets cloned?”
System message on screen: “Error: Artist not recognised as a data source.”
Scene 3: The Compliance Tour
Panel description: A regulator in a hi vis vest walks through a humming data centre. An AI company lawyer points proudly at a tiny sign on a server rack that reads “We respect copyright”.
Dialogue:
Regulator: “Where is the list of the works you trained on?”
Lawyer: “We published a summary.”
Regulator: “Where is the list of people you paid?”
Lawyer: “We published a summary.”



What to watch, not the show

  • The legal fiction that “publicly available” means “free to monetise at industrial scale”.
  • The power gap between individual artists and companies with multi billion dollar cloud budgets and full time legal teams.
  • The lack of a global collecting body for training data, like the music industry has for performance and mechanical rights.
  • The slow crawl of regulation that focuses on transparency and labelling, while side stepping direct payment for underlying works.
  • The move by some big rightsholders to sue and carve out private, lucrative licensing deals, while smaller creators remain unprotected.
  • The temptation for politicians to call this “innovation” and look away, as long as the data centres keep promising growth and jobs.

The Hermit take

If your shiny new intelligence only works because you quietly copied a generation of human thought for free, it is not clever, it is cheap. Pay the minds that fed your machine or admit the whole thing runs on stolen light.

Keep or toss

Keep the idea of useful tools that help artists and writers work faster, collaborate wider and reach more readers.
Toss the fantasy that you can build those tools on a mountain of uncredited books, comics, photos and songs, then tell the people who made them that the accounting is just too hard.


Sources

  • LAION 5B dataset overview – https://openreview.net/forum?id=M3Y74vmsMcY
  • LAION blog on LAION 5B and watermarked images – https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
  • LAION dataset guide linking to Stable Diffusion training – https://skywork.ai/skypage/en/The-LAION-Dataset-Your-Ultimate-Guide-to-the-Open-Source-Engine-of-Generative-AI/1975267571554512896
  • Common Crawl data grab analysis – https://medium.com/%40adnanmasood/inside-the-great-ai-data-grab-comprehensive-analysis-of-public-and-proprietary-corpora-utilised-49b4770abc47
  • Data centre equipment spending 2024 and 2030 forecast – https://iot-analytics.com/data-center-infrastructure-market/
  • Cost of compute and 5.2 trillion dollar data centre capex – https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar-race-to-scale-data-centers
  • Global AI spending projections 2025 to 2030 – https://www.forbes.com/sites/tylerroush/2025/12/01/nvidias-2-billion-synopsys-investment-makes-2025s-top-ai-deals-full-list-ranked/
  • Artists vs Stability AI and Midjourney lawsuit update – https://www.theverge.com/2024/8/13/24219520/stability-midjourney-artist-lawsuit-copyright-trademark-claims-approved
  • Visual artists’ amended lawsuit against AI image firms – https://www.reuters.com/legal/litigation/artists-take-new-shot-stability-midjourney-updated-copyright-lawsuit-2023-11-30/
  • Disney and Universal lawsuit against Midjourney – https://www.theguardian.com/technology/2025/jun/11/disney-universal-ai-lawsuit
  • Authors Guild and writers vs OpenAI and Microsoft – https://roninlegalconsulting.com/state-of-play-ai-v-copyright/
  • High level summary of the EU AI Act copyright and training rules – https://artificialintelligenceact.eu/high-level-summary/
  • EU AI Act 2026 changes for training data and copyright – https://scalevise.com/resources/eu-ai-act-2026-changes/
  • EU guidance on transparent AI systems and labelling – https://digital-strategy.ec.europa.eu/en/faqs/guidelines-and-code-practice-transparent-ai-systems
  • IP transparency obligations for general purpose AI – https://www.hsfkramer.com/notes/ip/2025-09/ai-regulation-hasnt-taken-a-summer-break-transparency-requirements-re-training-data-and-compliance-with-copyright-law-come-into-force
  • Key issue 5: Transparency obligations under EU AI Act – https://www.euaiact.com/key-issue/5

Satire and commentary. Opinion pieces for discussion. Sources at the end. Not legal, medical, financial, or professional advice.

Leave a Reply

Your email address will not be published. Required fields are marked *


Satire and commentary. My views. For information only. Not advice.


JOIN OUR NEWSLETTER
And get notified everytime we publish a new blog post.