Rendered at 02:17:31 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
Stagnant 11 hours ago [-]
Hard to see any practical benefit to go after common crawl. The situation of freely accessible crawled data is bad enough as it is with archive.org and CC being pretty much the only available sources. We need more initiatives like them, not less. The scary thing is how the anti-AI sentiment is being used to lock things down further.
conartist6 4 hours ago [-]
Scary how AI chokes off the very goodwill-oriented societies that it needs as its lifeblood, eh
Grimblewald 11 hours ago [-]
Its like they don't understand the problem common crawl solved rather neatly. You think the skid scrapers are bad? Wait till the competent players lose access to CC.
conartist6 10 hours ago [-]
Let's fucking go, asshats! Destroy the internet to show how valuable your product is.
mindcrime 10 hours ago [-]
These guys (the publishers) are fighting last year's war. Nobody (to a first approximation) gives a shit about going to the NY Times website, or The Guardian website, or the BBC website, etc. to find information. They expect to use search engines and AI services to find stuff, and then maybe click through to the source site(s) for more details or whatever.
The publishers need to rethink their entire take on how the Internet works or any "victory" they earn is going to be extremely Pyrrhic.
toomuchtodo 12 hours ago [-]
Crawling will go underground à la Anna’s Archive.
khelavastr 12 hours ago [-]
This is shady. Copyrighters absolutely not get to control use of their copyrighted material when people mentally, sonically, or physically reproduce it for personal use.
It's absurd to say "you can't record this book to a friend or robot".
Nobody seems to actually reproduce the copyrighted materials.
High-dimensional eigendecompositions which underpin AI similarity are some of the most literally derivative materials of texts that you can imagine.
conartist6 11 hours ago [-]
So you record a copy "for a friend" and then you sell lots of those copies as your business. All within your rights! What's mine is yours, my Comrade!
(my point being that it would be different if the product CommonCrawl provides were trained models, but this is not the case: its product is unlawful reproductions of copyrighted data for commercial use)
joshuaissac 10 hours ago [-]
> then you sell lots of those copies as your business
Common Crawl is not a business and is not selling anything.
conartist6 10 hours ago [-]
It's awfully hard to claim you aren't selling anything when you're giving other people's stuff away for free and they would be selling it if you hadn't, um, "saved them the trouble"
The publishers need to rethink their entire take on how the Internet works or any "victory" they earn is going to be extremely Pyrrhic.
It's absurd to say "you can't record this book to a friend or robot".
Nobody seems to actually reproduce the copyrighted materials.
High-dimensional eigendecompositions which underpin AI similarity are some of the most literally derivative materials of texts that you can imagine.
(my point being that it would be different if the product CommonCrawl provides were trained models, but this is not the case: its product is unlawful reproductions of copyrighted data for commercial use)
Common Crawl is not a business and is not selling anything.