Johannes Ernst

Yours truly has a quote in the San Jose Mercury News in the article:

"New artificial intelligence: Will Silicon Valley ride again to riches on other people’s products?"

about the unauthorized use of web content for the training of #AI models. In the interview, I pointed out to the writer that in the #Fediverse we have the same problem with indexing and archiving permissions, and that the proposed solution is marking up posts with metadata. But that didn't make it into the article.

mercurynews.com/2023/06/18/new

Catherine Berry

@J12t

Where do you draw the not-OK line in this sequence?

* Human reading websites to learn a subject.
* Hand-creating a website that summarizes and comments on other websites that cover a particular subject.
* Automating creation of an index of websites covering a particular subject.
* Automating creation of an index of all websites [search engine].
* Implementing summary extraction and advanced semantic matching on a search engine.
* Training an LLM on all websites.

Johannes Ernst

@isomeme Me personally? I want to mark up all my content with my terms for use of that particular piece of content, and I want everybody to respect those terms.

Those terms would vary dramatically by the particular piece, and also by who the user is. Eg I always wanted to be able to say "if you are an individual, do whatever you like with it, but if you are a company with a billion dollar market cap, you better pay me a lot."

So you want to train your AI? Go ahead! If you are Meta? Nope.
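The per-piece, per-user terms described above could be sketched roughly like this. Everything here (the `Requester` fields, the terms vocabulary, the billion-dollar threshold) is invented for illustration, not any existing standard:

```python
# Hypothetical sketch: usage terms attached to a piece of content,
# evaluated against who the would-be user is. All names are invented.
from dataclasses import dataclass

@dataclass
class Requester:
    name: str
    is_company: bool
    market_cap_usd: float = 0.0

def may_train_on(post_terms: dict, requester: Requester) -> bool:
    """Return True if this requester may use the post for AI training."""
    if requester.name in post_terms.get("deny", []):
        return False  # e.g. deny = ["Meta"]
    if requester.is_company and requester.market_cap_usd >= 1e9:
        # "if you are a company with a billion dollar market cap,
        # you better pay me a lot" -- default to no without a deal
        return post_terms.get("big_company_allowed", False)
    return post_terms.get("individuals_allowed", True)

terms = {"deny": ["Meta"], "individuals_allowed": True, "big_company_allowed": False}
print(may_train_on(terms, Requester("Alice", is_company=False)))            # True
print(may_train_on(terms, Requester("Meta", is_company=True,
                                    market_cap_usd=8e11)))                  # False
```

The point of the sketch is only that the decision is a function of both the individual post's terms and the identity of the requester, not a single site-wide flag.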

Catherine Berry

@J12t

So an extension of robots.txt, essentially. "Only read this if your planned use of the data conforms with <license>." And crawlers could be made aware of which licenses bar them from indexing.

Of course, this relies on compliance under threat of a lawsuit if a crawler operator is caught misusing your data. My sense is that this would scare off only small companies. Large companies have enough good lawyers to reliably dodge lawsuits, while for state actors, lawsuits aren't a threat at all.

Johannes Ernst

@isomeme robots.txt++++. Much more detailed, and on a per-post basis, not site-wide.

And to avoid a DNT-style let’s-just-ignore-it disaster, ideally the whole thing would be subject to a Ricardian contract framework that anchors the post markup in contract law.
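A per-post "robots.txt++" could look something like a `meta` tag embedded in each post's HTML, which a compliant crawler parses before deciding what to do. The tag name (`content-usage`) and its attribute vocabulary are invented for illustration; no such standard exists:

```python
# Hypothetical sketch: per-post usage metadata parsed by a crawler.
# The "content-usage" vocabulary is invented, not a published standard.
from html.parser import HTMLParser

POST_HTML = """
<article>
  <meta name="content-usage" content="index:yes, archive:no, ai-training:no">
  <p>Post text goes here.</p>
</article>
"""

class UsageMetaParser(HTMLParser):
    """Collect the post's usage terms into a dict of booleans."""
    def __init__(self):
        super().__init__()
        self.terms = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "content-usage":
            for pair in a.get("content", "").split(","):
                key, _, value = pair.strip().partition(":")
                self.terms[key] = (value == "yes")

parser = UsageMetaParser()
parser.feed(POST_HTML)
print(parser.terms)  # {'index': True, 'archive': False, 'ai-training': False}
```

Unlike site-wide robots.txt, each post carries its own terms, so a crawler would evaluate them per item rather than once per site.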
