The biggest issue with generative AI, at least to me, is that it’s trained on human-made works whose original authors didn’t consent to, or even know about, their work being used to train the AI. Are there any initiatives to address this? I’m thinking of something like an open source AI model plus a training data store that only contains works in the public domain or under highly permissive no-attribution licenses, as well as original works submitted by the open source community and explicitly licensed to allow AI training.
I guess the hard part is moderating the database and ensuring all works are licensed properly and people are actually submitting their own works, but does anything like this exist?
As I understand it, there are many, many such models, especially ones made for academic use. Some common training corpora are listed here: https://www.tensorflow.org/datasets
Examples include Wikipedia edits and discussions, and open-access scientific articles.
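If you want to poke at one of those corpora yourself, it only takes a few lines with the `tensorflow-datasets` package. A minimal sketch, assuming you have it installed; the Wikipedia config names are dated snapshots, so the exact one may differ in your catalog version (and note the full English dump is a large download):

```python
# Minimal sketch: load a snapshot of the Wikipedia corpus from the TFDS catalog.
# The config name "20190301.en" is one dated snapshot; yours may differ.
import tensorflow_datasets as tfds

ds = tfds.load("wikipedia/20190301.en", split="train")

for example in ds.take(1):
    # Each record is a dict of tensors; decode the raw bytes to text.
    print(example["title"].numpy().decode("utf-8"))
    print(example["text"].numpy().decode("utf-8")[:200])
```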
Almost all research models are going to be trained on stuff like this. Many of them have demos, open code, and local installation instructions. They generally don’t have a marketing budget. Some of the models listed here certainly qualify: https://github.com/eugeneyan/open-llms?tab=readme-ov-file
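As an example of how simple the local-installation path usually is: most of the models on that list load through Hugging Face `transformers`. A rough sketch; `EleutherAI/pythia-160m` is just one small model from that list that I’m using as an illustration, so check its training-data provenance against your own criteria before treating it as “clean”:

```python
# Sketch: run a small open model locally via Hugging Face transformers.
# The model ID is an illustrative pick; swap in any model from the list.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open training data matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```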
Both of these lists aren’t especially difficult to get onto, so I imagine some entries have problems with falsification or mislabeling, as you point out. But there’s little reason for people to do that (beyond inflating a paper’s results, I guess?).
Art generation seems to have had a harder time, but there are Stable Diffusion equivalents trained only on CC-licensed work. A few minutes of searching turned up Common Canvas, which claims to be competitive.
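If you want to try it, Common Canvas checkpoints are reportedly published on Hugging Face and should load through the standard `diffusers` pipeline. A sketch only; the model identifier below is my assumption, so verify the exact ID and license on the model page:

```python
# Sketch: generate an image from a CC-trained checkpoint via diffusers.
# The model ID is an assumption; verify it on Hugging Face before relying on it.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "common-canvas/CommonCanvas-S-C",  # assumed ID; substitute the real one
    torch_dtype=torch.float16,
).to("cuda")  # use .to("cpu") with the default dtype if you have no GPU

image = pipe("a watercolor landscape of rolling hills").images[0]
image.save("common_canvas_sample.png")
```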
Excellent, thank you for posting sources and being a generally excellent human.