The biggest issue with generative AI, at least to me, is that it’s trained on human-made works whose original authors didn’t consent to, or even know about, their work being used to train the AI. Are there any initiatives to address this? I’m thinking of something like an open source AI model plus a training data store that only contains works in the public domain or under highly permissive no-attribution licenses, as well as original works submitted by the open source community and explicitly licensed to allow AI training.
I guess the hard part is moderating the database and ensuring all works are licensed properly and people are actually submitting their own works, but does anything like this exist?
As I understand it, there are many, many such models, especially ones made for academic use. Some common training corpora are listed here: https://www.tensorflow.org/datasets
Examples include Wikipedia edits and discussions, and open-access scientific articles.
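If you want to poke at one of those corpora yourself, it only takes a few lines with the `tensorflow-datasets` package. A minimal sketch, assuming you have it installed; the Wikipedia config names are dated snapshots, so the exact one may differ in your catalog version (and note the full English dump is a large download):

```python
# Minimal sketch: load a snapshot of the Wikipedia corpus from the TFDS catalog.
# The config name "20190301.en" is one dated snapshot; yours may differ.
import tensorflow_datasets as tfds

ds = tfds.load("wikipedia/20190301.en", split="train")

for example in ds.take(1):
    # Each record is a dict of tensors; decode the raw bytes to text.
    print(example["title"].numpy().decode("utf-8"))
    print(example["text"].numpy().decode("utf-8")[:200])
```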
Almost all research models are going to be trained on stuff like this. Many of them have demos, open code, and local installation instructions. They generally don’t have a marketing budget. Some of the models listed here certainly qualify: https://github.com/eugeneyan/open-llms?tab=readme-ov-file
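As an example of how simple the local-installation path usually is: most of the models on that list load through Hugging Face `transformers`. A rough sketch; `EleutherAI/pythia-160m` is just one small model from that list that I’m using as an illustration, so check its training-data provenance against your own criteria before treating it as “clean”:

```python
# Sketch: run a small open model locally via Hugging Face transformers.
# The model ID is an illustrative pick; swap in any model from the list.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Open training data matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```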
Both of these lists aren’t especially difficult to get onto, so I imagine some entries have problems with falsification or mislabeling, as you point out. But there’s little reason for people to do that (beyond inflating a paper’s results, I guess?).
Art generation seems to have had a harder time, but there are Stable Diffusion equivalents trained only on CC-licensed work. A few minutes of searching turned up Common Canvas, which claims to be competitive.
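If you want to try it, Common Canvas checkpoints are reportedly published on Hugging Face and should load through the standard `diffusers` pipeline. A sketch only; the model identifier below is my assumption, so verify the exact ID and license on the model page:

```python
# Sketch: generate an image from a CC-trained checkpoint via diffusers.
# The model ID is an assumption; verify it on Hugging Face before relying on it.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "common-canvas/CommonCanvas-S-C",  # assumed ID; substitute the real one
    torch_dtype=torch.float16,
).to("cuda")  # use .to("cpu") with the default dtype if you have no GPU

image = pipe("a watercolor landscape of rolling hills").images[0]
image.save("common_canvas_sample.png")
```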
Excellent, thank you for posting sources and being a generally excellent human.