The Generative AI Copyright Disclosure Act introduced by @RepAdamSchiff is great, but there's one change it would be great to see: require gen AI companies to disclose *all* their training data, not just copyrighted works.

The main reason is to ensure synthetic data is disclosed. Synthetic data is non-copyrighted content that is itself often created using a gen AI model trained on copyrighted works without permission. If companies don't have to disclose it, they will be able to launder copyrighted works through other models without scrutiny.

A second, lesser reason is that it would actually be easier to demand of gen AI companies. There is sometimes uncertainty over which copyrighted works a large dataset contains. If you have to reveal all your training data, this isn't an issue.

As I say, it's a good bill and a move in the right direction. But by expanding it to cover all of a model's training data, it would better address the problem (opaque exploitation of copyrighted work without consent) and be easier to enforce.
@ednewtonrex @RepAdamSchiff Can you ask your Fairly Trained clients to do that, please? I would love to study their augmentation techniques.
@ednewtonrex @RepAdamSchiff Neither individuals nor companies have to disclose whether they've used public-domain music, art, film, comics, photography, or illustrations for anything. No filmmakers, musicians, or commercial artists ever had to do that before, and most never bothered crediting the original creators.
@ednewtonrex @RepAdamSchiff Soon it'll be genAI produced data sets all the way down. :)
@ednewtonrex @RepAdamSchiff It's very likely that any works supposedly 'removed' from current datasets were replaced with AI-generated ones that effectively add the same weight to the system. We've seen the research into generating synthetic data and locking real images' vectors into models.
@ednewtonrex @RepAdamSchiff I think you're right about this. By only requiring disclosure of copyrighted data, companies that think training data should be a "moat" will use one of two strategies: 1. Disclose everything; it's easier. 2. Disclose only copyrighted data, but they decide what they consider copyrighted data.
@ednewtonrex @RepAdamSchiff It needs to be harsher than that. Some of these are in the EU AI Act! -Disclose all training data. -Delete training data obtained without consent. -AI should not be allowed to replace people at the workplace. -Deepfake technology of any kind should be high risk and either ...