The UK Information Commissioner's Office has issued, for public comment, guidance on the lawful basis for scraping data from the web to train AI (Chapter 1 of the consultation). Key points:
🔹 Most developers of generative AI rely on publicly accessible sources for their training data, usually through web scraping.
🔹 To be fair and lawful, your data collection can't be in breach of any laws - this condition is not met if the scraping of personal data infringes legislation outside of data protection, such as intellectual property or contract law.
🔹 Legitimate interests can be a valid lawful basis for training generative AI models on web-scraped data, but only when the model's developer can ensure they pass the three-part test.
Purpose test: is there a valid interest?
🔹 Despite the many potential downstream uses of a model, you need to frame the interest in a specific, rather than open-ended, way, based on what information you have access to at the time of collecting the training data.
🔹 If you don't know what your model is going to be used for, how can you ensure its downstream use will respect data protection and people's rights and freedoms?
Necessity test: is web scraping necessary given the purpose?
The ICO's understanding is that, currently, most generative AI training is only possible using the volume of data obtained through large-scale scraping.
Balancing test: do individuals' rights override the interest of the generative AI developer?
🔹 Collecting data through web scraping is an 'invisible processing' activity.
🔹 Invisible processing and AI-related processing are both seen as high-risk activities that require a DPIA under ICO guidance.
Risk mitigations to consider in the balancing test
If you are the developer and rely on the public interest of wider society for the first part of the test, you should be able to:
🔹 control and evidence whether the generative AI model is actually used for the stated wider societal benefit;
🔹 assess risks to individuals (both in advance during generative AI development and as part of ongoing monitoring post-deployment);
🔹 implement technical and organisational measures to mitigate risks.
If you deploy a third-party model through an API:
🔹 The developer should implement TOMs (e.g., output filters) and organisational controls over the deployment, such as limiting queries (preventing those likely to result in risks or harms to individuals) and monitoring the use of the model.
🔹 Use contractual restrictions and measures, with the developer legally limiting the ways in which the generative AI model can be used by its customers.
If you provide a model to a third party:
🔹 Use contractual controls to mitigate the risk of having little control over how the model is used - but these might not be effective on their own.
🔹 You need to evidence that any such controls are being complied with in practice.
#dataprivacy #dataprotection #privacyFOMO https://lnkd.in/eev_Qhah
Sharing Data Responsibly In AI Model Training
Explore top LinkedIn content from expert professionals.
Summary
Sharing data responsibly in AI model training involves implementing ethical, legal, and technical safeguards so that data used to train machine learning models is handled in a way that respects individuals' privacy and ownership rights, meets regulatory requirements, and addresses potential security risks.
- Establish clear policies: Create detailed privacy policies and user agreements that specify data collection, usage, and ownership practices to ensure compliance with regulations.
- Secure and verify data: Use cryptographic controls, strict access protocols, and regular audits to ensure data confidentiality, integrity, and lawful usage throughout the AI training process.
- Mitigate data risks: Proactively address issues like data poisoning, bias, and drift by employing privacy-preserving techniques, secure deletion protocols, and continuous risk assessments.
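To make the "privacy-preserving techniques" point above more concrete, here is a minimal sketch of one such step, data masking, applied before records enter a training corpus: direct identifiers are dropped and the user ID is replaced with a salted hash, so records can still be linked for deduplication or deletion requests without exposing the raw identifier. The record fields, salt handling, and function names are illustrative assumptions, not taken from any of the posts below.

```python
import hashlib
import os

# Hypothetical salt; in practice this would live in a secrets manager so
# pseudonyms stay stable across pipeline runs.
SALT = os.environ.get("PSEUDONYM_SALT", "change-me").encode()

# Fields we never want to carry into a training corpus (illustrative list).
DIRECT_IDENTIFIERS = {"name", "email", "phone", "address"}

def pseudonymize(record: dict) -> dict:
    """Drop direct identifiers and replace user_id with a salted SHA-256 pseudonym."""
    masked = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "user_id" in masked:
        digest = hashlib.sha256(SALT + str(masked["user_id"]).encode()).hexdigest()
        masked["user_id"] = digest[:16]  # truncated pseudonym, still usable for linkage
    return masked

example = {"user_id": 42, "email": "jane@example.com", "text": "support ticket body"}
print(pseudonymize(example))
```

Note that salted hashing is pseudonymization, not anonymization: re-identification is still possible for anyone holding the salt, which is why the summary above pairs such techniques with access controls and audits.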
The Cybersecurity and Infrastructure Security Agency, together with the National Security Agency, the Federal Bureau of Investigation (FBI), the National Cyber Security Centre, and other international organizations, published this advisory providing recommendations for organizations on how to protect the integrity, confidentiality, and availability of the data used to train and operate #artificialintelligence.
The advisory focuses on three main risk areas:
1. Data #supplychain threats: Including compromised third-party data, poisoning of datasets, and lack of provenance verification.
2. Maliciously modified data: Covering adversarial #machinelearning, statistical bias, metadata manipulation, and unauthorized duplication.
3. Data drift: The gradual degradation of model performance due to changes in real-world data inputs over time.
The best practices recommended include:
- Tracking data provenance and applying cryptographic controls such as digital signatures and secure hashes (see the sketch after this post).
- Encrypting data at rest, in transit, and during processing, especially sensitive or mission-critical information.
- Implementing strict access controls and classification protocols based on data sensitivity.
- Applying privacy-preserving techniques such as data masking, differential #privacy, and federated learning.
- Regularly auditing datasets and metadata, conducting anomaly detection, and mitigating statistical bias.
- Securely deleting obsolete data and continuously assessing #datasecurity risks.
This is a helpful roadmap for any organization deploying #AI, especially those working with limited internal resources or relying on third-party data.
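As a rough illustration of the first best practice above (tracking provenance with secure hashes), the sketch below records a SHA-256 digest per training file plus an HMAC over the manifest, so later pipeline stages can verify that the data has not been swapped or modified before training. The manifest format and key handling are assumptions for illustration, not prescriptions from the advisory; real deployments would use managed keys or full digital signatures.

```python
import hashlib
import hmac
import json
from pathlib import Path

# Hypothetical signing key; real deployments would pull this from a KMS or HSM.
SIGNING_KEY = b"replace-with-managed-key"

def file_digest(path: Path) -> str:
    """SHA-256 of a file, streamed so large datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(files: list[Path]) -> dict:
    """Provenance manifest: per-file digests plus an HMAC tag over the whole manifest."""
    entries = {str(p): file_digest(p) for p in files}
    payload = json.dumps(entries, sort_keys=True).encode()
    tag = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"files": entries, "hmac": tag}

def verify_manifest(manifest: dict) -> bool:
    """Re-check the HMAC and re-hash every file before the data is used for training."""
    payload = json.dumps(manifest["files"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(manifest["hmac"], expected):
        return False
    return all(file_digest(Path(p)) == d for p, d in manifest["files"].items())
```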
During seed round due diligence, we found a red flag: the startup didn't have rights to the dataset used to train its LLM and hadn't set up a privacy policy for data collection or use.
AI startups need to establish certain legal and operational frameworks to ensure they have and maintain the rights to the data they collect and use, especially for training their AI models. Here are the key elements for compliance:
1. Privacy Policy: A comprehensive privacy policy that clearly outlines data collection, usage, retention, and sharing practices.
2. Terms of Service/User Agreement: Agreements that users accept, which should include clauses about data ownership, licensing, and how the data will be used.
3. Data Collection Consents: Explicit consent from users for the collection and use of their data, often obtained through clear opt-in mechanisms.
4. Data Processing Agreements (DPAs): If using third-party services or processors, DPAs are necessary to define the responsibilities and scope of data usage.
5. Intellectual Property Rights: Ensure that the startup has clear intellectual property rights over the collected data, through licenses, user agreements, or other legal means.
6. Compliance with Regulations: Adherence to relevant data protection regulations such as GDPR, CCPA, or HIPAA, which may dictate specific requirements for data rights and user privacy.
7. Data Anonymization and Security: Implementing data anonymization where necessary and ensuring robust security measures to protect data integrity and confidentiality.
8. Record Keeping: Maintain detailed records of data consents, privacy notices, and data usage to demonstrate compliance with laws and regulations (a minimal sketch of a consent check appears after this post).
9. Data Audits: Regular audits to ensure that data collection and usage align with stated policies and legal obligations.
10. Employee Training and Policies: Training for employees on data protection best practices and establishing internal policies for handling data.
By having these elements in place, AI startups can help ensure they have the legal rights to use the data for training their AI models and can mitigate risks associated with data privacy and ownership.
#startupfounder #aistartup #dataownership
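To make items 3 and 8 above slightly more concrete, here is a minimal, hypothetical sketch of how a training-data pipeline might gate records on recorded consent: only rows whose users have an active, purpose-matching opt-in are admitted, and exclusions are logged for audit. The record schema, purpose string, and logging approach are illustrative assumptions, not requirements stated in the post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical consent record, as it might be exported from a consent-management tool.
@dataclass
class ConsentRecord:
    user_id: str
    purpose: str                      # e.g. "model_training"
    granted: bool
    revoked_at: datetime | None = None

def has_valid_consent(record: ConsentRecord, purpose: str, as_of: datetime) -> bool:
    """An opt-in must exist for this purpose and must not have been revoked before `as_of`."""
    if not record.granted or record.purpose != purpose:
        return False
    return record.revoked_at is None or record.revoked_at > as_of

def filter_training_rows(rows: list[dict], consents: dict[str, ConsentRecord]) -> list[dict]:
    """Keep only rows whose user has valid 'model_training' consent; log exclusions for audit."""
    now = datetime.now(timezone.utc)
    kept = []
    for row in rows:
        consent = consents.get(row["user_id"])
        if consent and has_valid_consent(consent, "model_training", now):
            kept.append(row)
        else:
            print(f"audit: excluding row for user {row['user_id']} (no valid consent)")
    return kept
```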