European Data Protection Board issues long-awaited opinion on AI models: part 3 - anonymization (see Part 1 on consequences: https://shorturl.at/TYbq3 and Part 2 on the legitimate interest legal basis: https://shorturl.at/ba5A1). 🔹️ AI models are not always anonymous; assess case by case. 🔹️ AI models specifically designed to provide personal data regarding individuals whose personal data were used to train the model cannot be considered anonymous. 🔹️ For an AI model to be considered anonymous, both (1) the likelihood of direct (including probabilistic) extraction of personal data regarding individuals whose personal data were used to develop the model and (2) the likelihood of obtaining, intentionally or not, such personal data from queries should be insignificant, taking into account ‘all the means reasonably likely to be used’ by the controller or another person. 🔹️ Pay special attention to the risk of singling out, which is substantial. 🔹️ Consider all means reasonably likely to be used by the controller or another person to identify individuals, which may include: characteristics of the training data, AI model & training procedure; context; additional information; costs and amount of time needed to obtain such information; available technology & technological developments. 🔹️ Such means & levels of testing may differ between a publicly available model and one to be used only internally by employees. 🔹️ Consider the risk of identification by the controller & by different types of ‘other persons’, including unintended third parties accessing the AI model, and unintended reuse or disclosure of the model. Be able to prove, through steps taken and documentation, that you have taken effective measures to anonymize the AI model. Otherwise, you may be in breach of your accountability obligations under Article 5(2) GDPR. Factors to consider: 🔹️ selection of sources (selection criteria; relevance and adequacy of chosen sources; exclusion of inappropriate sources). 🔹️ preparation of data for the training phase (could you use anonymous or pseudonymised data; if not, why not?); data minimisation strategies & techniques to restrict the volume of personal data included in the training process; data filtering processes to remove irrelevant personal data. 🔹️ methodological choices regarding training: improve model generalisation & reduce overfitting; privacy-preserving techniques (e.g. differential privacy). 🔹️ measures regarding the model’s outputs (lower the likelihood of obtaining personal data related to training data from queries). 🔹️ conduct sufficient tests on the model that cover widely known, state-of-the-art attacks: e.g. attribute and membership inference; exfiltration; regurgitation of training data; model inversion; or reconstruction attacks. 🔹️ document the process, including: DPIA; advice by the DPO; technical & organisational measures; the AI model’s theoretical resistance to re-identification techniques. #dataprivacy #dataprotection #privacyFOMO #AIFOMO Pic by Grok
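As an illustration of the kind of testing the opinion calls for, here is a minimal, hedged sketch of a loss-threshold membership-inference check against a toy scikit-learn classifier. The dataset, model, and scoring rule are placeholders chosen for brevity, not the EDPB's prescribed methodology; real assessments would use stronger, state-of-the-art attacks.

```python
# Hedged sketch: loss-threshold membership-inference test on a toy classifier.
# The data, model, and attack are illustrative stand-ins, not the EDPB's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for training records that contain personal data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_members, X_nonmembers, y_members, y_nonmembers = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_members, y_members)

def per_sample_loss(clf, X, y):
    """Cross-entropy loss per record; memorised members tend to score lower."""
    probs = clf.predict_proba(X)
    return -np.log(np.clip(probs[np.arange(len(y)), y], 1e-12, None))

loss_in = per_sample_loss(model, X_members, y_members)
loss_out = per_sample_loss(model, X_nonmembers, y_nonmembers)

# AUC of "low loss implies member": ~0.5 means the attacker learns nothing,
# values well above 0.5 flag memorisation worth investigating.
labels = np.concatenate([np.ones(len(loss_in)), np.zeros(len(loss_out))])
scores = np.concatenate([-loss_in, -loss_out])  # lower loss -> higher membership score
print("Membership-inference AUC:", round(roc_auc_score(labels, scores), 3))
```

An AUC close to 0.5 suggests this particular attack cannot tell members from non-members; anything well above 0.5 signals memorisation that should be investigated and documented before any anonymity claim is made.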
Data Privacy Standards for Open AI Models
Summary
Data privacy standards for open AI models refer to the rules and practices that help keep personal and sensitive information safe when AI systems are trained, used, or deployed. These standards are crucial to making sure AI models do not accidentally expose, misuse, or identify individuals from the data they process or generate.
- Review data sources: Make sure any information used to train AI models does not include sensitive or personal details that could be traced back to real people.
- Adjust privacy settings: Regularly update model settings to prevent your content from being used to train AI systems and minimize what personal information is stored or remembered.
- Document and test: Keep clear records of the steps taken to anonymize data and routinely test your AI models for risks like data leaks or unintended identification.
-
HUGE NEWS EVERYONE: OpenAI just launched ChatGPT Enterprise. This is a significant milestone in the intersection of AI and the corporate world. Marketed as an enterprise-grade solution with advanced security, data protection, and unlimited access to GPT-4 functionalities, it is projected to fundamentally reshape work processes within organisations. However, this technological leap raises nuanced legal issues, particularly in the realms of data protection, intellectual property (IP), and the forthcoming AI Act’s foundation model regulatory obligations. ChatGPT Enterprise assures users of robust data protection, stipulating that the model is not trained on business-specific data and that all conversations are encrypted both in transit and at rest. OpenAI claims the platform's SOC 2 compliance adds an additional layer of trust in its security protocols. However, from a legal perspective, questions arise around data ownership and control. OpenAI promises not to train the model on user-specific data, but what about when a company fine-tunes the model on its own data - what are the data protection considerations then? GDPR imposes stringent requirements on data usage, sharing, and deletion, which businesses employing ChatGPT Enterprise must consider. ChatGPT Enterprise's capability to assist in creative work, coding, and data analysis poses tricky questions in relation to ownership. For example, if the AI generates a piece of written content or code, who owns the copyright? The current legal framework, which traditionally recognises human authorship, may not be fully equipped to navigate the nuances of AI-generated IP. The US District Court last week ruled that AI-generated work cannot be copyrighted. What if you, as a company, are engaging third parties to develop code and other work output? If they are using ChatGPT Enterprise to generate the outputs, there may be nothing protected by copyright, and no IP rights to assign to you. How will you address that? Then there’s Article 28b of the forthcoming AI Act, which imposes strict regulatory obligations on providers of certain foundation models (like GPT-4). If you fine-tune the model enough, that could potentially make YOU the provider, with all the regulatory obligations that could bring. And if it doesn’t, you may still have user obligations. Mass adoption of AI across various sectors could draw scrutiny from competition regulators. Could OpenAI’s ubiquity in over 80% of Fortune 500 companies raise concerns about market competition and behaviour? The debut of ChatGPT Enterprise marks an inflection point in the deployment of AI in enterprise environments. While its promise of improved productivity and robust data protection is enticing, businesses and legal experts must pay heed to the complex legal landscape it interacts with. Comprehensive regulation and judicious legal practice are critical in balancing technological advancement with the protection of individual and corporate rights.
-
ChatGPT's default settings are a privacy trap, and it’s not safe to use ChatGPT until you update those settings! Default settings compromise your privacy because your chats (that likely contain personal or sensitive information) are used to train OpenAI’s models, are reviewed by humans, and can be used to identify you uniquely. ChatGPT is super useful, and I’m not advocating that people stop using this tool (and others like it) for privacy reasons. In fact, I think most people should be using ChatGPT to be better and more efficient in their work and personal lives. However, using ChatGPT safely and privately is critically important. By taking these 6 simple but critical steps now, you can use ChatGPT with the peace of mind that you are minimizing your privacy and safety risk. 1. Turn off model improvement. On chatgpt.com, click your profile icon (in the upper right corner) → Settings → Data Controls → Turn Off “Improve the model for everyone”. 2. Use “Temporary Chats” to prevent ChatGPT from saving a history of your chats. 3. Submit a “Do not train on my content” Privacy Center request at https://lnkd.in/dASS6aA8. Ironically, for a privacy site, cookie blockers may cause this website not to load, so try an incognito browser window if you get an error with that link. Enter the email associated with your account, click on the link in the confirmation email OpenAI sends you, and then submit your request. If it works, you’ll get another confirmation email, and a day or so later, you’ll get a final confirmation that this was done. 4. Consider disabling “Memory”. Memory creates ongoing personalization by remembering details between chats and your preferences. From a privacy perspective, this creates ways to uniquely identify you and may store personal information about you. Do this from Settings → Personalization → Turn Off “Memory”. 5. Keep Customization to a minimum. Under Settings → Personalization → Customize ChatGPT, be cautious about what personal information you provide. 6. Periodically delete old chats. Do this from Settings → General → Delete All Chats. For more detailed instructions and additional information on this topic, check out the Substack I wrote about it. #LIPostingDayApril #privacy
-
Before diving headfirst into AI, companies need to define what data privacy means to them in order to use GenAI safely. After decades of harvesting and storing data, many tech companies have created vast troves of the stuff - and not all of it is safe to use when training new GenAI models. Most companies can easily recognize obvious examples of Personally Identifying Information (PII) like Social Security numbers (SSNs) - but what about home addresses, phone numbers, or even information like how many kids a customer has? These details can be just as critical to ensure newly built GenAI products don’t compromise their users' privacy - or safety - but once this information has entered an LLM, it can be really difficult to excise it. To safely build the next generation of AI, companies need to consider some key issues: ⚠️Defining Sensitive Data: Companies need to decide what they consider sensitive beyond the obvious. Personally identifiable information (PII) covers more than just SSNs and contact information - it can include any data that paints a detailed picture of an individual and needs to be redacted to protect customers. 🔒Using Tools to Ensure Privacy: Ensuring privacy in AI requires a range of tools that can help tech companies process, redact, and safeguard sensitive information. Without these tools in place, they risk exposing critical data in their AI models. 🏗️ Building a Framework for Privacy: Redacting sensitive data isn’t just a one-time process; it needs to be a cornerstone of any company’s data management strategy as they continue to scale AI efforts. Since PII is so difficult to remove from an LLM once added, GenAI companies need to devote resources to making sure it doesn’t enter their databases in the first place. Ultimately, AI is only as safe as the data you feed into it. Companies need a clear, actionable plan to protect their customers - and the time to implement it is now.
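To make the redaction point concrete, here is a hedged sketch of a regex-based pre-ingestion filter for obvious PII. The patterns and placeholder labels are illustrative assumptions; production pipelines typically layer NER models and dedicated PII-detection tools on top, since simple regexes miss names and indirect identifiers like the ones the post describes.

```python
# Illustrative sketch: regex-based redaction of obvious PII before text enters a
# training corpus. Patterns are simplified examples; real pipelines usually add
# dictionaries, NER models, and dedicated PII-detection tooling.
import re

PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII spans with typed placeholders, e.g. [REDACTED_EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

record = "Contact Jane at jane.doe@example.com or 415-555-0199, SSN 123-45-6789."
print(redact(record))
# Note: the name "Jane" survives; simple regexes miss names and quasi-identifiers
# (home address, number of kids, etc.), which is why broader tooling matters.
```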
-
If you are an organisation using AI or you are an AI developer, the Australian privacy regulator has just published some vital information about AI and your privacy obligations. Here is a summary of the new guides for businesses published today by the Office of the Australian Information Commissioner, which articulate how Australian privacy law applies to AI and set out the regulator’s expectations. The first guide aims to help businesses comply with their privacy obligations when using commercially available AI products and help them to select an appropriate product. The second provides privacy guidance to developers using personal information to train generative AI models. GUIDE ONE: Guidance on privacy and the use of commercially available AI products Top five takeaways * Privacy obligations will apply to any personal information input into an AI system, as well as the output data generated by AI (where it contains personal information). * Businesses should update their privacy policies and notifications with clear and transparent information about their use of AI. * If AI systems are used to generate or infer personal information, including images, this is a collection of personal information and must comply with APP 3 (which deals with collection of personal information). * If personal information is being input into an AI system, APP 6 requires entities to only use or disclose the information for the primary purpose for which it was collected. * As a matter of best practice, the OAIC recommends that organisations do not enter personal information, and particularly sensitive information, into publicly available generative AI tools. GUIDE TWO: Guidance on privacy and developing and training generative AI models Top five takeaways * Developers must take reasonable steps to ensure accuracy in generative AI models. * Just because data is publicly available or otherwise accessible does not mean it can legally be used to train or fine-tune generative AI models or systems. * Developers must take particular care with sensitive information, which generally requires consent to be collected. * Where developers are seeking to use personal information that they already hold for the purpose of training an AI model, and this was not a primary purpose of collection, they need to carefully consider their privacy obligations. * Where a developer cannot clearly establish that a secondary use for an AI-related purpose was within reasonable expectations and related to a primary purpose, to avoid regulatory risk they should seek consent for that use and/or offer individuals a meaningful and informed ability to opt out of such a use. https://lnkd.in/gX_FrtS9
-
Privacy isn’t a policy layer in AI. It’s a design constraint. The new EDPB guidance on LLMs doesn’t just outline risks. It gives builders, buyers, and decision-makers a usable blueprint for engineering privacy - not just documenting it. The key shift? → Yesterday: Protect inputs → Today: Audit the entire pipeline → Tomorrow: Design for privacy observability at runtime The real risk isn’t malicious intent. It’s silent propagation through opaque systems. In most LLM systems, sensitive data leaks not because someone intended harm but because no one mapped the flows, tested outputs, or scoped where memory could resurface prior inputs. This guidance helps close that gap. And here’s how to apply it: For Developers: • Map how personal data enters, transforms, and persists • Identify points of memorization, retention, or leakage • Use the framework to embed mitigation into each phase: pretraining, fine-tuning, inference, RAG, feedback For Users & Deployers: • Don’t treat LLMs as black boxes. Ask if data is stored, recalled, or used to retrain • Evaluate vendor claims with structured questions from the report • Build internal governance that tracks model behaviors over time For Decision-Makers & Risk Owners: • Use this to complement your DPIAs with LLM-specific threat modeling • Shift privacy thinking from legal compliance to architectural accountability • Set organizational standards for “commercial-safe” LLM usage This isn’t about slowing innovation. It’s about future-proofing it. Because the next phase of AI scale won’t just be powered by better models. It will be constrained and enabled by how seriously we engineer for trust. Thanks European Data Protection Board, Isabel Barberá H/T Peter Slattery, PhD
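For developers who want to start mapping those flows, here is a minimal sketch of what a machine-readable personal-data flow map might look like. The structure, field names, and sample entries are assumptions made for illustration, not terminology or content from the EDPB report.

```python
# Toy sketch of a personal-data flow map for an LLM pipeline (illustrative only;
# field names and sample values are assumptions, not taken from the EDPB report).
from dataclasses import dataclass, field

@dataclass
class DataFlow:
    phase: str               # e.g. "pretraining", "fine-tuning", "inference", "RAG", "feedback"
    source: str               # where the data enters (user prompt, document store, logs, ...)
    contains_personal_data: bool
    retention: str            # how long and where it persists
    leakage_paths: list[str] = field(default_factory=list)   # where it could resurface
    mitigations: list[str] = field(default_factory=list)

flows = [
    DataFlow("inference", "user prompt", True, "30-day provider logs (assumed)",
             leakage_paths=["conversation memory", "vendor retraining"],
             mitigations=["opt out of training", "prompt redaction"]),
    DataFlow("RAG", "internal document store", True, "persistent vector index",
             leakage_paths=["retrieved chunks quoted in outputs"],
             mitigations=["access controls on the index", "output filtering"]),
    DataFlow("feedback", "thumbs-up/down comments", True, "indefinite (assumed)"),
]

# A simple audit: flag every flow carrying personal data with no documented mitigation.
for f in flows:
    if f.contains_personal_data and not f.mitigations:
        print(f"UNMITIGATED FLOW: {f.phase} / {f.source}")
```

Even a table this small makes the guidance actionable: every unmitigated flow becomes a concrete work item for the DPIA rather than an invisible default.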
-
On 21 October 2024, the Office of the Australian Information Commissioner (OAIC) published the ‘Guidance on privacy and developing and training generative AI models'. The Guidance applies to: 1. Developers of GenAI models or systems, who are subject to the Privacy Act. 2. Where an organisation provides PI to a developer so that they can develop or fine-tune a GenAI model. 3. Acts or practices outside Australia by organisations with an Australian link e.g., incorporated in Australia, or carrying on business in Australia (same as the Privacy Act). Some parts are mandatory, containing practices that must be followed in order to comply with Privacy Act obligations. Some parts are best practices. So if the Guidance applies to you, you should read through what it discusses: - APP 1: open and transparent management - APP 3: collection - APP 5: notification - APP 6: use/disclosure of PI - APP 10: quality, accuracy - Privacy considerations when collecting and processing a training dataset It's written in a conversational tone with sections containing examples and practical tips. This makes it easier for laypersons to understand, and examples always help! It's also very long with slightly different presentation styles as between the APPs so as I read on, TBH I found it increasingly painful to get the hang of and get through. While the Guidance says that the estimated reading time is 52 minutes, be prepared to sit down for much longer if you're thinking about operationalisation - I probably took more than 5 hours trying to digest it. In Substack I've extracted the key points and discussed compliance issues over 3 posts (links in comments, for subscribers). These are some of the key points I think covered entities need to take note of: 1. The creation of a dataset through scraped data is generally a covert method of collection, and generally it will be unfair (and contrary to APP 3.5) to collect PI covertly without the individual’s knowledge. 2. Developers shouldn’t assume that PI posted publicly can be used to train models and complies with APP 3.5. 3. The Guidance recommends that developers seek information or assurances from third parties in relation to the third party's dataset e.g. through contractual clauses. 4. Developers must consider if the dataset they intend to use for training a GenAI model, is likely to contain PI. They should consider the data in its totality, including the data, associated metadata, any annotations, labels, or other descriptions attributed to the data. 5. De-identifying PI is a use of the PI for a secondary purpose.
-
To build on my previous short post on OpenAI's ChatGPT Enterprise announcement... A common concern that organizations have when using public Generative AI services and Large Language Models relates to security and privacy. There is a real risk of accidentally sharing confidential information through prompts or allowing services to train their models using proprietary data. I have spoken with over ten Silicon Valley startups that have identified this risk as an opportunity to sell their proprietary solutions for de-risking the use of Gen AI and LLMs. These solutions encrypt communication, API use, and storage, and keep private data on private instances of cloud services or on-premises. OpenAI's ChatGPT Enterprise is their response to this risk, offering enterprise-grade security and privacy. It also provides "unlimited higher-speed GPT-4 access, longer context windows for processing longer inputs, advanced data analysis capabilities, customization options, and much more." OpenAI promises not to train its models on business data or conversations, and it is SOC 2 compliant. All conversations are encrypted in transit and at rest. Additionally, a new admin console allows for easy team member management, domain verification, SSO, and usage insights, making it suitable for large-scale deployment in enterprises. In summary, I think this is the first real attempt by a company to convince large organizations to trust a model that is both open and public, and secure and private. I anticipate that Microsoft, Google, AWS, and other big tech providers will introduce more secure and scalable options in the near future. Two things will happen as a consequence: 🚀 More organizations will adopt Gen AI and LLMs as part of their processes and technology stack ⏰ The window of opportunity for startups that address security and privacy challenges is closing quickly Or, as Abhishek so eloquently put it: WeProtectYourDataStartups.shutDown(); Read more here: https://lnkd.in/gjXcJEMi #generativeai #openai #chatgpt Mark Oost Ron Tolido Pascal Brier Bryan Brochu
-
The European Data Protection Board published the report “AI Privacy Risks & Mitigations - Large Language Models” to help developers and users systematically identify, assess, and mitigate privacy and data protection risks, supporting the responsible development and deployment of LLM systems. The report starts by explaining how LLMs work, their common applications, and performance evaluation measures. Then it discusses how #privacy risks emerge across different #LLM service models, emphasizing the importance of understanding data flows throughout the #artificialintelligence lifecycle. Once the risks are identified and classified, the document shows how to prioritize mitigation efforts effectively. It also discusses residual risk acceptance and the iterative nature of risk management. This guidance also supports the requirements of Art. 25 #GDPR (data protection by design and by default) and Art. 32 (security of processing) by offering technical and organizational measures to help ensure an appropriate level of #security and data protection.
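As a toy illustration of prioritising mitigation efforts, here is a likelihood-times-severity scoring sketch. The listed risks and scores are invented for the example, and the simple heuristic is an assumption for illustration, not the EDPB report's methodology.

```python
# Minimal sketch: likelihood x severity scoring to rank LLM privacy risks for
# mitigation. Risks and scores below are made-up examples, not from the report.
RISKS = [
    # (risk description, likelihood 1-5, severity 1-5)
    ("Regurgitation of training data in outputs", 3, 5),
    ("Prompt logs retained beyond the stated period", 4, 3),
    ("Membership inference against a fine-tuned model", 2, 4),
]

def score(likelihood: int, severity: int) -> int:
    """Simple multiplicative risk score; higher means mitigate first."""
    return likelihood * severity

for name, likelihood, severity in sorted(
        RISKS, key=lambda r: score(r[1], r[2]), reverse=True):
    print(f"{score(likelihood, severity):>2}  {name}")
```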
-
MAKE YOUR AI MODEL DUMBER AND SATISFY DATA DELETION RIGHTS. The EDPB commissioned an expert report on how to effectively implement data subject rights in the context of AI. Notably, the report assumes that most models contain personal data because models are “compressed versions of the training data” and are susceptible to attacks that extract such data. (This is consistent with last year's EDPB AI opinion.) TRAINING DATA The expert states that the most effective way to address deletion requests would be to delete the model, remove the personal data at issue, and then retrain the model. However, he acknowledges that this can be extremely expensive for large models (not to mention the environmental impact). In light of that, the expert examines various other methods. He states that the only method currently proven to work is “sharding” the training data, training on the shards, and combining the results. Then you only need to retrain a particular shard when a deletion request comes in. The report discusses other methods as well, including approximate unlearning (e.g., fine-tuning the model to forget the data), differential privacy, and model retiring (i.e., only retraining after a critical mass of deletion requests is received). OUTPUTS The report also discusses various methods to address potential privacy issues with model outputs. These include fine-tuning to prevent certain outputs, creating undesired outputs and telling the model to deprioritize them, or instructing the model to modify outputs. Bottom line: there doesn’t seem to be a definitive method to fulfill deletion requests for AI models. Whether this will prevent enforcement actions remains to be seen. (link to report in comments) #law #artificialintelligence #privacy
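A hedged sketch of the sharding idea (in the spirit of SISA-style unlearning): train one sub-model per shard and, when a deletion request arrives, retrain only the shard that held the record. The model choice, shard assignment, and majority-vote aggregation are assumptions made for illustration, not the expert report's specification.

```python
# Illustrative sketch of shard-based unlearning: one sub-model per data shard,
# so honouring a deletion request retrains a single shard, not the whole model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

N_SHARDS = 4
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
shard_ids = np.arange(len(X)) % N_SHARDS                      # record -> shard assignment
shards = {s: np.where(shard_ids == s)[0] for s in range(N_SHARDS)}

def train_shard(idx):
    """Fit a sub-model on the records currently assigned to one shard."""
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

models = {s: train_shard(idx) for s, idx in shards.items()}

def predict(x):
    """Aggregate the shard models by majority vote."""
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models.values()]
    return max(set(votes), key=votes.count)

def delete_record(record_id: int):
    """Honour a deletion request by dropping the record and retraining its shard only."""
    s = record_id % N_SHARDS
    shards[s] = shards[s][shards[s] != record_id]
    models[s] = train_shard(shards[s])

delete_record(42)            # retrains 1 of 4 shards instead of the whole ensemble
print(predict(X[0]))
```

The trade-off the post hints at is visible here: more shards make deletions cheaper, but each sub-model sees less data, which is exactly why the approach can make the model "dumber".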