Data Privacy Concerns in Open vs Proprietary Models


Summary

Data privacy concerns in open vs proprietary models refer to how information is protected or exposed when using either open-source AI systems (which anyone can inspect and modify) or closed, commercial ones (which are managed by a company and hidden from public view). This topic highlights the risks of sharing sensitive personal or business data with AI, and the trade-offs between control, transparency, and convenience.

  • Keep data local: Choosing open-source models allows you to run AI safely on your own hardware, so sensitive data never leaves your environment or gets sent to external providers.
  • Understand provider practices: With proprietary AI services, your data may be stored or used to improve their models, so always check what information is collected and who can access it.
  • Prioritize transparency: Open models give you the ability to review how information is handled and customize protections, helping you align privacy measures with your specific needs.
Summarized by AI based on LinkedIn member posts
  • View profile for Jon Krohn
    Jon Krohn is an Influencer

    Co-Founder of Y Carrot 🥕 Fellow at Lightning A.I. ⚡️ SuperDataScience Host 🎙️

    43,201 followers

    Consumers and enterprises dread that Generative A.I. tools like ChatGPT breach privacy by using conversations as training data, storing PII and potentially surfacing confidential data in responses. Prof. Raluca Ada Popa has all the solutions. Today's guest, Raluca:
    • Is Associate Professor of Computer Science at the University of California, Berkeley.
    • Specializes in computer security and applied cryptography.
    • Has had her papers cited over 10,000 times.
    • Is Co-Founder and President of Opaque Systems, a confidential computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., including allowing you to securely interact with Generative A.I.
    • Previously co-founded PreVeil, a now-well-established company that provides end-to-end document and message encryption to over 500 clients.
    • Holds a PhD in Computer Science from MIT.
    Despite being such a deep expert, Raluca does a stellar job of communicating complex concepts simply, so today's episode should appeal to anyone who wants to dig into the thorny issues around data privacy and security associated with Large Language Models (LLMs) and how to resolve them. In the episode, Raluca details:
    • What confidential computing is and how to do it without sacrificing performance.
    • How you can perform inference with an LLM (or even train an LLM!) without anyone, including the LLM developer, being able to access your data.
    • How you can use commercial generative models like OpenAI's GPT-4 without OpenAI being able to see sensitive or personally identifiable information you include in your API query.
    • The pros and cons of open-source versus closed-source A.I. development.
    • How and why you might want to seamlessly run your compute pipelines across multiple cloud providers.
    • Why you should consider a career that blends academia and entrepreneurship.
    Many thanks to Amazon Web Services (AWS) and Modelbit for supporting this episode of SuperDataScience, enabling the show to be freely available on all major podcasting platforms and on YouTube - see comments for details ⬇️
    #superdatascience #generativeai #ai #machinelearning #privacy #confidentialcomputing
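
As a much simpler illustration of that last privacy point than the confidential-computing approach discussed in the episode, the sketch below redacts obvious PII client-side before a prompt reaches a hosted API. It assumes the official openai Python client and crude regex patterns that only catch emails and phone numbers; treat it as a toy, not a substitute for confidential computing.

```python
# Minimal sketch (not Opaque Systems' confidential-computing approach):
# scrub obvious PII locally before a prompt ever reaches a hosted API.
import re
from openai import OpenAI  # assumes the official openai Python client is installed

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Draft a reply to jane.doe@example.com, phone +1 415 555 0100, about her invoice."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": redact(prompt)}],
)
print(response.choices[0].message.content)
```

Real redaction pipelines need far broader entity coverage (names, addresses, identifiers), which is exactly why the confidential-computing approaches in the episode go further.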

  • View profile for Akshay Pachaar

    Co-Founder DailyDoseOfDS | BITS Pilani | 3 Patents | X (187K+)

    166,416 followers

    I just compared the best open-source and closed-source LLMs, and the results were surprising.

    Well, nobody wants to send their data to Google or OpenAI. Yet here we are, shipping proprietary code, customer information, and sensitive business logic to closed-source APIs we don't control. While everyone's chasing the latest closed-source releases, open-source models are quietly becoming the practical choice for many production systems.

    Here's what everyone is missing: open-source models are catching up fast, and they bring something the big labs can't: privacy, speed, and control.

    I built a playground to test this myself. I used Comet's Opik to evaluate models on real code generation tasks, testing correctness, readability, and best practices against actual GitHub repos.

    Here's what surprised me: OSS models like MiniMax-M2 and Kimi K2 performed on par with the likes of Gemini 3 and Claude Sonnet 4.5 on most tasks. In practice, MiniMax-M2 comes out the winner: it's twice as fast and 12x cheaper than models like Sonnet 4.5.

    This isn't just about saving money. When your model is smaller and faster, you can deploy it in places closed-source APIs can't reach:
    ↳ Real-time applications that need sub-second responses
    ↳ Edge devices where latency kills user experience
    ↳ On-premise systems where data never leaves your infrastructure

    MiniMax-M2 runs with only 10B activated parameters. That efficiency means lower latency, higher throughput, and the ability to handle interactive agents without breaking the bank. The intelligence-to-cost ratio here changes what's possible. You're no longer choosing between quality and affordability, and you're not sacrificing privacy for performance. The gap is closing, and in many cases it's already closed.

    If you're building anything that needs to be fast, private, or deployed at scale, it's worth taking a look at what's now available. MiniMax-M2 is 100% open-source and free for developers right now. I have shared the link to their GitHub repo in the first comment, where you will also find the code for the playground and evaluations I've done.
    _____
    Share this with your network if you found this insightful ♻️
    Follow me (Akshay Pachaar) for more insights and tutorials on AI and Machine Learning!
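
The post's actual playground and Opik evaluations are linked in its comments; as a rough, hypothetical stand-in, the sketch below only times the same code-generation prompt against two OpenAI-compatible endpoints, one hosted and one self-hosted. The endpoint URL and model names are placeholders, and judging correctness or readability would still require an evaluation framework such as Opik.

```python
# Hypothetical sketch, not the author's Opik playground: time the same
# code-generation prompt against two OpenAI-compatible endpoints.
import time
from openai import OpenAI

PROMPT = "Write a Python function that parses ISO 8601 timestamps."

# Placeholder endpoints and model names - substitute whatever you actually run.
backends = {
    "hosted-closed-model": OpenAI(),  # uses OPENAI_API_KEY and the default base_url
    "local-open-model": OpenAI(base_url="http://localhost:8000/v1", api_key="unused"),
}
models = {"hosted-closed-model": "gpt-4", "local-open-model": "minimax-m2"}

for name, client in backends.items():
    start = time.perf_counter()
    reply = client.chat.completions.create(
        model=models[name],
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s, {len(reply.choices[0].message.content)} chars")
```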

  • View profile for Chris Poulter

    Leaders in Open Source Intelligence

    9,798 followers

    The hot debate the last couple of days is DeepSeek vs "Western" models. Leaving aside financial impacts, there are interesting points about nation-state competition and whether you should use DeepSeek. Two key issues are potential bias or censorship in its training data (e.g., ignoring Tiananmen Square) and concerns about whether your data ends up in China. These matter, but note that DeepSeek-R1 is open-source, so you can run it offline/locally or in a secure environment, without using DeepSeek the App.

    🤔 Remember, an app is just an interface to the underlying model. ChatGPT is a US-based interface for OpenAI's closed-source models (4o, o1, etc.). Mistral is French. Using those apps means you share data with their providers (potentially for further training or other reasons). Alternatively, you can use open-source models (like DeepSeek) with free, offline "chat" interfaces.

    ✅ Two of the easiest to use are GPT4All (https://lnkd.in/g4ANvWjD) and AnythingLLM (https://anythingllm.com/). They provide GUIs for Windows, Linux, and macOS - no complex command-line steps needed. You just install, download the model you want (often from Hugging Face), and start chatting. Both also offer Retrieval Augmented Generation (RAG), so you can load your own documents offline, build embeddings automatically, and then chat securely while referencing your own materials. If you want the power of open-source models like DeepSeek but want to avoid data privacy issues, these tools are worth trying. Additionally, AnythingLLM includes Agents for scraping websites and browsing within your chats. I'll be running a webinar soon on how this benefits OSINT collection and analysis.

    🤔 There's also controversy about using models from adversarial nations or models trained in ways that may conflict with "our" social norms. Without debating any specific stance, there's value in testing how such models might derive conclusions or support decision cycles - especially important for intelligence analysts who must consider how adversaries might use AI. It's akin to studying adversarial doctrine, but more dynamic. Likewise, cyber threat-actors are developing custom GPTs. Understanding their capabilities is crucial for cybersecurity teams to form effective mitigations.

    ✅ Ultimately, DeepSeek's alternative development approach is a net positive. It demonstrates potential compute and energy savings vital to sustainability, and lowers the barrier to entry for those without the resources of Big Tech. Startups can now fine-tune advanced open-source models with reduced CAPEX/OPEX, bringing new solutions to market faster. This also helps investors diversify AI funding and support a broader array of startups. Growing the AI ecosystem benefits everyone tackling "wicked problems" worldwide, as AI will undoubtedly play a direct or indirect role in solving them.

    #OSINT #GAI #AI
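
For readers who prefer a few lines of code to a GUI, the same run-it-locally pattern looks roughly like this with the gpt4all Python package (an assumption; the model file name below is a placeholder that GPT4All downloads on first use and then runs entirely on your machine).

```python
# Minimal "run it locally" sketch, assuming the gpt4all package (pip install gpt4all).
# The model file name is a placeholder; any locally available GGUF model works.
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

with model.chat_session():
    reply = model.generate(
        "Summarise the key data-privacy trade-offs between open and closed LLMs.",
        max_tokens=300,
    )
    print(reply)  # nothing in this exchange leaves your machine
```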

  • View profile for Kaushik Shakkari

    Lead Applied Data Scientist | AI Manager | Championing Traditional & LLM-Powered GenAI Solutions for Unstructured Data | Mentor & AI Thought Leader | Driving Product Innovation & Digital Transformation

    8,118 followers

    I often get this question from industry friends and peers: should we lean on commercial offerings like OpenAI's black-box API, or immerse ourselves in open-source LLMs? 🌍🤔

    For those in the prototype phase 🛠, **use OpenAI's API**:
    - It's budget-friendly for early-stage projects. 🪙
    - A single API key, and voila, integration is complete! 🔑
    - This allows for swift product idea validation. ✅

    However, for long-term visionaries 🚀🌌, **invest in open-source local LLMs**:
    - Establish a custom LLM setup within your cloud infrastructure. ☁️
    - Focus on curating top-notch datasets that resonate with your business objectives. 📊🎯

    Considering this pivot? Here's why 🤓:
    1️⃣ **Optimal Performance & Savings**: Tailored models often surpass giants like GPT-4, especially on a specific dataset. They're not only effective but also economically wise. 💡💰
    2️⃣ **Guardian of Data**: In our data-driven age, LLMs thrive on premium data. Ensure your data's privacy by keeping it close. 🛡️🔒 Sending data over third-party channels might expose vulnerabilities. 🚫
    3️⃣ **Flexibility in Strategy**: Transitioning back to APIs like OpenAI's is straightforward, yet standing up a proprietary LLM is more complex to do later. Investing now paves the way for future adaptability. 🌳🔄
    4️⃣ **Customization & Control**: With open-source LLMs, you have the autonomy to tweak and refine models to your heart's content, ensuring they align perfectly with your requirements. 🎛️🔧
    5️⃣ **In-House Expertise**: Building in-house capabilities elevates your team's knowledge, making them not just users but experts in LLM technology. 🎓💼
    6️⃣ **Future-Proofing**: Technology and regulations are evolving. Having control over your LLM means you can swiftly adapt to changes without waiting for third-party updates. ⏱️📈
    7️⃣ **Cost Predictability**: With third-party APIs, costs can surge with usage. In-house LLMs allow for more predictable budgeting in the long run. 📉💼

    Would love to engage in a discussion and get insights from others in the field. Drop your thoughts below! 💭
    #llms #languagemodels #openai #genai #deployment #production #datascience #artificialintelligence #largelanguagemodels
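
Point 3️⃣ (flexibility) is easier to cash in on if the model call sits behind a thin interface from day one. The sketch below is hypothetical (the class, endpoint URL, and model names are made up for illustration), but it shows how a prototype on OpenAI's API and a later self-hosted open-source model can share one code path, since many local inference servers such as vLLM or Ollama expose an OpenAI-compatible endpoint.

```python
# Hypothetical sketch: keep the model call behind one thin interface so a prototype
# built on OpenAI's API can later point at a self-hosted open-source model.
from openai import OpenAI

class ChatBackend:
    def __init__(self, model: str, base_url: str | None = None, api_key: str | None = None):
        # base_url=None -> hosted OpenAI; base_url set -> your own infrastructure
        self.model = model
        self.client = OpenAI(base_url=base_url, api_key=api_key)

    def complete(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

# Prototype phase: hosted API (reads OPENAI_API_KEY from the environment).
prototype = ChatBackend(model="gpt-4")

# Later: the same interface against a self-hosted open-source model.
in_house = ChatBackend(
    model="mistral-7b-instruct",          # placeholder model name
    base_url="http://localhost:8000/v1",  # e.g. an OpenAI-compatible server inside your VPC
    api_key="not-needed-for-local",
)
print(in_house.complete("Classify this support ticket: 'My invoice is wrong.'"))
```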

  • View profile for Liora Kern

    CEO @The Think Room. Want to stay visible, credible & human in a world run by algorithms? We unlock your DNA, so you stand out in a tsunami of sameness. Then we make you a LinkedIn rockstar 🎸found by LLMs.

    8,397 followers

    🚨 The biggest risk isn't that you fall behind on AI. It's that you build everything on someone else's infrastructure, and can't afford to leave when you need to.

    Relying entirely on third-party AI might feel fast and flexible, but it comes with hidden risks.
    🔹 What happens if your provider goes offline overnight? (ChatGPT?)
    🔹 What if they triple prices once you're locked in? (Copilot?)
    🔹 What if political decisions restrict usage across borders? (All American LLMs?)
    🔹 What if you have trained an AI on your knowledge, voice and systems, and you need to move?

    Adopting AI is one thing; turning it into a long-term asset is something else entirely. Here's why ownership matters more than ever, and what it actually looks like in practice:

    1. Protect yourself from platform risk
    OpenAI has faced internal conflicts and is navigating major financial and regulatory pressure. If a company like that folds, or becomes US-only, your entire workflow could be compromised.

    2. Own your intelligence
    Every prompt your team types into ChatGPT helps improve OpenAI's model. You're teaching their system, not yours. If you build on open-source models like Mistral AI, Llama or Mixtral, every query and correction sharpens your own intelligence layer. Over time, it becomes more accurate, more aligned and more useful, just for you.

    3. Control your costs 💰
    Copilot already costs €30 per user each month. That adds up quickly, especially for growing teams. And this is just the beginning: pricing is expected to increase as soon as companies are firmly locked in. With your own model, you manage infrastructure costs directly.

    4. Guarantee data security
    APIs often come with unclear data handling. Even if they claim not to store your inputs, there's rarely full transparency. Hosting your own model means your data never leaves your environment. No grey zones.

    5. Build a strategic edge
    Generic models are trained for general use. They don't speak your sector's language or understand your internal logic. With a proprietary model, your model learns your world.

    6. Future-proof your strategy
    The AI landscape is shifting fast. Today's most dominant players might not be tomorrow's. Don't build everything around one provider. Creating your own capability keeps you flexible. You can adapt, test, or even integrate across providers without starting from scratch.

    7. Align your tech with your values
    Language matters. So do ethics. With your own model, you decide how it communicates, what it avoids, and what kind of guidance it gives.

    📌 The bottom line
    Using AI is no longer enough. The real value comes from building intelligence you actually own, trained on your data, aligned with your standards, and designed for your future. Can you afford NOT to invest in your own future?

    Doron Themans Thomas Van Oekelen Bart Van Mol Sean Hayes The Think Room

    👋🏼 I'm Liora. I help professionals & organisations stay visible, credible, and human in a world run by algorithms.
    #communication #AI

  • View profile for Shanti Mohan

    Founder CEO @LVX || General Partner, SamVed

    29,107 followers

    Evaluating DeepSeek for #Founders: Data Privacy, Bias, and Adoption Considerations

    I am part of a #founder WhatsApp group where an interesting discussion arose around DeepSeek's privacy settings and data access requirements. This post is motivated by that discussion. Though DeepSeek (DS) has gained traction for its high performance at a lower cost, reportedly rivalling OpenAI's o1 model, there are concerns around bias, censorship, and data privacy.

    1. The DS Model: Faster and Open-Source, but What About Bias?
    DS is an open-source model with "permissive licensing, allowing commercial use". However, its training data remains undisclosed, raising concerns about potential bias from #Chinese political, cultural, or regulatory influences. For companies building global or unbiased applications, this lack of transparency presents a risk. Bias in foundational models is difficult to correct without full retraining, which is resource-intensive.

    2. Hosting Options and Data Privacy: Who Controls Your Data?
    - Official DS App (Chinese servers) → High risk for sensitive data due to Chinese data regulations.
    - Third-Party Hosting (Perplexity, Hugging Face, etc.) → Hosted on US/EU servers, shifting risks to these providers.
    - Self-Hosting (On-Prem or Private Cloud) → Best for privacy and control, but requires infrastructure and engineering resources.

    3. Where Should DS Be Used? (This decision framework applies beyond DS.)
    ✅ Early-Stage Startups (MVP Testing): Use DS, Llama, or Mistral to test ideas cheaply. Best for non-sensitive applications (chatbots, summarization, RAG). Hosting via Perplexity, Hugging Face, etc. is viable.
    ✅ Growth-Stage Startups (Scaling AI in Production): Invest in self-hosting for better model control and data privacy. Hybrid approach: fine-tune open models on proprietary data while keeping inference on secure servers.
    ✅ Enterprise AI Deployment: Self-hosting is essential for enterprises dealing with customer data or regulated industries. Likely to combine fine-tuned proprietary models with open-source alternatives.

    For startups, early testing can be done via third-party platforms, but production use should prioritize data control. DS is cost-effective, but the trade-off is nuanced: bias and data privacy risks must be weighed carefully. Fine-tuning will be key, as long-term differentiation comes from proprietary datasets and custom AI models, not just off-the-shelf solutions.

    Finally, an important lesson: much of startup success and user adoption rests on perception and trust.

    Would love to hear thoughts - what AI models are you betting on?
    LetsVenture SamVed VC
    #Founders #AIDecisions
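
The self-hosting option in section 2 can be trialled on a single workstation before committing to infrastructure. Below is a minimal sketch, assuming the ollama Python package and a locally pulled distilled DeepSeek-R1 model; the model tag is illustrative.

```python
# Minimal self-hosting sketch, assuming Ollama is installed locally and a
# distilled DeepSeek-R1 model has been pulled (e.g. `ollama pull deepseek-r1:8b`).
# Prompts and outputs stay on your own machine rather than going to the DS app.
import ollama  # pip install ollama; talks to the local Ollama server

response = ollama.chat(
    model="deepseek-r1:8b",  # illustrative tag; use whichever local model you pulled
    messages=[
        {"role": "user", "content": "Draft a privacy-review checklist for adopting a new LLM."}
    ],
)
print(response["message"]["content"])
```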

  • Meta's release of the #llama2 open-source #genai model can solve many data security problems.

    Llama 2's release for commercial use clearly threatens OpenAI and hyperscaler-owned GenAI; it is rocking the GenAI universe. Its performance is closing in on GPT-4 and Claude-2. See https://lnkd.in/eNWpj8Mr

    When run privately, Llama 2 solves the nagging and persistent privacy and data confidentiality problems organizations have in trusting company secrets to third-party GenAI model hosters like Microsoft, OpenAI, Google and Amazon Web Services (AWS).

    Gartner clients are justifiably concerned about data privacy when it comes to using hosted LLMs. Their black boxes don't instill confidence that prompts, completions and other enterprise data are safe and secure. It requires that users TRUST WITH NO ABILITY TO VERIFY. There are also too many problems with #llm hallucinations, copyright issues, misinformation, liability, libel, and plain old inaccuracies.

    Privately hosted open-source models can solve many of these problems more easily than trying to adapt the use of public hyperscaler models to enterprise security and risk management requirements.

    We think security, privacy and risk management requirements will drive adoption of open-source models like Llama 2 that can be privately hosted. We talk with many Gartner clients every day who find this option attractive, provided they can find the resources to manage it.

    Of course, new skill sets will be required to make privately hosted GenAI models work effectively, and that costs money. But those costs will surely be offset by what you save on third-party hosting vendors' transaction and license fees, and on the measures otherwise needed to ensure data confidentiality, security, and privacy.

    After all, if and when company secrets are breached in a hosted LLM, the enterprise user is stuck with the resulting costs and liabilities; third-party GenAI hosting vendors are absolved. Open-source GenAI models may create other security and safety issues caused by hackers and other bad actors, but they start to address everyday issues that enterprise users face.

    https://lnkd.in/eR8WeU5i
    #cybersecurity #aisecurity #datasecurity #ai
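
To make "run privately" concrete, here is a minimal sketch assuming the Hugging Face transformers package and Llama 2 weights that were already downloaded under Meta's license; the offline environment variables keep the run from contacting the Hub at all.

```python
# Sketch of private hosting: load already-downloaded Llama 2 weights with
# Hugging Face transformers and force offline mode so no request leaves the host.
# Assumes you have accepted Meta's license and cached the model locally beforehand.
import os
os.environ["HF_HUB_OFFLINE"] = "1"          # fail fast instead of contacting the Hub
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # resolved from the local cache only
)
out = generator(
    "List three data-privacy advantages of self-hosting an LLM.",
    max_new_tokens=150,
)
print(out[0]["generated_text"])
```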

  • View profile for Jessica S.

    Doctoral Candidate, Strategic Intelligence | OSINT Expert

    3,830 followers

    Reuters reported this week that OpenAI is fighting a court order to turn over about 20 million ChatGPT conversations tied to a lawsuit. Read it here: https://lnkd.in/eriJiKrC

    It's so easy to treat GPTs like an assistant, friend, or colleague. It's also easy to forget that we are subscribers to these tools rather than their owners, which means we don't keep control of our information. Unfortunately, this lawsuit is a huge reminder that not all versions of these tools work the same way, and that people don't know (or don't care until it's too late) how their data is handled behind the scenes.

    The consumer versions of ChatGPT, both free and paid, keep interaction data to train the model (yes, you can control this a little bit, but I'd be curious to see what data really is retained on the back end that may be turned over in this suit). Enterprise and API setups can be configured for zero data retention, which means the prompts and outputs are deleted after processing. But do your employees know that, and are they using your model rather than the publicly available one?

    If you work with client information, case material, or anything operational, it should not run through a consumer account. Keep your exploratory testing there if you need it, but move real work to enterprise or API environments where you know how the data is handled. Keep your own record of what you did, when you did it, and which model you used so your process is clear and repeatable.

    You also have the option to host your own AI models offline so you retain full control over your data. Running open-source models on your own hardware gives you direct control over what is stored, deleted, or logged.

    Be deliberate about where you put your data. I imagine it's not long before something becomes public about someone violating an NDA by putting protected information into a public system like this and training a model on it by accident. Be careful out there!

    #OSINT #AI #DataPrivacy #Governance #Investigations #OpenSourceIntelligence #PangeaResearch
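
The record-keeping advice above can be automated with a few lines. Below is a hypothetical sketch (field names and file path are made up) that appends one JSON line per AI interaction and stores a hash of the prompt rather than the prompt itself, so sensitive text is not copied into yet another file.

```python
# Minimal sketch of the "keep your own record" advice: append one JSON line per
# AI interaction with a timestamp, the model used, and a hash of the prompt.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("ai_usage_log.jsonl")  # illustrative location

def log_interaction(model: str, prompt: str, purpose: str) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "purpose": purpose,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    with LOG_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_interaction(
    model="gpt-4 (enterprise workspace)",
    prompt="Summarise the attached due-diligence notes.",
    purpose="case research",
)
```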

  • View profile for Nicholas Nouri

    Founder | APAC Entrepreneur of the year | Author | AI Global talent awardee | Data Science Wizard

    131,205 followers

    Who Owns "Public" Data, and How Far Does That Right Extend?

    Recent news suggests Microsoft and OpenAI may be investigating whether certain banned accounts - allegedly tied to DeepSeek - used OpenAI's API to "distill" or replicate their large language models without having to shoulder the full cost of scraping and training. Meanwhile, OpenAI itself has faced legal questions about whether it used copyrighted materials freely available online to train its own models.

    The Core Tension
    - Public Data vs. Proprietary Access: If OpenAI's position is that training on vast amounts of publicly available info is acceptable - even critical - for AI's advancement, why shouldn't other companies do the same?
    - Cost and Early Investment: OpenAI's argument might rest on the idea that it invested heavily to develop these models first, so it's unfair for others to bypass the same expense. Yet the principle of "public data for all" can't be enforced selectively; either these datasets are fair game or they're not.
    - Distillation as a Long-Standing Technique: Knowledge distillation - where one model is used to refine or train another, often with reduced computational load - isn't new. It's been a recognized approach in AI for a while and is undeniably cost-effective.

    Legal and Ethical Questions
    - Copyright and Permissions: If creators have not explicitly granted permission for their content to be used as training data, does wide availability on the internet automatically grant AI developers the right to harvest it?
    - Level Playing Field: Should early movers in AI be able to limit others from using similar public data sources - or from employing recognized techniques like distillation - just because they invested heavily upfront?

    Ultimately, this debate touches on transparency, intellectual property, and the very norms of AI development. It raises the question: if open data is crucial for advancement, how do we ensure consistent rules that encourage innovation while respecting ethical and legal boundaries? Should companies be free to train their models on any public data or API output, or are more nuanced guardrails necessary to keep the playing field fair and innovation-driven?

    #innovation #technology #future #management #startups
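
For readers unfamiliar with the technique mentioned above, the sketch below shows the classic soft-label distillation loss in PyTorch. It is a generic textbook illustration of distillation and says nothing about what DeepSeek, Microsoft, or OpenAI actually did or detected.

```python
# Generic knowledge-distillation loss (textbook soft-label formulation), shown only
# to make the technique concrete; it does not describe any company's pipeline.
# The student mimics the teacher's softened output distribution while also fitting labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random tensors standing in for real model outputs.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```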

  • View profile for Beth Kanter
    Beth Kanter is an Influencer

    Trainer, Consultant & Nonprofit Innovator in digital transformation & workplace wellbeing, recognized by Fast Company & NTEN Lifetime Achievement Award.

    521,265 followers

    This Stanford study examined how six major AI companies (Anthropic, OpenAI, Google, Meta, Microsoft, and Amazon) handle user data from chatbot conversations. Here are the main privacy concerns:
    👀 All six companies use chat data for training by default, though some allow opt-out
    👀 Data retention is often indefinite, with personal information stored long-term
    👀 Cross-platform data merging occurs at multi-product companies (Google, Meta, Microsoft, Amazon)
    👀 Children's data is handled inconsistently, with most companies not adequately protecting minors
    👀 Privacy policies offer limited transparency: they are complex, hard to understand, and often lack crucial details about actual practices

    Practical takeaways for acceptable use policy and training for nonprofits using generative AI:
    ✅ Assume anything you share will be used for training - sensitive information, uploaded files, health details, biometric data, etc.
    ✅ Opt out when possible - proactively disable data collection for training (Meta is the one where you cannot)
    ✅ Information cascades through ecosystems - your inputs can lead to inferences that affect ads, recommendations, and potentially insurance or other third parties
    ✅ Special concern for children's data - age verification and consent protections are inconsistent

    Some questions to consider in acceptable use policies and to incorporate into any training:
    ❓ What types of sensitive information might your nonprofit staff share with generative AI?
    ❓ Does your nonprofit currently identify what counts as "sensitive information" (beyond PII) that should not be shared with generative AI? Is this incorporated into training?
    ❓ Are you working with children, people with health conditions, or others whose data could be particularly harmful if leaked or misused?
    ❓ What would be the consequences if sensitive information or strategic organizational data ended up being used to train AI models? How might this affect trust, compliance, or your mission? How is this communicated in training and policy?

    Across the board, the Stanford research finds that developers' privacy policies lack essential information about their practices. The researchers recommend that policymakers and developers address the data privacy challenges posed by LLM-powered chatbots through comprehensive federal privacy regulation, affirmative opt-in for model training, and filtering personal information from chat inputs by default. "We need to promote innovation in privacy-preserving AI, so that user privacy isn't an afterthought."

    How are you advocating for privacy-preserving AI? How are you educating your staff to navigate this challenge? https://lnkd.in/g3RmbEwD
