Article from NY Times: More than two years after ChatGPT's introduction, organizations and individuals are using AI systems for an increasingly wide range of tasks. However, ensuring these systems provide accurate information remains an unsolved challenge. Surprisingly, the newest and most powerful "reasoning systems" from companies like OpenAI, Google, and Chinese startup DeepSeek are generating more errors rather than fewer. While their mathematical abilities have improved, their factual reliability has declined, with hallucination rates rising in certain tests.

The root of the problem lies in how modern AI systems function. They learn by analyzing enormous amounts of digital data and use mathematical probabilities to predict the most likely response, rather than following strict human-defined rules about truth. As Amr Awadallah, CEO of Vectara and a former Google executive, explained: "Despite our best efforts, they will always hallucinate. That will never go away." This persistent limitation raises reliability concerns as these systems become increasingly integrated into business operations and everyday tasks.

6 Practical Tips for Ensuring AI Accuracy
1) Always cross-check every key fact, name, number, quote, and date in AI-generated content against multiple reliable sources before accepting it as true.
2) Be skeptical of implausible claims and consider switching tools if an AI consistently produces outlandish or suspicious information.
3) Use specialized fact-checking tools to verify claims efficiently without having to conduct extensive research yourself.
4) Consult subject matter experts for specialized topics where AI may lack nuanced understanding, especially in fields like medicine, law, or engineering.
5) Remember that AI tools cannot truly distinguish truth from fiction and rely on training data that may be outdated or contain inaccuracies.
6) Always perform a final human review of AI-generated content to catch spelling errors, confusing wording, and any remaining factual inaccuracies.
https://lnkd.in/gqrXWtQZ
Reasons Authors Are Concerned About AI Training Data
Summary
Authors and creators are increasingly worried about how artificial intelligence (AI) systems use their work for training purposes, raising concerns about copyright infringements, data quality, and misinformation. AI models often rely on vast amounts of training data sourced from the internet, which may include copyrighted, inaccurate, or biased information, thereby posing risks to content creators, businesses, and consumers alike.
- Ensure data integrity: Prioritize the use of high-quality, verified, and ethically sourced data to prevent inaccuracies and potential misuse in AI systems.
- Advocate for copyright respect: Support transparent agreements that give creators control over how their content is used for training AI, ensuring fair compensation and acknowledgment.
- Improve content monitoring: Regularly audit training datasets for biases, outdated information, or other inaccuracies that could compromise the reliability of AI systems.
-
If all you're monitoring is your agent's outputs, you're fighting a losing battle. Beyond embedding drift, output sensitivity issues, and the petabytes of structured data that can go bad in production, AI systems like agents bring unstructured data into the mix as well, and introduce all sorts of new risks in the process. When documents, web pages, or knowledge base content form the inputs of your system, poor data can quickly cause AI systems to hallucinate, miss key information, or generate inconsistent responses. That means you need a comprehensive approach to monitoring to resolve it.

Issues to consider:
- Accuracy: Content is factually correct, and any extracted entities or references are validated.
- Completeness: The data provides comprehensive coverage of the topics, entities, and scenarios the AI is expected to handle; gaps in coverage can lead to “I don’t know” responses or hallucinations.
- Consistency: File formats, metadata, and semantic meaning are uniform, reducing the chance of confusion downstream.
- Timeliness: Content is fresh and appropriately timestamped to avoid outdated or misleading information.
- Validity: Content follows expected structural and linguistic rules; corrupted or malformed data is excluded.
- Uniqueness: Redundant or near-duplicate documents are removed to improve retrieval efficiency and avoid answer repetition.
- Relevance: Content is directly applicable to the AI use case, filtering out noise that could confuse retrieval-augmented generation (RAG) models.

While many of these dimensions mirror data quality for structured datasets, semantic consistency (ensuring concepts and terms are used uniformly) and content relevance are uniquely important for unstructured knowledge bases, where clear schemas and business rules often don't exist.

Of course, knowing when an output is wrong is only 10% of the challenge. The other 90% is knowing why, and how to resolve it fast. 1. Detect. 2. Triage. 3. Resolve. 4. Measure. Anything less and you aren't AI-ready. #AIreliability #agents
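To make a few of these dimensions concrete, here is a minimal sketch of what timeliness, validity, and uniqueness checks could look like for a small in-memory knowledge base. It is not from the original post or any particular product; the Document structure, the thresholds, and the token-overlap duplicate check are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Document:
    doc_id: str
    text: str
    updated_at: datetime  # assumed metadata field

def is_stale(doc: Document, max_age_days: int = 365) -> bool:
    """Timeliness: flag documents older than an assumed freshness window."""
    return datetime.now(timezone.utc) - doc.updated_at > timedelta(days=max_age_days)

def is_malformed(doc: Document, min_chars: int = 20) -> bool:
    """Validity: flag empty, truncated, or badly encoded text."""
    text = doc.text.strip()
    return len(text) < min_chars or "\ufffd" in text  # replacement char hints at encoding damage

def jaccard(a: str, b: str) -> float:
    """Uniqueness: crude token-set similarity for near-duplicate detection."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def audit(docs: list[Document], dup_threshold: float = 0.9) -> dict:
    report = {"stale": [], "malformed": [], "near_duplicates": []}
    for d in docs:
        if is_stale(d):
            report["stale"].append(d.doc_id)
        if is_malformed(d):
            report["malformed"].append(d.doc_id)
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if jaccard(a.text, b.text) >= dup_threshold:
                report["near_duplicates"].append((a.doc_id, b.doc_id))
    return report

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    docs = [
        Document("kb-001", "Warranty covers parts and labor for 12 months.", now),
        Document("kb-002", "Warranty covers parts and labor for 12 months.", now - timedelta(days=800)),
        Document("kb-003", "???", now),
    ]
    print(audit(docs))
```

The same idea scales by swapping the toy token-overlap check for MinHash or embedding similarity and running the audit on a schedule against the knowledge base feeding the agent.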
-
Here are a few simple truths about Data Quality:
1. Data without quality isn't trustworthy
2. Data that isn't trustworthy isn't useful
3. Data that isn't useful is low ROI

Investing in AI while the underlying data is low ROI will never yield high-value outcomes. Businesses must put as much time and effort into the quality of their data as into the development of the models themselves.

Many people see data debt as just another form of technical debt: it's worth it to move fast and break things, after all. This couldn't be more wrong. Data debt is orders of magnitude WORSE than tech debt. Tech debt results in scalability issues, but the core function of the application is preserved. Data debt results in trust issues, when the underlying data no longer means what its users believe it means. Tech debt is a wall, but data debt is an infection. Once distrust seeps into your data lake, everything it touches will be poisoned. The poison works slowly at first, and data teams might be able to keep up manually with hotfixes and filters layered on top of hastily written SQL. But over time, the spread of the poison will be so great and so deep that it becomes nearly impossible to trust any dataset at all. A single low-quality dataset is enough to corrupt thousands of data models and tables downstream. The impact is exponential.

My advice? Don't treat Data Quality as a nice-to-have, or something you can afford to 'get around to' later. By the time you start thinking about governance, ownership, and scale, it will already be too late, and there won't be much you can do besides burning the system down and starting over. What seems manageable now becomes a disaster later. The earlier you get a handle on data quality, the better. If you even suspect the business may want to use the data for AI (or some other operational purpose), start thinking about the following (a small sketch of how some of these expectations can be written as checks follows below):
1. What will the data be used for?
2. What are all the sources for the dataset?
3. Which sources can we control, and which can we not?
4. What are the expectations of the data?
5. How sure are we that those expectations will remain the same?
6. Who should be the owner of the data?
7. What does the data mean semantically?
8. If something about the data changes, how is that handled?
9. How do we preserve the history of changes to the data?
10. How do we revert to a previous version of the data/metadata?

If you can affirmatively answer all 10 of those questions, you have a solid foundation of data quality for any dataset and a playbook for managing scale as the use case or intermediary data changes over time. Good luck! #dataengineering
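Question 4 in particular ("What are the expectations of the data?") becomes far more useful once those expectations are written down as executable checks rather than tribal knowledge. Below is a minimal, hypothetical sketch of that idea in plain Python; the field names and rules are invented for illustration, and in practice tools such as Great Expectations or dbt tests formalize the same pattern.

```python
from datetime import date

# Hypothetical example rows; in practice this would be a table or dataframe.
orders = [
    {"order_id": "A-100", "amount_usd": 129.99, "order_date": date(2024, 5, 1)},
    {"order_id": "A-101", "amount_usd": -5.00,  "order_date": date(2024, 5, 2)},
    {"order_id": None,    "amount_usd": 49.50,  "order_date": date(2024, 5, 3)},
]

# Expectations of the data, written down as executable checks.
EXPECTATIONS = {
    "order_id is present": lambda r: r["order_id"] is not None,
    "amount_usd is non-negative": lambda r: r["amount_usd"] >= 0,
    "order_date is not in the future": lambda r: r["order_date"] <= date.today(),
}

def check_expectations(rows):
    """Return a map of expectation name -> list of offending row indexes."""
    failures = {name: [] for name in EXPECTATIONS}
    for i, row in enumerate(rows):
        for name, check in EXPECTATIONS.items():
            if not check(row):
                failures[name].append(i)
    return failures

if __name__ == "__main__":
    for name, bad_rows in check_expectations(orders).items():
        status = "OK" if not bad_rows else f"FAILED on rows {bad_rows}"
        print(f"{name}: {status}")
```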
-
The "Where's the Beef?" Problem: Why 85% of #AI Projects Are Failing

In 1984, a simple question, "Where's the beef?", brilliantly highlighted a lack of substance. Forty years later, the world of #ArtificialIntelligence faces its own substance crisis, and the question is now: "Where's the data?"

Organizations are investing billions in sophisticated models and generative AI, only to be disappointed by the results. The issue isn't the algorithm; it's the ingredients. We're building AI on a diet of incomplete, #biased, or #syntheticdata and expecting revolutionary outcomes.

The Critical Lesson for Leaders: High-quality, real-world data is not just a component of an AI strategy; it is the foundation. According to #IBM and other industry reports, a staggering 85% of AI projects fail to deliver on their promises, with poor data quality being the primary culprit. This is the difference between an AI that is a parlor trick and one that drives real-world clinical and business value.

To achieve true digital transformation, we must shift our focus from just building models to curating the data that fuels them. It's time to demand substance. It's time to feed your clinic real data.

What are the most significant data challenges you face in your AI initiatives? I'd like to discuss below.

#DrGPT #ArtificialIntelligence #HealthcareAI #HealthIT #DataAnalytics #MachineLearning #DigitalTransformation #Leadership #Innovation #RealWorldData #ClinicalData #BigData #wendys #wheresthebeef Arlen Meyers, MD, MBA, Scott J. Campbell MD, MPH Ramon Julian Pesigan MD MBA
-
𝗪𝗵𝗮𝘁 𝗽𝗶𝘇𝘇𝗮 𝗮𝗻𝗱 𝗰𝗵𝗲𝗲𝘀𝗲 𝘁𝗲𝗮𝗰𝗵 𝘂𝘀 𝗮𝗯𝗼𝘂𝘁 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆:

LLM providers have been training their models on public data, for example from Twitter and Reddit, leading to concerns over the content they've learned from. So they have been striking licensing deals with content providers to get access to their data, and that creates new challenges. Datasets obtained from the public Internet contain false information, sarcasm, and potentially harmful content. Given that Generative AI, unlike humans, has no understanding of common sense and nuance, this can backfire quickly.

An AI-augmented Google search recently recommended adding non-toxic glue to your pizza to prevent the cheese from sliding off. (Don't try this at home.) The Internet traced the information back to a decade-old Reddit thread that the model had presumably processed and incorporated into its AI-generated output.

Think about autonomous agents that will book your travel, negotiate a contract with your supplier, or provide information about your products, parts, and warranties. Mishaps in any of these examples due to bad data can have a real impact on your business: from ending up in the wrong location at the wrong time to overpaying, causing damage to your customers' assets, and more.

Spending extra effort to review, clean, and correct your datasets remains key. So does attributing generated information to the exact source document or dataset, so that your users have a reference point to verify whether the generated output is actually correct (a minimal sketch of source attribution in a RAG-style answer follows below). Otherwise, you might end up with the business equivalent of suggesting glue to keep the cheese from sliding off your pizza. A sticky situation.

Read the article 👇🏻 for the full details and get the next one in your inbox tomorrow.

𝗜𝘀 𝘁𝗵𝗲 𝗼𝗹𝗱 𝘀𝗮𝘆𝗶𝗻𝗴 𝗲𝘃𝗲𝗿 𝗺𝗼𝗿𝗲 𝗿𝗲𝗹𝗲𝘃𝗮𝗻𝘁? —> “𝘋𝘰𝘯’𝘵 𝘵𝘳𝘶𝘴𝘵 𝘦𝘷𝘦𝘳𝘺𝘵𝘩𝘪𝘯𝘨 𝘺𝘰𝘶 𝘳𝘦𝘢𝘥 𝘰𝘯 𝘵𝘩𝘦 𝘐𝘯𝘵𝘦𝘳𝘯𝘦𝘵.” #ArtificialIntelligence #GenerativeAI #IntelligenceBriefing
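As a rough illustration of the attribution point (not the author's implementation), here is a toy retrieval step that carries the source document's identifier alongside the answer, so a human has something concrete to verify against. The tiny knowledge base, the keyword-overlap scoring, and the field names are all assumptions made for the example.

```python
# A toy retrieval step that returns the supporting source alongside the answer,
# so a human can verify the claim against the original document.
KNOWLEDGE_BASE = [
    {"source_id": "warranty-policy-2024.pdf", "text": "The standard warranty covers parts and labor for 24 months."},
    {"source_id": "faq-returns.md", "text": "Products can be returned within 30 days with the original receipt."},
]

def keyword_score(query: str, text: str) -> int:
    """Count overlapping keywords between the query and a document (toy retriever)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def answer_with_attribution(query: str) -> dict:
    best = max(KNOWLEDGE_BASE, key=lambda doc: keyword_score(query, doc["text"]))
    return {
        "answer": best["text"],              # in a real system, an LLM would rephrase this
        "cited_source": best["source_id"],   # attribution users can check
    }

if __name__ == "__main__":
    print(answer_with_attribution("how long does the warranty cover parts"))
```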
-
The Risks of Misinformation in AI: Lessons from Healthcare for AdTech and MarTech.

A recent study from NYU highlights a critical vulnerability in large language models (LLMs): if even 0.001% of training data is poisoned with misinformation, the integrity of the entire model can be compromised. While the study focuses on biomedical LLMs, where the stakes are life and death, it serves as a stark warning for all industries leveraging AI.

In healthcare, unconsented, low-quality, or maliciously injected data can lead to disastrous outcomes, compromising patient safety and eroding trust. The research underscores the importance of data provenance, transparency, and stringent safeguards in training AI models.

This cautionary tale is highly relevant for the AdTech and MarTech industries. Just as healthcare professionals must ensure that AI tools don't hallucinate medical advice, marketing professionals must ensure AI-driven decisions don't propagate biased or tainted insights. Poorly sourced or unconsented data can erode consumer trust, violate privacy regulations, and undermine brand integrity, issues that are only becoming more critical as privacy regulations tighten and third-party data becomes obsolete.

The healthcare sector's rigorous focus on accuracy and data compliance could be a leading indicator for the future of AdTech and MarTech. Consider these parallels:
- Consent is King: In healthcare, patient data must be protected and shared transparently. Similarly, marketing leaders must adopt robust consent frameworks to build consumer trust.
- Provenance Matters: Just as healthcare LLMs require clean, verifiable training data, marketing models must be free of unverified or compromised data sources (a simple illustration follows below).
- Risk Amplification: Misinformation in an LLM doesn't just stop at flawed outputs; it perpetuates systemic failures across its entire ecosystem, whether it's diagnosing illnesses or driving ad spend.

At the core of this challenge lies a fundamental truth: privacy, accuracy, and performance must coexist in any AI-powered system. As industries like healthcare push the boundaries of safeguarding sensitive data, it's a call to action for marketing and advertising to take the same approach: building models that not only drive results but also uphold ethical standards.

The path forward isn't going to be easy, but it's clear: data transparency, quality assurance "PROOF", and compliance are no longer optional. They're the foundation for innovation, trust, and long-term success.

#IABALM2025 #privacy #consent #proofofprovenance #advertising #marketing Precise.ai Qonsent https://lnkd.in/egjwwps2
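As a loose illustration of the "Provenance Matters" and "Consent is King" points (not drawn from the NYU study or any specific vendor), here is a hypothetical sketch of filtering a training corpus on provenance and consent metadata before it ever reaches a model. The field names, values, and the allowlist are invented for the example.

```python
# Hypothetical training records with provenance and consent metadata.
records = [
    {"text": "Verified product FAQ content.", "source": "licensed-publisher", "consent": True,  "verified": True},
    {"text": "Scraped forum comment.",        "source": "public-scrape",      "consent": False, "verified": False},
    {"text": "First-party survey response.",  "source": "crm-export",         "consent": True,  "verified": True},
]

TRUSTED_SOURCES = {"licensed-publisher", "crm-export"}  # assumed allowlist

def eligible_for_training(record: dict) -> bool:
    """Keep only consented, verified records from known, trusted sources."""
    return (
        record["consent"]
        and record["verified"]
        and record["source"] in TRUSTED_SOURCES
    )

training_set = [r for r in records if eligible_for_training(r)]
print(f"{len(training_set)} of {len(records)} records eligible for training")
```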
-
The New York Times just dropped a bombshell report revealing a desperate AI race in which Big Tech is taking real legal risks, and a lack of high-quality training data threatens the future of AI. TL;DR: More data = better AI models, so Google, Facebook, and OpenAI have all been doing sketchy stuff to ingest more data.

▶ There's a good chance both Google and OpenAI violated the copyright of YouTube creators by harvesting their videos for AI training. And OpenAI most likely violated Google's terms of service.
▶ Google knew what OpenAI was doing but didn't say anything, because they didn't want to bring attention to the fact that they were maybe probably violating copyright law.
▶ In an effort to keep up, Meta considered BUYING SIMON AND SCHUSTER to use its books as training data, which is seriously making me reconsider going with a publisher vs. self-publishing for my next book.
▶ OpenAI CEO Sam Altman publicly acknowledges that we're running out of high-quality, human-generated text to use as training data, and this is a huge problem because training AI models on AI-created text seems to cause model collapse.
▶ In turn, Google is maybe mining our Google Docs for data? No one really knows!

The big implications are these:
1) The valuation of tech companies with AI models relies on the idea that their use of copyrighted material as training data isn't a copyright violation. If that changes, the entire AI ecosystem collapses.
2) A lack of high-quality training data is a real risk. Meta, Google, and OpenAI ARE likely going to run out of training data by 2026, and unless synthetic data gets better, it's unclear whether these models can progress. (The proliferation of AI-generated content on the web is also a big threat.)

Incredible reporting by the Times, and a must-read: https://lnkd.in/eA6VP9fp
-
Generative -- OR "degenerative" -- AI? A huge debate is whether AI progressively devolves without much more human oversight. Early evidence suggests that it does, underscoring the need for human involvement.

A great story in The New York Times explains that the increasing prevalence of AI-generated content on the internet is raising major questions, particularly regarding the potential degradation of AI systems over time. OpenAI generates "about 100 billion words per day", a volume equivalent to a million novels daily, according to Sam Altman. A portion of this content inevitably ends up online. As a result, it is increasingly difficult to distinguish between human and AI-generated content, and there are currently no foolproof methods to detect it.

What's the problem? AI systems may inadvertently train on their own generated content, creating a feedback loop that leads to a decline in quality, a phenomenon referred to as "model collapse." Some have found that GenAI trained on its own outputs can produce "a narrower range of AI output over time." This process, described by researchers as the model becoming "poisoned with its own projection of reality," can lead to various issues, such as distorted images or reduced linguistic diversity in AI-generated text.

Moreover, this problem is not confined to text alone. Another study found that repeatedly training an image-generating AI on its own outputs led to "glitches and image artifacts," ultimately producing distorted images. The AI started "drifting into parts of the space that are like a no-fly zone," said Richard Baraniuk, one of the authors.

The erosion of diversity in AI outputs is another critical issue. As AI systems rely more on synthetic data, their outputs become less varied and more prone to biases. This decline in diversity is described as "a hidden danger" and might not be noticed until it is too late.

What's the moral of the story? The only thing worse than no data is bad data, so companies need to do first things first: before running (using GenAI), walk (establish reliable data pipelines). This could involve paying for curated data rather than indiscriminately harvesting content from the internet, but that too requires an overhaul, since the market for vetting data has its own set of challenges. Lots of progress, but also important challenges to confront. #genai #aiethics #modelcollapse
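To illustrate the feedback-loop mechanism behind model collapse (a toy simulation, not the experiments from the studies cited above), the sketch below repeatedly fits a normal distribution to samples drawn from its own previous fit. With small samples, the estimated spread tends to shrink across generations, which is the "narrower range of output" the post describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data drawn from a wide distribution.
mean, std = 0.0, 1.0
sample_size = 20      # small samples make the effect visible quickly
generations = 25

for gen in range(1, generations + 1):
    # Draw a synthetic dataset from the current model of reality...
    data = rng.normal(mean, std, size=sample_size)
    # ...then refit the "model" on its own output (the feedback loop).
    mean, std = data.mean(), data.std()
    if gen % 5 == 0:
        print(f"generation {gen:2d}: mean={mean:+.3f}, std={std:.3f}")

# Typical outcome: std tends to drift toward zero over generations,
# i.e. the model's outputs become progressively less diverse.
```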
-
You might not think Meta’s decision to end fact-checking is a big deal… but it is for AI training data. By relying on its community to verify information, Meta is fundamentally changing how context and accuracy are determined.

AI models depend on vast amounts of data to learn and improve, but if the quality of that data becomes inconsistent or biased, it can lead to significant downstream issues, such as perpetuating misinformation or amplifying inaccuracies. This decision also adds complexity to the already challenging problem of misinformation in AI, especially given that models are often criticized for hallucinations: generating plausible-sounding but incorrect information. When the foundation of data is shaky, the ripple effects extend to how AI systems perform, make decisions, and are trusted in real-world applications.

This move raises an essential question: can crowdsourcing trust scale without compromising reliability? What’s your take: progress or a potential pitfall?
-
Just published a new Forbes article on a topic I’ve been thinking about for months, but was inspired to write about due to our #HumanIntelligence launch this week.

AI is moving fast, too fast in some cases. But there’s a problem no one is talking about. It’s not the language models. It’s not the algorithms. It’s the data.

#AI is only as good as the data we train it on. And right now, most organizations are feeding AI flawed, outdated, and incomplete data. Performance reviews, transactional information, lagging indicators: it’s all data that’s in the rearview mirror. Worse, AI is starting to train itself and double down on its own errors. If we don’t fix this, we’re just embedding bias and inaccuracy at scale.

This problem isn’t going to solve itself. As leaders, we need to rethink the data we value. At Workhuman, we’ve spent decades studying how recognition data (real, human moments of connection and impact) can surface the truth of how work happens. It’s a goldmine of authentic, human data that can change the game for how AI provides insights. It’s why we launched #HumanIntelligence 2025, and why I wrote this piece.

I believe AI can make work more human, but only if we get the data right.