Large language models (LLMs) have revolutionized AI application development, but they come with significant challenges. Chief among these is the tendency of these models to produce plausible but false information. The common term for this phenomenon, “hallucinations,” doesn’t fully capture the nature of these inaccuracies. Another crucial aspect of AI development is the evaluation of models, which traditionally relies on extensive human annotation. However, recent advancements in using synthetic data for autoevaluation promise substantial efficiency gains. This article delves into two recent papers that explore these issues, offering practical insights and recommendations for AI teams.
Part I: Hallucinations vs. Bullshit
The paper “ChatGPT is Bullshit” presents a compelling argument that the term “hallucinations” is a misnomer when describing the inaccuracies produced by ChatGPT and similar LLMs. Drawing on philosopher Harry Frankfurt’s definition, the authors assert that “bullshit” is a more accurate term. Frankfurt describes bullshit as a disregard for the truth, where the speaker is indifferent to the accuracy of their statements. This is contrasted with hallucinations, which imply a false perception or belief about reality.

The authors demonstrate that ChatGPT generates statements without regard for their truth value, aligning with Frankfurt’s concept of bullshit. This indifference to truth is not a failure to perceive reality correctly but a fundamental characteristic of the model’s design. The paper distinguishes “hard” bullshit (produced with an active intent to mislead the audience about the speaker’s concern for truth) from “soft” bullshit (produced with mere indifference to truth), arguing that ChatGPT produces at least soft bullshit, and hard bullshit if one attributes intentions to the system or its designers.
Analysis
Use Appropriate Terminology:
- AI developers, researchers, and communicators should adopt the term “bullshit” to describe the inaccuracies produced by LLMs. This term more accurately reflects the nature of these errors and avoids the misleading implications of terms like “hallucinations” or “confabulations.”
- Clear communication and appropriate response strategies depend on accurate terminology. Referring to LLM falsehoods as “bullshit” highlights the models’ indifference to truth, setting appropriate expectations for users and policymakers.
Address Indifference to Truth in AI Design:
- Focus on creating models that prioritize accuracy and reliability. Current methods to improve accuracy, such as linking LLMs to databases, are insufficient to address the root cause of truth indifference.
- Enhancing the truth-orientation of AI models can reduce misleading outputs and increase user trust. AI teams must take responsibility for model outputs and implement techniques like fact-checking integrations and stringent training protocols.
Implement Strict Validation Processes:
- AI teams must develop robust validation mechanisms to mitigate the risks associated with LLMs’ tendency to generate plausible-looking but potentially inaccurate outputs. Just as the term “bullshit” highlights LLMs’ indifference to truth, strict validation processes underscore the importance of verifying model outputs before deployment.
- By prioritizing accuracy and reliability, AI teams can reduce the instances of “bullshit” generated by LLMs and maintain user trust in AI-powered systems.
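To make the idea of a validation gate concrete, here is a deliberately minimal sketch. The function name `validate_output`, the word-overlap heuristic, and the 0.5 threshold are all my own illustrative choices, not anything from the paper; a real deployment would use retrieval cross-checks, an NLI model, or human review instead of this toy check:

```python
def validate_output(answer: str, sources: list[str], min_support: float = 0.5) -> bool:
    """Toy validation gate: accept an LLM answer only if enough of its
    sentences share content words with trusted source text.

    This word-overlap heuristic is purely illustrative -- a stand-in for
    whatever real validator (retrieval cross-check, NLI model, human
    review) a team actually deploys before shipping model outputs.
    """
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return False  # nothing to verify, so reject by default
    corpus = " ".join(sources).lower()
    # Count a sentence as "supported" if any of its longer words
    # appears somewhere in the trusted source text.
    supported = sum(
        1 for s in sentences
        if any(word in corpus for word in s.lower().split() if len(word) > 4)
    )
    return supported / len(sentences) >= min_support
```

The point is the shape of the pipeline, not the heuristic: outputs pass through an explicit, testable acceptance check rather than going straight to users.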
Part II: AutoEval Done Right
The second paper, “AutoEval Done Right: Using Synthetic Data for Model Evaluation,” explores the efficiency and reliability of autoevaluation methods using AI-labeled synthetic data for model evaluation. Autoevaluation can significantly reduce the need for human annotations, saving time and costs. However, the trustworthiness of synthetic labels can be an issue, especially for high-stakes AI applications.
To address this, the authors introduce Prediction-Powered Inference (PPI), a statistical tool that combines a small human-labeled dataset with a large synthetic dataset to obtain unbiased estimates of model performance with lower variance. PPI measures and corrects biases in synthetic data, allowing AI teams to benefit from autoevaluation while maintaining statistical validity.
The paper demonstrates the applicability of autoevaluation methods to different model types and metrics, showcasing their versatility across various domains and evaluation tasks.

While the authors did not use PPI to comprehensively evaluate a single LLM end-to-end, they did demonstrate using PPI with synthetic labels from GPT-4 to more efficiently rank and compare multiple LLMs based on pairwise human preferences. The synthetic data from GPT-4 allowed them to leverage more comparisons beyond just the human-labeled ones to get better estimates of the relative performance of the LLMs.
Specifically, the authors applied PPI to rank language models on the Chatbot Arena dataset, which contains both human and GPT-4 preferences over pairs of LLM answers to the same prompts. Folding in the abundant GPT-4 judgments tightened the performance estimates: the PPI-corrected rankings had lower variance than rankings computed from the human labels alone, effectively stretching the same human annotation budget further.
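Applied to pairwise preferences, the same correction yields a debiased win rate. The sketch below is mine, not the paper’s code: the function name `ppi_win_rate` and the 1/0 verdicts (1 meaning “the candidate beat the baseline”) are hypothetical, and ranking candidates by this corrected rate only mirrors the Chatbot Arena experiment in spirit:

```python
import numpy as np

def ppi_win_rate(judge_all, judge_audited, human_audited):
    """Debiased win rate: the judge-only win rate over all battles,
    corrected by the judge-vs-human disagreement on a small audited
    slice of those same battles."""
    return judge_all.mean() + (human_audited - judge_audited).mean()

# Hypothetical 1/0 verdicts: "did the candidate beat the baseline?"
judge_all     = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # cheap judge, all battles
judge_audited = np.array([1, 1, 0, 1])              # judge on the audited slice
human_audited = np.array([1, 0, 0, 1])              # humans on the same slice

rate = ppi_win_rate(judge_all, judge_audited, human_audited)
# judge-only rate 0.75, human-measured bias -0.25 -> corrected rate 0.50
```

Computing this corrected rate for each candidate against a shared baseline, then sorting, gives a ranking that uses every judge verdict while staying anchored to the human labels.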
Analysis
Adopt PPI-Powered Autoevaluation:
- AI teams should use PPI-based autoevaluation for model comparisons on various metrics. Existing Python packages facilitate this implementation. This approach enables faster, cost-effective iteration on model development while ensuring statistically valid comparisons.
Combine PPI Autoevaluation with Limited Human Evaluation:
- Use a small human-labeled validation set to measure the bias of synthetic labels, then correct this bias using PPI while evaluating models on the larger synthetic set. This method balances efficiency gains from synthetic data with the grounding of human judgments, providing optimal evaluation results.
Develop High-Quality Annotator Models:
- Invest in high-quality annotator models to ensure the accuracy of autoevaluation methods, as the quality of the annotator model impacts the performance of autoevaluation. Reliable annotator models maximize the benefits of autoevaluation, ensuring accurate model evaluations.
Extend PPI Algorithms for Handling Distribution Outputs:
- Extend PPI algorithms to handle cases where the annotation model outputs a distribution over outcomes rather than a single outcome. This enhances the flexibility and applicability of PPI, allowing AI teams to work with more complex data structures and improve evaluation methods across various AI applications.
Closing Thoughts
In the pursuit of creating more accurate, reliable, and trustworthy AI systems, it is crucial for developers to understand the nature of AI-generated inaccuracies and adopt effective strategies to mitigate them. By distinguishing between “hallucinations” and “bullshit” in LLM outputs, developers can gain a clearer understanding of the types of inaccuracies they are dealing with and devise targeted solutions.
“Hallucination” suggests a false perception: it implies the model misperceives the world but is trying to report it faithfully. Bullshit, in Frankfurt’s sense, involves no such attempt: the text is generated to sound plausible, with no regard for whether it is true. Recognizing this distinction is essential for AI teams to prioritize their efforts in addressing these issues effectively.
One powerful tool that has emerged in recent years is the use of synthetic data for autoevaluation, particularly through the application of Prediction-Powered Inference (PPI). By combining abundant AI-generated labels with a small set of human judgments, PPI lets teams evaluate models at scale while measuring and correcting the biases of the automated judge. Leveraging this approach, practitioners can iterate faster without giving up statistical validity, leading to more reliable and trustworthy AI systems.
Related Content
- Learning from the Past: Comparing the Hype Cycles of Big Data and GenAI
- A Critical Look at Red-Teaming Practices in Generative AI
- Reducing AI Hallucinations: Lessons from Legal AI
- Unraveling the Black Box: Scaling Dictionary Learning for Safer AI Models
- The Art of Forgetting: Demystifying Unlearning in AI Models
If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter:
