Mitigating Hallucinations in LLMs by Enhancing Pre-training Data Quality

Ali Arsanjani
3 min read · Mar 22, 2024

To minimize hallucinations in large language models (LLMs) after training or fine-tuning, it is crucial to attend to the quality of the dataset (its geometry and its content) and to curate the pre-training data carefully. In this article we discuss several approaches to data curation that can help reduce the occurrence of hallucinations.

Manual curation

Yes, the seemingly obvious one: human-in-the-loop. Engaging human annotators to manually review and filter the pre-training data can help identify and remove fabricated, outdated, or biased information. This approach ensures high-quality data but can be time-consuming and resource-intensive, especially given the massive scale of pre-training corpora.

Automated filtering

Developing advanced automated filtering techniques can help identify and remove low-quality or unreliable data from the pre-training corpus. This can involve using machine learning algorithms to detect patterns indicative of misinformation, such as inconsistencies, contradictions, or linguistic anomalies. Automated filtering can be more scalable than manual curation but may not capture all instances of problematic data. Note that detecting misinformation is a non-trivial endeavor and deserves its own focus and attention; see, for example, project Alternusvera.
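A minimal sketch of such a filter, assuming simple hand-written heuristics rather than a trained classifier (the thresholds below are illustrative, not tuned values):

```python
import re

def quality_flags(doc: str) -> list[str]:
    """Return heuristic flags suggesting a document may be low quality.

    The thresholds here are illustrative assumptions, not tuned values.
    """
    flags = []
    words = doc.split()
    if not words:
        return ["empty"]
    # Linguistic anomaly: very low ratio of alphabetic characters.
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    if alpha_ratio < 0.6:
        flags.append("low_alpha_ratio")
    # Repetition: the same sentence appearing many times often signals
    # boilerplate or machine-generated spam.
    sentences = [s.strip().lower() for s in re.split(r"[.!?]", doc) if s.strip()]
    if sentences and len(set(sentences)) / len(sentences) < 0.5:
        flags.append("high_repetition")
    # Formatting anomaly: a large share of all-caps words.
    caps_ratio = sum(w.isupper() for w in words) / len(words)
    if caps_ratio > 0.3:
        flags.append("excessive_caps")
    return flags

def filter_corpus(docs: list[str]) -> list[str]:
    # Keep only documents that raise no flags.
    return [d for d in docs if not quality_flags(d)]
```

A production pipeline would replace these heuristics with learned quality classifiers, but the structure (score each document, drop or route the flagged ones) stays the same.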

Source selection

Prioritizing data from reputable and authoritative sources can help minimize the presence of false or misleading information in the pre-training corpus. This can involve curating data from trusted news outlets, academic publications, government reports, and other verified sources. By focusing on high-quality sources, the likelihood of incorporating misinformation into the model’s knowledge base can be reduced.
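Source selection can be sketched as an allowlist check on the originating domain. The domains below are hypothetical placeholders; a real pipeline would maintain a much larger, reviewed registry of trusted sources:

```python
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only.
TRUSTED_DOMAINS = {"nature.com", "reuters.com", "nih.gov"}

def is_trusted(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    # Accept the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

def select_by_source(records: list[dict]) -> list[dict]:
    """Keep only records whose 'url' field comes from a trusted domain."""
    return [r for r in records if is_trusted(r["url"])]
```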

Temporal filtering

Removing outdated information from the pre-training data is essential to prevent LLMs from generating content based on stale or inaccurate facts. Temporal filtering techniques can be employed to identify and exclude data that is no longer relevant or has been superseded by more recent information. This helps ensure that the model’s knowledge is up to date and aligned with the current state of the world.
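One simple form of temporal filtering, sketched below under the assumption that each record carries a publication date and a topic key, drops records older than a cutoff and keeps only the most recent record per topic (treating newer text as superseding older text on the same topic):

```python
from datetime import date

def temporal_filter(records: list[dict], cutoff: date) -> list[dict]:
    """Drop records published before the cutoff; when several records
    share a topic key, keep only the most recent one."""
    recent = [r for r in records if r["published"] >= cutoff]
    latest: dict[str, dict] = {}
    for r in recent:
        key = r["topic"]
        if key not in latest or r["published"] > latest[key]["published"]:
            latest[key] = r
    return list(latest.values())
```

The "topic" key is an assumption for illustration; in practice supersession is harder to detect and may rely on clustering or entity linking.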

Bias mitigation

Addressing biases present in the pre-training data is crucial to prevent LLMs from perpetuating stereotypes, prejudices, or discriminatory views. This can involve applying bias detection algorithms to identify and remove or rebalance biased content, and ensuring that diverse perspectives and fair representation of different demographic groups are incorporated in the pre-training data.
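One rebalancing strategy, sketched here under the assumption that each record is already labeled with the group it represents, is to downsample every group to the size of the smallest one so no single group dominates the corpus:

```python
import random
from collections import defaultdict

def rebalance(records: list[dict], group_key: str, seed: int = 0) -> list[dict]:
    """Downsample each group to the size of the smallest group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)
    target = min(len(v) for v in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducible curation runs
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))
    return balanced
```

Downsampling discards data; upweighting or targeted collection for underrepresented groups are alternatives when corpus size matters.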

Fact-checking and verification

Integrating fact-checking mechanisms into the data curation process can help identify and remove false or unverified information. This can involve cross-referencing data against trusted knowledge bases, employing fact-checking algorithms, or leveraging human fact-checkers to validate the accuracy of the pre-training data.
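The cross-referencing step can be sketched as a lookup against a trusted store. The toy knowledge base and its (subject, relation) schema below are illustrative assumptions; a real system would query a curated resource such as Wikidata:

```python
# Toy knowledge base mapping (subject, relation) to a verified value.
KNOWLEDGE_BASE = {
    ("water", "boiling_point_c"): "100",
    ("earth", "moons"): "1",
}

def verify_claim(subject: str, relation: str, value: str) -> str:
    """Return 'supported', 'contradicted', or 'unverified'."""
    known = KNOWLEDGE_BASE.get((subject, relation))
    if known is None:
        return "unverified"
    return "supported" if known == value else "contradicted"

def triage_claims(claims: list[tuple]) -> tuple[list, list]:
    # Keep supported claims; route unverified ones to human fact-checkers
    # rather than silently keeping or discarding them.
    kept, needs_review = [], []
    for c in claims:
        verdict = verify_claim(*c)
        if verdict == "supported":
            kept.append(c)
        elif verdict == "unverified":
            needs_review.append(c)
    return kept, needs_review
```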

Continuous monitoring and updating

Regularly monitoring the pre-training data for emerging misinformation, outdated facts, or biases is essential to maintain the quality of the LLM’s knowledge base. Implementing a continuous data curation process that involves periodic updates and refinements can help keep the model aligned with the latest information and reduce the risk of hallucinations.

Collaboration with domain experts

Engaging domain experts in the data curation process can provide valuable insights and help identify domain-specific misinformation or biases. Collaborating with experts in various fields, such as science, healthcare, finance, and social sciences, can enhance the accuracy and reliability of the pre-training data.

Implementing a combination of these data curation techniques to identify and weed out false, outdated, or biased information in the pre-training data will prove fruitful down the line during generation and inference. This process reduces the likelihood of LLMs generating hallucinations rooted in overtly problematic data.

It is important to note that enriching data quality through a variety of curation strategies is just one mitigation, and it should be used in combination with others, such as retrieval augmentation, fine-tuning, factual grounding, and continuous monitoring. Curation alone will not completely eliminate hallucinations, as the probabilistic, autoregressive nature of LLM architectures can still induce errors or inconsistencies during inference. Still, starting at the top of the funnel with a curation strategy that produces a high-quality, well-curated pre-training or fine-tuning dataset will increase the reliability and trustworthiness of your resulting model's outputs.



Ali Arsanjani

Director Google, AI/ML & GenAI| EX: WW Tech Leader, Chief Principal AI/ML Solution Architect, AWS | IBM Distinguished Engineer and CTO Analytics & ML