A dataset of millions of news articles labeled by credibility type for training fake news detection algorithms.
Fake News Corpus is an open-source dataset of millions of news articles labeled by credibility type, created by scraping domains from OpenSources.co and supplementing with reliable sources. It's designed specifically for training deep learning algorithms to recognize fake news and other types of unreliable content. The dataset includes 11 categories ranging from fake news and satire to credible sources, formatted as CSV with rich metadata fields.
Machine learning researchers and data scientists working on fake news detection, natural language processing, and credibility analysis algorithms. Academic institutions and organizations developing content moderation or media literacy tools.
Provides a large-scale, pre-labeled dataset specifically curated for fake news detection research, saving researchers the immense effort of collecting and categorizing news articles manually. The inclusion of balanced classes and multiple credibility types makes it more suitable for training robust classification models than smaller or less diverse datasets.
Contains 9.4 million articles from 745 domains (per the project README), giving deep learning models ample training data.
Labels each article with one of 11 types, such as fake, satire, bias, and reliable, drawn from OpenSources.co's taxonomy, offering a broad spectrum of classes for classification tasks.
Supplements the unreliable domains with credible articles from The New York Times and WebHose, improving class balance for more robust model training.
Formatted as CSV with fields like content, authors, keywords, and summary, facilitating easy integration into machine learning pipelines.
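Because the corpus ships as a flat CSV with a per-article type label, loading it for a binary fake-vs-reliable classifier is mostly a matter of streaming rows and collapsing the 11 types into two classes. The sketch below uses only the standard library; the column names (`type`, `content`) follow the corpus's documented schema, but the two-row sample and the exact grouping of types into "fake-like" versus "credible" are illustrative choices, not part of the dataset.

```python
import csv
import io

# Hypothetical two-row sample mimicking the corpus's CSV layout; the real
# file has many more metadata columns (authors, keywords, summary, etc.).
SAMPLE = """type,title,content,authors,keywords,summary
fake,Example headline,Body text A,Jane Doe,politics,Short summary A
reliable,Another headline,Body text B,John Roe,science,Short summary B
"""

# One illustrative way to collapse the 11 credibility types into a binary
# label; other groupings are equally defensible depending on the task.
FAKE_LIKE = {"fake", "satire", "bias", "conspiracy", "junksci",
             "hate", "clickbait", "unreliable"}
CREDIBLE = {"reliable", "political"}

def load_binary_examples(fp):
    """Yield (text, label) pairs, skipping rows with unmapped types."""
    for row in csv.DictReader(fp):
        t = row["type"]
        if t in FAKE_LIKE:
            yield row["content"], 1
        elif t in CREDIBLE:
            yield row["content"], 0

examples = list(load_binary_examples(io.StringIO(SAMPLE)))
print(examples)  # [('Body text A', 1), ('Body text B', 0)]
```

Streaming with `csv.DictReader` rather than loading the whole file matters here: at 9.4 million articles the corpus will not fit comfortably in memory on most machines.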
Labels are assigned per domain rather than per article, with no manual filtering, so individual articles can be mislabeled and skew model performance, a limitation the creator acknowledges.
The creator does not plan to update the corpus after finalization, so it will age quickly for applications that need recent news.
Only about 80% of the data has been cleaned and published, and some URLs may not point to actual articles, complicating reliable analysis.
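The domain-level labeling above has a practical consequence for evaluation: if articles from the same domain appear in both training and test sets, a model can score well just by memorizing domain-specific artifacts. One common mitigation is to hold out whole domains. The sketch below, using only the standard library, shows that idea; the row structure and the `domain` key are assumptions for illustration, not the corpus's exact schema.

```python
import random

def split_by_domain(rows, test_frac=0.2, seed=0):
    """Hold out entire domains so the test set shares no domains with training."""
    domains = sorted({r["domain"] for r in rows})
    rng = random.Random(seed)
    rng.shuffle(domains)
    n_test = max(1, int(len(domains) * test_frac))
    test_domains = set(domains[:n_test])
    train = [r for r in rows if r["domain"] not in test_domains]
    test = [r for r in rows if r["domain"] in test_domains]
    return train, test

# Toy data: two articles from each of five hypothetical domains.
rows = [{"domain": d, "content": f"article from {d}"}
        for d in ["a.com", "b.com", "c.com", "d.com", "e.com"]
        for _ in range(2)]
train, test = split_by_domain(rows)
```

A domain-disjoint split gives a more honest estimate of how a detector generalizes to unseen outlets, which is usually the deployment scenario.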