A large collection of real-world system log datasets for AI-driven log analytics research.
Loghub is a large collection of system log datasets curated for AI-driven log analytics research. It provides real-world, often unsanitized logs from systems such as Hadoop, Spark, Windows, and Android to enable realistic experimentation. The project addresses the need for high-quality, diverse log data to advance research in log parsing, anomaly detection, and related analytics tasks.
Loghub is aimed at researchers and academics working on AI-driven log analytics, anomaly detection, log parsing, and system reliability. It is also valuable to industry practitioners benchmarking log analysis techniques.
Loghub offers a unique, centralized repository of authentic log datasets that are freely accessible and minimally modified, providing a realistic foundation for research. Its diverse sources and labeled subsets make it a go-to benchmark for the log analytics community.
A large collection of system log datasets for AI-driven log analytics [ISSRE'23]
Includes logs from distributed systems, supercomputers, operating systems, mobile systems, and server applications (such as Hadoop, Spark, Windows, and Android), as detailed in the README table, enabling cross-domain research.
Logs are minimally modified and often unsanitized to preserve real-world characteristics, as stated in the philosophy, providing a realistic foundation for meaningful log analytics advancements.
Several datasets like HDFS_v1, BGL, and Thunderbird come with labels, directly supporting supervised tasks such as anomaly detection without the need for manual annotation.
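For the labeled datasets, supervised use typically means joining raw log lines against the label file. A minimal sketch in the style of HDFS_v1, which ships a CSV mapping block IDs to Normal/Anomaly labels (the sample lines, column names, and helper below are illustrative assumptions, not the exact Loghub file layout):

```python
import csv
import io
import re

# Sample lines in the style of HDFS_v1 logs (assumed format; real files
# come from the Zenodo links in the Loghub README).
SAMPLE_LOG = """\
081109 203518 143 INFO dfs.DataNode$DataXceiver: Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010
081109 203807 222 INFO dfs.DataNode$PacketResponder: PacketResponder 0 for block blk_-6952295868487656571 terminating
"""

# HDFS_v1 labels anomalies per block ID; these column names are an
# assumption based on that scheme.
SAMPLE_LABELS = """\
BlockId,Label
blk_-1608999687919862906,Normal
blk_-6952295868487656571,Anomaly
"""

BLOCK_RE = re.compile(r"(blk_-?\d+)")

def label_lines(log_text, label_csv):
    """Attach each raw log line to its block's anomaly label (or None)."""
    labels = {row["BlockId"]: row["Label"]
              for row in csv.DictReader(io.StringIO(label_csv))}
    for line in log_text.splitlines():
        match = BLOCK_RE.search(line)
        yield line, (labels.get(match.group(1)) if match else None)

pairs = list(label_lines(SAMPLE_LOG, SAMPLE_LABELS))
```

Grouping lines by block ID before classification (rather than labeling single lines) is the usual next step for HDFS-style session-level anomaly detection.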
With datasets ranging from megabytes to gigabytes and millions to billions of log lines, all freely available via Zenodo links, it facilitates extensive academic research without financial barriers.
The unsanitized logs may contain sensitive information, making them unsuitable for projects requiring data privacy compliance or public sharing, and posing ethical handling challenges.
Not all datasets are labeled: Hadoop's labels, for example, must be tracked down via issue #56, and many datasets such as Spark are unlabeled, so supervised learning tasks require additional manual annotation effort.
Datasets are provided as static, large file downloads (e.g., up to 29.6GB for Thunderbird), not as live streams, limiting research on dynamic log analysis or real-time system monitoring.
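One common workaround for the lack of live streams is to replay a static dump at a fixed rate to approximate a real-time feed. A minimal sketch (the helper name and rate parameter are illustrative, not part of Loghub):

```python
import time

def replay(lines, rate_hz=2.0):
    """Yield pre-recorded log lines at a fixed rate, approximating a live feed."""
    delay = 1.0 / rate_hz
    for line in lines:
        yield line
        time.sleep(delay)

# Replay three sample lines quickly; a real experiment would read the
# downloaded dataset file line by line instead.
sample = ["log line %d" % i for i in range(3)]
streamed = list(replay(sample, rate_hz=1000.0))
```

For more faithful replay, the inter-line delay can be derived from the timestamps embedded in the logs themselves rather than a constant rate.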