A resource and evaluation framework for benchmarking link prediction models on large-scale, heterogeneous biomedical knowledge graphs.
OpenBioLink is a resource and evaluation framework designed for benchmarking link prediction models on heterogeneous biomedical knowledge graphs. It provides large-scale, openly available datasets and standardized tools to create custom benchmarks and evaluate model performance, addressing the need for reproducible and fair comparisons in biomedical graph machine learning.
OpenBioLink is aimed at researchers and data scientists working on biomedical knowledge graph completion, link prediction, and graph machine learning, particularly those who need standardized benchmarks for model evaluation.
Developers choose OpenBioLink for its comprehensive, large-scale benchmark datasets that minimize information leakage, include true negative edges, and support heterogeneous graphs, along with its open-source evaluation framework that ensures reproducibility and comparability across studies.
Provides four dataset variants, each with over 5 million edges, including directed, undirected, and quality-filtered options, so models can be evaluated under different graph assumptions.
Post-processes test sets to remove trivially inferable relations, such as reverse edges and super-properties, which makes the benchmark more realistic and more challenging.
Incorporates explicitly stated true negative edges for certain relation types, reducing noise compared to random sampling and improving evaluation accuracy.
Offers GUI and CLI tools for generating custom graphs and performing train-test splits, supporting tailored datasets and research scenarios.
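The leakage-removal idea mentioned above can be illustrated with a minimal sketch. This is not OpenBioLink's actual implementation or API. The function name `drop_reverse_leaks`, the triple layout, and the node and relation labels are all hypothetical, and the sketch handles only the reverse-edge case for relations assumed to be symmetric.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical triple layout: (head, relation, tail)
Triple = Tuple[str, str, str]


def drop_reverse_leaks(
    train: Iterable[Triple],
    test: Iterable[Triple],
    symmetric_relations: Set[str],
) -> List[Triple]:
    """Keep only test triples that cannot be read off the training set
    by flipping a training edge of a symmetric relation."""
    train_set = set(train)
    kept = []
    for head, rel, tail in test:
        # A test edge (h, r, t) is trivially inferable if r is symmetric
        # and the reversed edge (t, r, h) already appears in training.
        if rel in symmetric_relations and (tail, rel, head) in train_set:
            continue
        kept.append((head, rel, tail))
    return kept


# Toy data with made-up identifiers, for illustration only.
train = [
    ("GENE:A", "GENE_GENE", "GENE:B"),
    ("DRUG:X", "DRUG_BINDS_GENE", "GENE:A"),
]
test = [
    ("GENE:B", "GENE_GENE", "GENE:A"),        # reverse of a training edge
    ("DRUG:X", "DRUG_BINDS_GENE", "GENE:B"),  # not inferable by reversal
]

filtered = drop_reverse_leaks(train, test, {"GENE_GENE"})
print(filtered)
```

In this toy example only the second test triple survives, because the first is the mirror image of a training edge. Removing super-properties would need an additional relation hierarchy, which this sketch does not model.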
Requires manual setup of virtual environments, specific Python versions (e.g., only Python 3.6 on Windows), and separate PyTorch installation, which can be time-consuming and error-prone.
Datasets aggregate data from multiple external sources with varying licenses (e.g., CC BY-NC-SA, custom terms), creating legal complexities for redistribution and commercial use.
The README acknowledges documentation gaps: for time-slice splits, for example, it states only that 'more documentation will be provided later,' which may hinder advanced usage and adoption.