A spaCy pipeline and models specifically designed for processing scientific and biomedical documents.
SciSpaCy is a natural language processing library built on the spaCy framework for scientific and biomedical documents. It provides custom pipelines, tokenizers, and models trained on biomedical data to handle the linguistic characteristics of technical literature. The project addresses the poor performance of general-purpose NLP models when they are applied to specialized scientific domains.
SciSpaCy is aimed at researchers, data scientists, and developers who work with biomedical literature, clinical text, or scientific documents and need accurate NLP capabilities such as entity recognition, entity linking, and abbreviation detection.
Developers choose SciSpaCy because it offers domain-specific models that significantly outperform general-purpose NLP tools on scientific text, includes ready-to-use components for common biomedical NLP tasks, and integrates seamlessly with the popular spaCy ecosystem.
A full spaCy pipeline and models for scientific/biomedical documents.
Models are trained on biomedical corpora such as CRAFT and BC5CDR, which addresses the performance drop that general-purpose models show on scientific text.
Builds directly on spaCy, allowing seamless use of existing spaCy APIs and tools, with custom components like abbreviation detection and entity linking that extend the pipeline.
Includes unique components such as abbreviation resolution using the Schwartz & Hearst algorithm and entity linking to ontologies like UMLS and MeSH, tailored for technical literature.
Can be extended to new databases and ontologies using external pyobo integration, as shown in the README example for linking genes to HGNC identifiers.
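To make the abbreviation-resolution highlight concrete, here is a minimal pure-Python sketch of the core matching step of the Schwartz & Hearst algorithm. It is a simplified illustration, not SciSpaCy's own code: the function names `best_long_form` and `find_abbreviations` are hypothetical, only the "long form (SHORT)" pattern is handled, and several validity checks from the original algorithm are omitted.

```python
import re


def best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.

    The first character of the short form must start a word in the long form.
    Returns the matched long form, or None if no match exists.
    """
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Walk left until the character matches; for the first short-form
        # character, also require that the match sits at a word start.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Snap the start of the long form to the beginning of its first word.
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def find_abbreviations(sentence):
    """Return {short_form: long_form} for 'long form (SHORT)' patterns."""
    pairs = {}
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        # Simplified validity check (assumption): 2-10 characters,
        # starts with a letter or digit, contains at least one letter.
        if not (2 <= len(short_form) <= 10):
            continue
        if not short_form[0].isalnum():
            continue
        if not any(c.isalpha() for c in short_form):
            continue
        words = sentence[: match.start()].split()
        # Search window: min(|SF| + 5, 2 * |SF|) words before the parenthesis.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short_form, candidate)
        if long_form:
            pairs[short_form] = long_form
    return pairs


text = (
    "Spinal muscular atrophy (SMA) is caused by mutations in the "
    "survival motor neuron 1 (SMN1) gene."
)
abbreviations = find_abbreviations(text)
print(abbreviations)
```

In SciSpaCy itself this logic is packaged as a pipeline component, so the sketch is only meant to show why the approach needs no training data: it relies purely on character order and word boundaries.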
Requires separate downloads for large models (e.g., the UMLS linker is ~1 GB), each of which must match the installed SciSpaCy version. The online demo also runs an older version, so it may not reflect current capabilities.
Focused exclusively on English biomedical text; other languages and non-scientific domains are not supported, which restricts its applicability.
The EntityLinker uses basic string overlap matching on character 3-grams, which may offer lower precision than more sophisticated embedding-based approaches, as the configuration notes acknowledge.
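The precision concern with character 3-gram matching can be illustrated with a small, self-contained sketch. This is not the EntityLinker's implementation: it assumes plain Jaccard overlap between 3-gram sets, and the knowledge base `KB`, its identifiers, and the function names are hypothetical placeholders.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def trigram_similarity(a, b):
    """Jaccard overlap between the character 3-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def link(mention, kb, threshold=0.5):
    """Return the id of the best-matching concept, or None below the threshold."""
    best_id, best_score = None, 0.0
    for concept_id, aliases in kb.items():
        for alias in aliases:
            score = trigram_similarity(mention, alias)
            if score > best_score:
                best_id, best_score = concept_id, score
    return best_id if best_score >= threshold else None


# Hypothetical toy knowledge base; the ids are placeholders, not real ontology ids.
KB = {
    "KB:0001": ["spinal muscular atrophy"],
    "KB:0002": ["myocardial infarction"],
}

print(link("spinal muscular atrophies", KB))  # surface variant of a known alias
print(link("SMA", KB))                        # abbreviation with no shared 3-grams
```

Surface variants of a known alias score well under this scheme, but a mention that shares no characters with any alias, such as an unexpanded abbreviation, cannot be linked at all. An embedding-based matcher would not depend on shared spelling in the same way.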