BioGPT

License: MIT · Language: Python

A domain-specific generative language model pre-trained on biomedical literature for text generation and mining tasks.

What is BioGPT?

BioGPT is a generative pre-trained transformer model specifically designed for biomedical text generation and mining. It is trained on large-scale biomedical literature to understand and generate domain-specific text, enabling tasks like relation extraction, question answering, and document classification in the biomedical domain.

Target Audience

Researchers, data scientists, and developers working in biomedical natural language processing, healthcare AI, and life sciences who need domain-specific language models for text analysis and generation.

Value Proposition

BioGPT is a specialized model that, thanks to its domain-specific pre-training, is reported to outperform comparable general-domain language models on biomedical tasks. It is openly available and integrates with popular frameworks such as Hugging Face transformers for easy adoption.

Overview

BioGPT draws on large-scale biomedical literature to understand and generate domain-specific text, supporting natural language processing applications in healthcare and the life sciences.

Key Features

  • Biomedical Pre-training — Trained on PubMed abstracts and articles for domain-specific language understanding.
  • Text Generation — Generates coherent biomedical text, such as research summaries or hypothesis descriptions.
  • Relation Extraction — Identifies relationships between biomedical entities like drug-target interactions.
  • Question Answering — Answers biomedical questions based on contextual knowledge from literature.
  • Document Classification — Classifies biomedical documents into relevant categories.
  • Hugging Face Integration — Available through the transformers library for easy deployment and experimentation.
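A minimal sketch of text generation through the Hugging Face route mentioned in the list above. It assumes the `transformers` package is installed (along with `torch` and `sacremoses`, which the BioGPT tokenizer relies on) and that the checkpoint is published under the Hub ID `microsoft/biogpt`. The helper names `load_generator` and `generate` are illustrative and not part of any library.

```python
def load_generator(model_name="microsoft/biogpt", seed=42):
    """Build a text-generation pipeline (downloads weights on first use)."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import pipeline, set_seed

    set_seed(seed)  # make sampling reproducible
    return pipeline("text-generation", model=model_name)


def generate(prompt, generator=None, max_length=20, num_sequences=5):
    """Return sampled continuations of `prompt` as a list of strings."""
    if generator is None:
        generator = load_generator()
    outputs = generator(
        prompt,
        max_length=max_length,
        num_return_sequences=num_sequences,
        do_sample=True,  # sample instead of greedy decoding
    )
    # The pipeline returns one dict per sequence with a "generated_text" key.
    return [item["generated_text"] for item in outputs]
```

Calling `generate("COVID-19 is")` would download the checkpoint on first use and return five sampled continuations. Passing a prebuilt pipeline as `generator` avoids reloading the model on every call.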

Philosophy

BioGPT focuses on bridging the gap between general-purpose language models and domain-specific needs by providing a model that understands the nuances and terminology of biomedical literature.

Use Cases

Best For

  • Extracting drug-target interactions from biomedical literature
  • Answering biomedical questions based on PubMed abstracts
  • Generating summaries or hypotheses for biomedical research
  • Classifying biomedical documents into predefined categories
  • Identifying relationships between chemical and disease entities
  • Building biomedical NLP applications with pre-trained domain knowledge
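Because BioGPT is generative, extraction tasks such as the drug-target use case above typically end with parsing generated text back into structured triples. The sketch below illustrates that post-processing step only. The sentence template ("the interaction between X and Y is Z") is a hypothetical format chosen for illustration, so check the actual output of whichever fine-tuned checkpoint you use. `parse_triples` is likewise an illustrative name.

```python
import re

# Hypothetical output template; real checkpoints may phrase triples differently.
TRIPLE_PATTERN = re.compile(
    r"the interaction between (?P<drug>.+?) and (?P<target>.+?) "
    r"is (?P<interaction>.+?)(?:[.;]|$)",
    re.IGNORECASE,
)


def parse_triples(generated):
    """Extract (drug, target, interaction) tuples from generated text.

    Entity names that themselves contain " and " or " is " will be split
    incorrectly by this simple pattern.
    """
    return [
        (m["drug"].strip(), m["target"].strip(), m["interaction"].strip())
        for m in TRIPLE_PATTERN.finditer(generated)
    ]
```

A regular expression keeps the sketch short. A production pipeline would more likely validate the extracted names against an entity vocabulary.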

Not Ideal For

  • Projects requiring general-purpose language understanding across multiple non-biomedical domains
  • Real-time applications that need low-latency inference, which the model's size and computational demands make hard to achieve
  • Teams with limited machine learning infrastructure for handling large model deployments and complex dependency setups
  • Environments with strict resource constraints, such as mobile or edge devices without GPU support

Pros & Cons

Pros

Domain-Specific Pre-training

Pre-trained on PubMed abstracts and articles, BioGPT is reported to outperform general-domain models on biomedical tasks such as relation extraction and question answering. Fine-tuned checkpoints and demos are published for these tasks.

Task-Ready Fine-Tuning

Provides fine-tuned checkpoints for key downstream tasks such as drug-target interaction extraction and document classification, reducing development time and effort.

Hugging Face Integration

Available through the transformers library with pipelines for easy text generation and feature extraction, as shown in the README with code examples for causal language modeling.
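For the feature-extraction side, a hedged sketch that returns per-token hidden states. It assumes the `BioGptTokenizer` and `BioGptModel` classes from `transformers` (with `torch` and `sacremoses` installed) and the Hub ID `microsoft/biogpt`. The function names `load_encoder` and `last_hidden_state` are illustrative.

```python
def load_encoder(model_name="microsoft/biogpt"):
    """Load tokenizer and base model (downloads weights on first use)."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import BioGptModel, BioGptTokenizer

    tokenizer = BioGptTokenizer.from_pretrained(model_name)
    model = BioGptModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features
    # Freeze parameters so the forward pass does not build an autograd graph.
    model.requires_grad_(False)
    return tokenizer, model


def last_hidden_state(text, tokenizer, model):
    """Return hidden states shaped (batch, tokens, hidden_size) for `text`."""
    encoded = tokenizer(text, return_tensors="pt")
    outputs = model(**encoded)
    return outputs.last_hidden_state
```

Typical use would be `tokenizer, model = load_encoder()` followed by `last_hidden_state("Aspirin inhibits PTGS2.", tokenizer, model)`. Averaging over the token dimension is one common way to turn the result into a single sentence vector.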

Open and Accessible

MIT-licensed with models hosted on GitHub and Hugging Face, promoting reproducibility and adoption in academic and industrial settings.

Cons

Complex Installation Process

Requires manually installing pinned versions of PyTorch (1.12.0) and fairseq (0.12.0), plus tools such as Moses and fastBPE, and configuring environment variables. This increases setup time and the potential for errors.
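A small pre-flight check can catch most of these setup mistakes before a long run fails. The sketch below compares installed package versions against the pins named above and verifies that the tool directories exist. The environment variable names `MOSES` and `FASTBPE` are assumptions, so adjust them to whatever the installation instructions you follow actually specify. `check_environment` is an illustrative helper, not part of BioGPT.

```python
import os
from importlib import metadata

PINNED = {"torch": "1.12.0", "fairseq": "0.12.0"}
TOOL_ENV_VARS = ("MOSES", "FASTBPE")  # assumed variable names


def check_environment(pinned=PINNED, env_vars=TOOL_ENV_VARS,
                      environ=None, get_version=metadata.version):
    """Return a list of human-readable problems (empty if all checks pass)."""
    environ = os.environ if environ is None else environ
    problems = []
    for package, wanted in pinned.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (expected {wanted})")
            continue
        # Local build tags such as "1.12.0+cu113" still count as a match.
        if found.split("+")[0] != wanted:
            problems.append(f"{package} is {found}, expected {wanted}")
    for name in env_vars:
        path = environ.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems
```

Running `print(check_environment())` in the target environment lists anything that needs fixing. An empty list means the pins and tool paths look consistent.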

Limited to Biomedical Domain

Specialized training means it underperforms on non-biomedical text without additional fine-tuning, limiting its versatility for broader NLP applications.

High Resource Demands

Models like BioGPT-Large have significant computational and memory requirements, making them unsuitable for low-resource deployments without high-end GPUs.
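A back-of-the-envelope estimate makes the memory requirement concrete. The sketch below computes the space needed just to hold the weights. The parameter counts (about 347 million for BioGPT and about 1.5 billion for BioGPT-Large) are approximate figures assumed here rather than taken from this page, and activations, caches, and optimizer state add to the total.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Approximate parameter counts; treat these as assumptions.
APPROX_PARAMS = {"BioGPT": 347e6, "BioGPT-Large": 1.5e9}


def weight_memory_gib(n_params, dtype="fp32"):
    """Memory needed to store the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


def summarize():
    """Return {model: {dtype: GiB}} for the assumed parameter counts."""
    return {
        name: {dtype: round(weight_memory_gib(n, dtype), 2)
               for dtype in BYTES_PER_PARAM}
        for name, n in APPROX_PARAMS.items()
    }
```

Under these assumptions, the BioGPT-Large weights alone take roughly 5.6 GiB in 32-bit precision and about half that in 16-bit precision, before any inference overhead.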

Outdated Dependencies

Relies on older library versions (e.g., PyTorch 1.12.0), which may cause compatibility issues with newer systems and frameworks, requiring careful environment management.

Quick Stats

Stars: 4,489
Forks: 482
Contributors: 0
Open Issues: 65
Last commit: 1 year ago
Created: 2022

Tags

#relation-extraction #transformer #biomedical-nlp #large-language-model #text-generation #question-answering #document-classification #huggingface #pytorch #healthcare-ai

Built With

  • scikit-learn
  • Python
  • PyTorch

Included in

  • Biomedical Information Extraction (425)
  • Computational Biology (122)

Related Projects

ClawBio

🦖 ClawBio - The first bioinformatics-native AI agent skill library. Local-first. Reproducible. Built on OpenClaw.

Stars: 929
Forks: 192
Last commit: 1 day ago

GeneGPT

Code and data for GeneGPT.

Stars: 427
Forks: 34
Last commit: 1 year ago

GenePT

GenePT is a foundation model for single-cell biology that leverages ChatGPT embeddings of NCBI gene descriptions to perform gene-level and cell-level tasks. It offers an efficient alternative to traditional models that require extensive data curation and resource-intensive training from gene expression profiles.

Key features of GenePT:

  • Gene Embeddings: Uses GPT-3.5 embeddings of NCBI gene summary texts to represent genes.
  • Cell Embeddings: Generates single-cell embeddings by averaging gene embeddings weighted by expression or creating sentence embeddings from ordered gene names.
  • Efficient Approach: Eliminates the need for dataset curation and additional pre-training, making it user-friendly.
  • Competitive Performance: Achieves comparable or superior performance to existing single-cell foundation models in tasks like gene property classification and cell type annotation.
  • Pre-computed Data: Provides readily available datasets including extracted NCBI gene summaries and pre-computed OpenAI embeddings.

Philosophy of GenePT: GenePT demonstrates that using large language model embeddings of scientific literature is a straightforward and effective approach for developing biological foundation models, complementing traditional expression-based methods.

Stars: 318
Forks: 47
Last commit: 2 years ago

MolT5

Associated Repository for "Translation between Molecules and Natural Language"

Stars: 194
Forks: 22
Last commit: 2 years ago