How does GeneGPT compare to BioGPT for biomedical questions?

GeneGPT significantly outperforms BioGPT on GeneTuring tasks, with an average score of 0.83 vs. BioGPT's 0.04, due to its API integration for accurate data retrieval. However, BioGPT might be better for generative tasks without tool use, as GeneGPT focuses on retrieval-augmented responses.

How to install and run GeneGPT locally?

Install Python 3.9.13, run 'pip install -r requirements.txt', add your OpenAI API key to config.py, and execute commands like 'python main.py 111111' for full GeneGPT or 'python main.py 001001' for the slim version, as detailed in the README.
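The six-character argument in those commands looks like a bitmask. The snippet below is a minimal sketch of reading such a mask, assuming each character switches one prompt component on or off. The function names are hypothetical and the positions carry no meaning beyond their index; the README remains the reference for what each position controls.

```python
def parse_mask(mask: str) -> list[bool]:
    """Turn a string such as '001001' into six on/off flags."""
    if len(mask) != 6 or set(mask) - {"0", "1"}:
        raise ValueError(f"expected six characters of 0/1, got {mask!r}")
    return [ch == "1" for ch in mask]


def enabled_components(mask: str) -> list[int]:
    """Return the zero-based positions that are switched on (assumed meaning)."""
    return [i for i, on in enumerate(parse_mask(mask)) if on]
```

Under this reading, '111111' enables all six positions and '001001' enables only the third and sixth.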

Is GeneGPT free to use for biomedical research?

The code is open-source, but it requires an OpenAI API key which incurs costs based on usage. Additionally, NCBI APIs may have usage limits, so check their policies for large-scale applications.
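For large batches, one way to stay within a request limit is to space calls out on the client side. The sketch below is not part of GeneGPT; the `Throttle` class is hypothetical. NCBI has documented a limit of about 3 E-utilities requests per second without an API key and 10 with one, but confirm the current policy before relying on those numbers.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each NCBI request, with `throttle = Throttle(3)`, would keep a single process at or below three requests per second.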

Can GeneGPT handle real-time queries in a production environment?

It can process queries, but performance depends on API call latency and OpenAI's response times, making it less suitable for high-throughput or real-time applications without significant optimization and caching.
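The caching mentioned above could be as simple as remembering responses by URL. This is an illustrative sketch only: `cached` and the `fetch` callable are hypothetical names, and GeneGPT itself is not stated to include such a layer.

```python
from typing import Callable


def cached(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a URL fetcher so each distinct URL is requested only once."""
    store: dict[str, str] = {}

    def wrapper(url: str) -> str:
        if url not in store:
            store[url] = fetch(url)  # only call the real fetcher on a cache miss
        return store[url]

    return wrapper
```

An in-memory cache like this never expires, so entries can go stale as NCBI records are updated; a production setup would likely need an expiry policy.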

What types of biomedical questions can GeneGPT answer?

It performs best on GeneTuring tasks such as gene alias lookup, gene-disease association, and SNP location. It may struggle with tasks that need non-NCBI data and with some alignment tasks: human genome DNA alignment, for example, scores 0.44 while other tasks reach perfect scores.

How to extend GeneGPT to use other tools or APIs?

The current implementation is tightly coupled with NCBI APIs; extending it would require modifying the code to integrate new tool demonstrations and API call handling, which is non-trivial and not documented.

GeneGPT

License: NOASSERTION · Language: Python

A tool-augmented LLM that uses NCBI Web APIs to answer biomedical questions with high accuracy and reduced hallucinations.

What is GeneGPT?

GeneGPT is a tool-augmented large language model specifically designed for biomedical information retrieval. It enhances LLMs' ability to answer specialized biomedical questions by teaching them to use NCBI Web APIs, significantly reducing hallucinations and improving accuracy compared to general-purpose models. The system achieves state-of-the-art performance on biomedical question-answering tasks through in-context learning and a novel API call execution algorithm.
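The idea of interleaving generation with API execution can be sketched as a small loop. Everything here is an assumption made for illustration: the `->` call marker, the bracketed-URL convention, and the `generate` and `fetch` callables are hypothetical stand-ins, not the repository's actual interface.

```python
import re
from typing import Callable

# Assumed convention: the model writes an API call as "[URL]->".
URL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]")


def answer(question: str,
           generate: Callable[[str], str],
           fetch: Callable[[str], str],
           max_calls: int = 5) -> str:
    """Alternate between model generation and API execution."""
    text = question
    for _ in range(max_calls):
        chunk = generate(text)
        text += chunk
        if not chunk.rstrip().endswith("->"):
            break  # the model finished without requesting another call
        urls = URL_PATTERN.findall(chunk)
        if not urls:
            break  # call marker without a URL: nothing to execute
        # Append the API result so the next generation step can read it.
        text += "[" + fetch(urls[-1]) + "]\n"
    return text
```

Because the API result is written back into the prompt, the model's final answer can quote retrieved data instead of relying on what it memorized.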

Target Audience

Bioinformatics researchers, computational biologists, and developers working on biomedical AI applications who need accurate, API-backed answers to specialized biological and genetic questions.

Value Proposition

GeneGPT provides significantly higher accuracy on biomedical tasks than general LLMs or specialized biomedical models by directly integrating with authoritative NCBI databases, offering a reliable solution for information retrieval in a domain where factual correctness is critical.

Overview

Code and data for GeneGPT.

Use Cases

Best For

Answering gene-related questions with verified NCBI data

Related Projects

GitHub: 428 stars · 34 forks · 0 contributors

Biomedical researchers needing accurate API-backed information retrieval

Building specialized AI assistants for biological databases

Reducing hallucinations in biomedical question-answering systems

Multi-hop biomedical queries requiring chained API calls

Benchmarking tool-augmented LLMs in specialized domains

Not Ideal For

General-purpose conversational AI projects not focused on biomedicine
Applications requiring offline or real-time processing without external API calls
Teams without budget for OpenAI API usage, or whose workloads would exceed NCBI usage limits
Projects needing integration with non-NCBI biomedical databases or custom data sources

Pros & Cons

Pros

State-of-the-Art Accuracy

Achieves an average score of 0.83 on GeneTuring tasks, vastly outperforming models like New Bing (0.44) and BioGPT (0.04), as shown in the evaluation results.

Hallucination Reduction

Minimizes incorrect information by executing API calls to NCBI databases, addressing LLM challenges in specialized knowledge areas, as emphasized in the introduction.

Multi-Hop Query Capability

Can handle complex queries requiring chains of API calls, demonstrated by its ability to generalize to longer sequences in multi-hop question answering.
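A two-hop chain against NCBI E-utilities might look like the sketch below: search for an ID, then summarize the first hit. The `esearch.fcgi` and `esummary.fcgi` endpoints and their JSON layout are part of the public E-utilities; the `lookup` helper and the injected `fetch` callable are hypothetical and not taken from the GeneGPT code.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str) -> str:
    """Build a search URL that returns matching record IDs as JSON."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode({"db": db, "term": term, "retmode": "json"})


def esummary_url(db: str, uid: str) -> str:
    """Build a summary URL for one record ID."""
    return f"{EUTILS}/esummary.fcgi?" + urlencode({"db": db, "id": uid, "retmode": "json"})


def lookup(db: str, term: str, fetch: Callable[[str], str]) -> Optional[dict]:
    """Hop 1: search for IDs. Hop 2: summarize the first hit."""
    ids = json.loads(fetch(esearch_url(db, term)))["esearchresult"]["idlist"]
    if not ids:
        return None  # nothing matched, so there is no second hop
    summary = json.loads(fetch(esummary_url(db, ids[0])))
    return summary["result"][ids[0]]
```

Passing `fetch` in as a parameter keeps the chain testable without network access and leaves room for throttling or caching around the real HTTP call.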

Authoritative Data Integration

Directly leverages NCBI Web APIs for accessing trusted biomedical databases, ensuring reliable and up-to-date information retrieval.

Cons

OpenAI API Dependency

Requires an OpenAI API key to run with Codex, introducing ongoing costs and vendor lock-in, as specified in the setup instructions.

Limited Data Source Scope

Only integrates with NCBI Web APIs, so it cannot handle queries requiring data from other biomedical databases or custom sources, restricting flexibility.

Inconsistent Performance

Evaluation results show wide variance in accuracy across tasks, from 0.44 for Human genome DNA alignment to perfect scores, indicating potential reliability issues in certain scenarios.

Frequently Asked Questions

BioGPT

BioGPT is a generative pre-trained transformer model specifically designed for biomedical text generation and mining. It leverages large-scale biomedical literature to understand and generate domain-specific text, enabling advanced natural language processing applications in healthcare and life sciences.

## Key Features

- **Biomedical Pre-training**: Trained on PubMed abstracts and articles for domain-specific language understanding.
- **Text Generation**: Generates coherent biomedical text, such as research summaries or hypothesis descriptions.
- **Relation Extraction**: Identifies relationships between biomedical entities like drug-target interactions.
- **Question Answering**: Answers biomedical questions based on contextual knowledge from literature.
- **Document Classification**: Classifies biomedical documents into relevant categories.
- **Hugging Face Integration**: Available through the transformers library for easy deployment and experimentation.

## Philosophy

BioGPT focuses on bridging the gap between general-purpose language models and domain-specific needs by providing a model that understands the nuances and terminology of biomedical literature.

Stars: 4,489

Forks: 481

Last commit: 2 years ago

ClawBio

🦖 ClawBio - The first bioinformatics-native AI agent skill library. Local-first. Reproducible. Open. Free.

Stars: 1,045

Forks: 228

Last commit: 2 days ago

GenePT

GenePT is a foundation model for single-cell biology that leverages ChatGPT embeddings of NCBI gene descriptions to perform gene-level and cell-level tasks. It offers an efficient alternative to traditional models that require extensive data curation and resource-intensive training from gene expression profiles.

## Key Features

- **Gene Embeddings**: Uses GPT-3.5 embeddings of NCBI gene summary texts to represent genes.
- **Cell Embeddings**: Generates single-cell embeddings by averaging gene embeddings weighted by expression or creating sentence embeddings from ordered gene names.
- **Efficient Approach**: Eliminates the need for dataset curation and additional pre-training, making it user-friendly.
- **Competitive Performance**: Achieves comparable or superior performance to existing single-cell foundation models in tasks like gene property classification and cell type annotation.
- **Pre-computed Data**: Provides readily available datasets including extracted NCBI gene summaries and pre-computed OpenAI embeddings.

## Philosophy

GenePT demonstrates that using large language model embeddings of scientific literature is a straightforward and effective approach for developing biological foundation models, complementing traditional expression-based methods.

Stars: 321

Forks: 47

Last commit: 2 years ago

MolT5

Associated Repository for "Translation between Molecules and Natural Language"

Stars: 194

Forks: 20

Last commit: 2 years ago

#biomedical-research

#question-answering

#large-language-models

Computational Biology (122)