A Python library for generating high-quality synthetic tabular data using GANs, diffusion models, and large language models.
TabGAN is a Python library for generating high-quality synthetic tabular data using multiple generative approaches, including GANs, diffusion models, and large language models. It addresses data scarcity, privacy concerns, and the need for realistic data augmentation by producing synthetic datasets that preserve the statistical properties of the original data.
TabGAN is aimed at data scientists, machine learning engineers, and researchers who work with tabular data and need synthetic data for model training, testing, or privacy-preserving data sharing.
Developers choose TabGAN for its unified API that abstracts multiple state-of-the-art generative methods, built-in quality validation and privacy metrics, and seamless integration with popular tools like HuggingFace and scikit-learn, making it a comprehensive and practical solution for synthetic data generation.
GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation. We will review and examine some recent papers about tabular GANs in action.
Switches between GANs, diffusion models, and LLMs via a single parameter, as shown in the quick start with generators like GANGenerator and LLMGenerator.
Includes adversarial filtering with LightGBM and comprehensive HTML quality reports to ensure synthetic data fidelity, demonstrated in the Quality Report section.
Integrates directly with HuggingFace for dataset synthesis and scikit-learn pipelines via TabGANTransformer, making it practical for existing ML workflows.
LLMGenerator enables novel text column generation conditioned on categorical attributes using LLM prompting, as detailed in the conditional text example.
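The adversarial filtering idea mentioned above can be illustrated with a minimal, dependency-free sketch. It assumes one common form of the technique: fit a discriminator to tell real rows from synthetic rows, then keep the synthetic rows it scores as most "real". This is not TabGAN's code. The listing says the library uses LightGBM; here a tiny logistic regression is swapped in so the example runs with the standard library only, and the names `fit_logistic` and `adversarial_filter` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, row):
    # Probability that a row is "real" according to the discriminator.
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def fit_logistic(rows, labels, epochs=300, lr=0.1):
    # Hypothetical stand-in for a LightGBM classifier:
    # plain logistic regression trained with batch gradient descent.
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            err = predict(weights, bias, row) - label
            for j, value in enumerate(row):
                grad_w[j] += err * value
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def adversarial_filter(real_rows, synthetic_rows, keep_fraction=0.5):
    # Label real rows 1 and synthetic rows 0, train the discriminator,
    # then keep the synthetic rows with the highest "real" probability.
    rows = real_rows + synthetic_rows
    labels = [1.0] * len(real_rows) + [0.0] * len(synthetic_rows)
    weights, bias = fit_logistic(rows, labels)
    ranked = sorted(
        synthetic_rows,
        key=lambda row: predict(weights, bias, row),
        reverse=True,
    )
    n_keep = max(1, int(len(synthetic_rows) * keep_fraction))
    return ranked[:n_keep]


# Toy demo: half of the synthetic rows are drawn from a shifted distribution.
random.seed(0)
real = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)]
plausible = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
shifted = [[random.gauss(4, 1), random.gauss(4, 1)] for _ in range(50)]
kept = adversarial_filter(real, plausible + shifted, keep_fraction=0.5)
print(len(kept), "synthetic rows kept out of", len(plausible) + len(shifted))
```

A linear discriminator only catches shifts it can separate with a single hyperplane. A gradient-boosted model such as LightGBM can pick up non-linear differences between real and synthetic rows, which is presumably why a tree-based classifier is used in practice.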
Methods such as Forest Diffusion take up to 45 seconds in benchmarks, and LLMs require GPU resources, which makes the library resource-intensive for large-scale or frequent use.
LLMGenerator relies on external APIs (e.g., OpenAI, LM Studio) or local models, adding configuration steps and potential costs, as noted in the LLM API section.
Time-series support is limited to date preprocessing utilities; the library does not inherently model complex temporal patterns the way autoregressive methods would.
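To make the time-series limitation concrete, here is a minimal sketch of what date preprocessing typically amounts to: expanding a date into per-row calendar features. The function name `expand_date_features` and the chosen feature set are assumptions for illustration, not TabGAN's actual utilities.

```python
from datetime import date


def expand_date_features(iso_dates):
    # Turn ISO date strings (YYYY-MM-DD) into flat calendar features.
    # Each row is processed independently of its neighbors.
    rows = []
    for text in iso_dates:
        d = date.fromisoformat(text)
        rows.append({
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "weekday": d.weekday(),  # Monday == 0, Sunday == 6
            "day_of_year": d.timetuple().tm_yday,
        })
    return rows


features = expand_date_features(["2024-03-01", "2024-03-02"])
print(features[0])
```

Because every row is encoded on its own, a generator trained on these columns can reproduce seasonal or weekday effects but has no notion of ordering, so dependencies between consecutive observations are not captured.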