A C++ library for parallel text file reading with CSV support and Python bindings.
ParaText is a C++ library that reads text files in parallel across multiple CPU cores, designed to accelerate data loading for large datasets. It provides a CSV reader with Python bindings, allowing integration with tools like Pandas for efficient data analysis. The library solves the problem of slow sequential file reading by leveraging multi-core architectures to reduce ingestion times.
Intended for data scientists, engineers, and developers who work with large CSV or text datasets and need faster, more memory-efficient loading, especially in Python environments that use Pandas.
Developers choose ParaText for its parallel reading capabilities, which significantly speed up data ingestion compared to single-threaded alternatives, and its flexible, memory-efficient output options like dictionary-based loading for tight memory budgets.
A library for reading text files over multiple cores.
Uses multiple CPU cores to read and parse a file concurrently, which reduces load times for large datasets; performance is the main focus of the README.
Supports loading into Python dictionaries of arrays, which consumes less memory than Pandas DataFrames, ideal for tight memory budgets, as demonstrated in the hepatitis.csv example.
Distinguishes between data types and semantic interpretations (numeric, categorical, text), allowing users to override inferred types with parameters like cat_names and text_names.
Can handle CSV files with quoted newlines in fields when explicitly enabled via allow_quoted_newlines=True, providing flexibility for messy data formats.
Supports a wide range of escape sequences, including Unicode code points and special characters, ensuring robust parsing of complex text, as detailed in the escape characters section.
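The general idea behind parallel reading and dictionary-of-columns output can be sketched with the Python standard library alone. This is a minimal illustration, not ParaText's implementation or API: the names chunk_ranges, parse_chunk, and parallel_columns are hypothetical, it assumes no quoted newlines in the file, and CPython threads will not give the speedup that a native C++ reader gets from real cores. It only shows the structure: split the file into byte ranges that end on line boundaries, parse each range independently, then merge the rows into a dictionary of columns.

```python
import csv
import io
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(path, num_chunks):
    """Split a file into (start, end) byte ranges that each begin at a line start."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    step = max(1, size // num_chunks)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            target = i * step
            if target <= bounds[-1] or target >= size:
                continue
            f.seek(target)
            f.readline()  # skip the partial line so the chunk starts on a full line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def parse_chunk(path, start, end):
    """Parse one byte range of the file into a list of CSV rows."""
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline="")))


def parallel_columns(path, num_chunks=4):
    """Parse chunks concurrently and merge them into a dict of column lists."""
    ranges = chunk_ranges(path, num_chunks)
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        parts = list(pool.map(lambda r: parse_chunk(path, *r), ranges))
    rows = [row for part in parts for row in part]
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    return {name: [row[i] for row in body] for i, name in enumerate(header)}


# Small demonstration on a generated file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("id,label\n")
        for i in range(1000):
            f.write(f"{i},row{i}\n")
    columns = parallel_columns(demo_path, num_chunks=4)

print(len(columns["id"]))
```

Because pool.map returns results in input order, the merged rows keep the order they had in the file. Values stay as strings here; type inference is left out of the sketch.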
The library is an alpha release with unimplemented features: there is no DateTime support, and users cannot supply exact data type hints, only semantic overrides.
Requires a C++11 compiler, specific SWIG versions, and several Python packages, which can be cumbersome to set up, especially on non-Linux systems or in older environments.
Enabling multi-line field support adds extra overhead to adjust chunk boundaries, as noted in the README, which may reduce parallel efficiency for certain files.
Currently supports only CSV files; other text formats such as JSON or TSV are not mentioned, which restricts its use to tabular data ingestion.
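The chunk-boundary overhead for multi-line fields can be illustrated with a short sketch. It is not ParaText's algorithm: safe_record_starts and next_safe_boundary are hypothetical names, and the sketch assumes RFC 4180 style quoting, where a literal quote inside a field is written as two double quotes (backslash-escaped quotes are not handled). The point is that a newline only ends a record when it falls outside quotes, so a boundary cannot be placed by jumping to the nearest newline; quote state has to be tracked first.

```python
import bisect


def safe_record_starts(data):
    """Return byte offsets where a record starts, ignoring newlines inside quotes."""
    starts = [0]
    in_quotes = False
    for i, byte in enumerate(data):
        if byte == 0x22:  # a double quote toggles state; a doubled quote toggles twice
            in_quotes = not in_quotes
        elif byte == 0x0A and not in_quotes and i + 1 < len(data):
            starts.append(i + 1)
    return starts


def next_safe_boundary(starts, target, size):
    """Move a proposed chunk boundary forward to the next true record start."""
    i = bisect.bisect_left(starts, target)
    return starts[i] if i < len(starts) else size


data = b'id,note\n1,"line one\nline two"\n2,plain\n'
starts = safe_record_starts(data)

target = 15  # a proposed split point that lands inside the quoted field
naive = data.index(b"\n", target) + 1  # nearest newline: still inside the quotes
safe = next_safe_boundary(starts, target, len(data))

print(naive, safe)
```

In this sample the naive boundary would cut the second record in half, while the adjusted boundary lands at the start of the third record. The scan over quote characters is the extra work that a reader with multi-line support has to pay for.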