Parquet

25 projects

duckdb (C++)

An in-process analytical SQL database management system designed for high-performance data analysis.

#parquet #python-integration #database

A scalable time series database optimized for real-time metrics, events, and analytics with fast query response.

#influxql #event-processing #parquet

A scalable time series database optimized for real-time metrics, events, and analytics with fast query response.

#influxql #parquet #sql-engine

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars: 8.9k

Forks: 2.1k

Last commit: 22 hours ago

dsq (Go)

A command-line tool for running SQL queries against JSON, CSV, Excel, Parquet, and other structured data files.

#parquet #cli-tool #openoffice-calc

Stars: 3.9k

Forks: 162

Last commit: 2 years ago

QSV (Rust)

A blazing-fast command-line toolkit for querying, slicing, analyzing, transforming, and validating tabular data (CSV, Excel, JSONL, etc.).

#ckan #parquet #luau

Stars: 3.7k

Forks: 104

Last commit: 20 hours ago

Tabiew (Rust)

A lightweight TUI application for viewing and querying tabular data files like CSV, Parquet, and JSON with SQL support.

#terminal-application #parquet #tui

Stars: 3.0k

Forks: 85

Last commit: 2 days ago

Gaffer (Java)

A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.

#apache-spark #parquet #entity-relation

Stars: 1.8k

Forks: 363

Last commit: 1 year ago

Tonbo (Rust)

An embedded database for serverless and edge runtimes, storing data as Parquet on S3 with stateless compute.

#parquet #database #offline-first

Stars: 1.6k

Forks: 100

Last commit: 6 days ago

ADAM (Scala)

A genomics analysis platform that uses Apache Spark to parallelize genomic data processing across clusters, replacing traditional file-based workflows.

#genomic-data #apache-spark #parquet

Stars: 1.1k

Forks: 312

Last commit: 4 months ago

Cinchoo ETL (C#)

A simple, fast, and flexible ETL framework for .NET with built-in readers and writers for CSV, JSON, XML, Parquet, and more.

#parquet #cinchoo-etl #flat

Stars: 859

Forks: 141

Last commit: 1 month ago

Ookla internet speed data (Jupyter Notebook)

Global open dataset of aggregated fixed and mobile network performance metrics (download/upload/latency) in geospatial tiles.

#parquet #geospatial-data #gis

Stars: 310

Forks: 59

Last commit: 3 months ago

parquet (Go)

A Go library that generates type-safe Parquet readers and writers from Go structs or existing Parquet files.

#parquet #data-serialization #dremel

Stars: 127

Forks: 13

Last commit: 1 year ago

Scylla-Migrator (Scala)

A Spark application for migrating data to ScyllaDB from CQL-compatible databases or DynamoDB via Alternator.

#apache-spark #parquet #migration

Stars: 73

Forks: 50

Last commit: 7 days ago

ByteHub (Python)

An easy-to-use Python feature store for machine learning, optimized for timeseries data and built on Dask.

#parquet #data-science #data-engineering

Stars: 61

Forks: 4

Last commit: 5 years ago

Parquet (PHP)

A pure PHP library for reading and writing Parquet columnar storage files without external dependencies.

#parquet #file-format #data-engineering

Stars: 60

Forks: 3

Last commit: 11 days ago

insyra (Go)

A next-generation data analysis library for Go, offering parallel processing, data visualization, and seamless Python integration as an alternative to Pandas.

#parquet #python-integration #plot

Stars: 54

Forks: 2

Last commit: 6 days ago

cl-duckdb (Common Lisp)

A Common Lisp CFFI wrapper around the DuckDB C API.

#parquet #lisp #data-science

Stars: 53

Forks: 2

Last commit: 1 month ago

A Whirlwind Tour of Common Crawl's Datasets using Python (Python)

A Python tutorial demonstrating how to access and process Common Crawl's web archive datasets (WARC, WET, WAT) using tools like warcio, cdx_toolkit, and DuckDB.

#data-indexing #parquet #cdx-index

Stars: 45

Forks: 9

Last commit

/load-llms-txt (Jupyter Notebook)

A public dataset of Ethereum network events including beacon chain, mempool, and canonical chain data for analysis.

#parquet #mev #p2p-network

Stars: 33

Forks: 4

Last commit: 21 hours ago

aircraft-flight-schedules

Global flight schedule datasets extracted from ADS-B position transmissions, published quarterly from 2024 onwards.

#parquet #aviation-data #python

Stars: 26

Forks: 0

Last commit: 11 days ago

GTFS-Realtime-Capsule (Python)

A command-line tool that scrapes, normalizes, and archives real-time public transit (GTFS Realtime) data for historical analysis.

#parquet #data-scraping #data-archiving

Stars: 16

Forks: 3

Last commit: 11 months ago

Bread Dataset Viewer (TypeScript)

A VS Code extension for viewing large datasets (JSONL/Parquet/CSV) instantly without crashes, with 16 production LLM tokenizers for accurate token counting.

#pyarrow #parquet #dataset-viewer

Stars: 7

Forks: 2

Last commit