A CLI tool and dataflow engine that lets you query and join data from multiple databases and file formats using SQL.
OctoSQL is a command-line tool and dataflow engine that provides a unified SQL interface for querying, joining, and transforming data from multiple databases and file formats. It addresses data fragmentation by letting users run SQL queries across heterogeneous sources, such as JSON, CSV, and Parquet files and relational databases, as if they were a single database.
OctoSQL is aimed at data engineers, analysts, and developers who need to query and join data across multiple formats and databases without building complex ETL pipelines, especially those working with streaming data or doing ad-hoc data analysis.
Developers choose OctoSQL for its ability to seamlessly join data across different sources using standard SQL, its extensible plugin architecture, and its built-in streaming capabilities with strong consistency guarantees, all while offering competitive performance for direct file queries.
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
Enables JOIN operations between disparate sources like CSV files and PostgreSQL tables using standard SQL, eliminating the need for manual ETL pipelines.
Handles infinite streams with event-time processing, watermarks, and internally consistent outputs, making it suitable for real-time aggregations and windowed queries.
Allows adding support for new databases (e.g., PostgreSQL, MySQL) via installable plugins, with a SQL interface for browsing and managing plugins.
Features union types, type assertions, and conversion functions (e.g., int(text)), providing robustness for heterogeneous and messy data schemas.
Offers visual query plans with predicate pushdown and join strategy selection (Stream Join, Lookup Join), helping users understand and tune performance.
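To make the two join strategies named above more concrete, here is a minimal conceptual sketch in Python. It is not OctoSQL's implementation or API. It assumes an in-memory SQLite table as a stand-in for a PostgreSQL table, a small inline CSV as the file source, and hypothetical names (ORDERS_CSV, lookup_join, hash_join) chosen only for illustration. The lookup-style join queries the database once per incoming record, so the key filter runs inside the database, while the hash-based join reads the table once and matches records in memory, which is roughly the idea behind a stream join.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: a CSV "file" of orders (illustration only).
ORDERS_CSV = """order_id,user_id,amount
1,10,25.0
2,11,40.0
3,10,15.5
4,12,9.0
"""

# The query being illustrated, in generic SQL (not OctoSQL-specific naming):
#   SELECT o.order_id, u.name, o.amount
#   FROM orders o JOIN users u ON o.user_id = u.id


def make_users_db():
    """Build an in-memory SQLite table standing in for a PostgreSQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(10, "ada"), (11, "linus")]
    )
    conn.commit()
    return conn


def read_orders():
    """Yield orders one record at a time, as a streaming file reader would."""
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield {
            "order_id": int(row["order_id"]),
            "user_id": int(row["user_id"]),
            "amount": float(row["amount"]),
        }


def lookup_join(orders, conn):
    """Lookup-style join: one keyed database query per incoming order."""
    for order in orders:
        cur = conn.execute(
            "SELECT name FROM users WHERE id = ?", (order["user_id"],)
        )
        for (name,) in cur:
            yield (order["order_id"], name, order["amount"])


def hash_join(orders, conn):
    """Hash-based join: read users once into a dict, then probe it per order."""
    users_by_id = {}
    for user_id, name in conn.execute("SELECT id, name FROM users"):
        users_by_id.setdefault(user_id, []).append(name)
    for order in orders:
        for name in users_by_id.get(order["user_id"], []):
            yield (order["order_id"], name, order["amount"])


conn = make_users_db()
lookup_result = sorted(lookup_join(read_orders(), conn))
hash_result = sorted(hash_join(read_orders(), conn))
print(lookup_result)
print(hash_result == lookup_result)
```

Both functions produce the same inner-join result; they differ only in where the matching work happens. A per-record lookup tends to suit a small stream joined against a large indexed table, while reading one side into memory avoids repeated round trips when both sides are of moderate size.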
Requires manual plugin installation and YAML configuration for databases, adding complexity compared to tools with built-in connectors.
Benchmarks show it's slower than DataFusion for CSV queries and relies on caching for competitive speeds, indicating limitations in raw throughput.
Lacks a graphical user interface or web-based IDE, making it less accessible for non-technical users or collaborative workflows.
The plugin repository is limited compared to established frameworks like Apache Spark, and external contributions to core code are not accepted.
SQL powered operating system instrumentation, monitoring, and analytics.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.
Data pipelines for cloud config and security data. Build cloud asset inventory, CSPM, FinOps, and vulnerability management solutions. Extract from AWS, Azure, GCP, and 70+ cloud and SaaS sources.