A Spark library for reading from and writing to Google BigQuery using DataFrames and SQL.
spark-bigquery is an open-source library that enables Apache Spark to read data from and write data to Google BigQuery directly. It provides Spark SQL and DataFrame APIs for interacting with BigQuery tables, allowing users to run queries and perform distributed data processing without manual data transfers. The library handles GCP authentication, data type mappings, and optimizations for BigQuery's storage system.
The library is aimed at data engineers and data scientists working in GCP environments who use Spark for large-scale data processing and need to integrate with BigQuery datasets. It is particularly useful for teams running Spark on Google Cloud Dataproc clusters.
Developers choose spark-bigquery for its native Spark integration, which simplifies ETL pipelines by eliminating the need for intermediate storage when moving data between Spark and BigQuery. It offers a programmatic alternative to manual BigQuery exports and imports. Note, however, that the project is in maintenance mode with best-effort support.
Google BigQuery support for Spark, SQL, and DataFrames
Loads BigQuery tables directly into Spark DataFrames, eliminating manual data exports and simplifying ETL workflows.
Supports configurable GCP credentials and project settings via methods like setGcpJsonKeyFile, easing secure access setup.
Manages Avro namespaces for writing nested records to BigQuery, with specific configuration options detailed in the README.
Enables direct DataFrame reads and writes to BigQuery tables, reducing pipeline complexity without intermediate storage.
The project is in maintenance mode with best-effort support, so responses to issues may be slow and active feature development should not be expected.
Supports only the legacy SQL dialect for BigQuery queries, not the standard SQL dialect that BigQuery now recommends.
Has known limitations, such as unsupported arrays of arrays (noted in the README), which restrict the handling of complex nested data.
Requires careful Avro namespace configuration for nested records, adding setup complexity and potential for errors.
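Because arrays of arrays are listed above as unsupported, it can help to check a DataFrame's schema before attempting a write. The following is a minimal, dependency-free sketch in Python, not part of spark-bigquery. It assumes the schema is available in the JSON form that Spark produces for a StructType (for example via the schema's json method), and the helper name find_nested_arrays is hypothetical.

```python
import json


def find_nested_arrays(dtype, path=""):
    """Return dotted paths of fields whose type is an array directly inside an array.

    `dtype` is a Spark data type in JSON form: a plain string for primitive
    types (e.g. "long"), or a dict whose "type" is "struct", "array", or "map".
    """
    found = []
    if not isinstance(dtype, dict):
        return found  # primitive type: nothing to inspect

    kind = dtype.get("type")
    if kind == "struct":
        for field in dtype["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            found += find_nested_arrays(field["type"], child)
    elif kind == "array":
        element = dtype["elementType"]
        if isinstance(element, dict) and element.get("type") == "array":
            found.append(path)  # array of arrays: report once and stop descending
        else:
            found += find_nested_arrays(element, path)
    elif kind == "map":
        found += find_nested_arrays(dtype["valueType"], path)
    return found


# Illustrative schema (made up for this example) in Spark's JSON layout.
schema = json.loads("""
{"type": "struct", "fields": [
  {"name": "id", "type": "long", "nullable": false, "metadata": {}},
  {"name": "tags",
   "type": {"type": "array", "elementType": "string", "containsNull": true},
   "nullable": true, "metadata": {}},
  {"name": "matrix",
   "type": {"type": "array",
            "elementType": {"type": "array", "elementType": "double",
                            "containsNull": true},
            "containsNull": true},
   "nullable": true, "metadata": {}}
]}
""")

print(find_nested_arrays(schema))  # ['matrix']
```

An empty result suggests the schema avoids this particular limitation. Any reported field would need to be flattened or wrapped in a record before writing, and the check says nothing about other restrictions such as the Avro namespace configuration.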