A library enabling Apache Spark to read from and write to Apache HBase tables as external data sources using DataFrames and SQL.
Apache Spark - Apache HBase Connector (SHC) is a library that enables Apache Spark to read from and write to Apache HBase tables as external data sources. It allows users to perform Spark SQL queries, DataFrame operations, and Dataset transformations directly on HBase data, bridging the gap between Spark's analytical processing and HBase's scalable storage.
SHC is aimed at data engineers and developers who process data with Apache Spark and store it in Apache HBase, particularly in environments that need SQL-like queries over NoSQL data.
It provides a high-performance, optimized connector that leverages Spark's Catalyst optimizer for predicate pushdown, partition pruning, and data locality, making HBase data accessible through familiar Spark APIs without custom low-level code.
The Apache Spark - Apache HBase Connector is a library that lets Spark access HBase tables as an external data source or sink.
Leverages Spark's Catalyst optimizer for predicate pushdown and partition pruning, reducing data transfer by pushing filters down to the HBase storage layer.
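The idea behind partition pruning can be illustrated with a small, self-contained sketch. This is a toy model, not SHC's actual implementation: the `prune_regions` function and the sample region boundaries are hypothetical, and it only assumes that each HBase region covers a half-open row key range in which an empty key means "unbounded".

```python
from typing import List, Optional, Tuple

# Each region covers the half-open row key range [start, end).
# An empty byte string stands for an unbounded side, as with the
# first and last regions of an HBase table.
Region = Tuple[bytes, bytes]


def prune_regions(
    regions: List[Region],
    lower: Optional[bytes] = None,
    upper: Optional[bytes] = None,
) -> List[Region]:
    """Return only the regions that can hold row keys in [lower, upper)."""
    selected = []
    for start, end in regions:
        # Skip regions that end at or before the lower bound of the filter.
        if lower is not None and end != b"" and end <= lower:
            continue
        # Skip regions that start at or after the upper bound of the filter.
        if upper is not None and start != b"" and start >= upper:
            continue
        selected.append((start, end))
    return selected


# Hypothetical table split into four regions.
regions = [(b"", b"g"), (b"g", b"n"), (b"n", b"t"), (b"t", b"")]

# A filter such as  key >= 'h' AND key < 'p'  touches two regions only.
hits = prune_regions(regions, lower=b"h", upper=b"p")
print(hits)
```

With a row key filter in place, only the overlapping regions need to be scanned; without one, every region stays in the plan.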
Co-locates Spark executors with HBase region servers when possible, minimizing network overhead and improving scan performance in clustered setups.
Uses JSON catalogs to define table schemas and column mappings, supporting primitive types as well as extensible encodings such as Avro for adaptable modeling.
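A catalog of this kind might look like the output of the sketch below. The `build_catalog` helper and the table, column family, and field names are hypothetical and chosen for illustration; the key layout (`table`, `rowkey`, `columns` with `cf`, `col`, and `type`) follows the structure shown in the project's README, so check it against the version in use.

```python
import json
from typing import Dict, Tuple


def build_catalog(
    table: str,
    rowkey: str,
    columns: Dict[str, Tuple[str, str, str]],
    namespace: str = "default",
) -> str:
    """Build a catalog JSON string.

    `columns` maps a DataFrame field name to a tuple of
    (column family, column qualifier, type name). The row key column
    uses the special column family name "rowkey".
    """
    mapped = {
        field: {"cf": cf, "col": qualifier, "type": dtype}
        for field, (cf, qualifier, dtype) in columns.items()
    }
    return json.dumps(
        {
            "table": {"namespace": namespace, "name": table},
            "rowkey": rowkey,
            "columns": mapped,
        },
        indent=2,
    )


# Illustrative schema: one row key field and two ordinary columns.
catalog = build_catalog(
    table="table1",
    rowkey="key",
    columns={
        "col0": ("rowkey", "key", "string"),
        "col1": ("cf1", "col1", "boolean"),
        "col2": ("cf2", "col2", "double"),
    },
)
print(catalog)
```

The resulting string is typically passed to Spark as the `catalog` option together with the data source format `org.apache.spark.sql.execution.datasources.hbase`; that step needs the SHC jar and a reachable HBase cluster, so it is left out of the sketch.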
Includes configurations for Kerberos-enabled environments and SHCCredentialsManager for multi-cluster access, essential for enterprise security needs.
The README lists complex data types and composite key support under its TODO section, which limits advanced use cases such as nested structures.
Kerberos configuration requires manual steps such as setting SPARK_CLASSPATH and managing keytabs, which can be error-prone and increase operational overhead.
Depends on specific HBase versions (e.g., 1.1.2 by default) and may need extra jars for features like Phoenix, adding maintenance and compatibility challenges.