A library for parsing and querying XML data with Apache Spark SQL and DataFrames.
spark-xml is an open-source library that provides XML data source functionality for Apache Spark, allowing users to read and write XML files as Spark DataFrames. It solves the problem of processing large, format-free XML files in a distributed manner, which isn't fully supported by Spark's built-in data sources. The library offers extensive configuration options for handling XML complexities like attributes, namespaces, and schema inference.
Data engineers and data scientists working with XML data in Apache Spark pipelines, particularly those needing to process large XML datasets in distributed computing environments.
Developers choose spark-xml because it provides production-ready XML processing for Spark with robust error handling, schema flexibility, and multi-language API support. Its planned integration into Apache Spark 4.0 also points to long-term compatibility and community support.
XML data source for Spark SQL and DataFrames
Enables reading and writing XML files in distributed filesystems as Spark DataFrames. Unlike Spark's built-in JSON data source, which by default expects one record per line, it handles format-free XML in which a single record can span many lines.
Supports both schema inference with configurable sampling ratio and explicit user-defined schemas, allowing precise control over data types and structure.
Offers three parsing modes for corrupt records (PERMISSIVE, DROPMALFORMED, FAILFAST), with the option to store malformed data in a dedicated column such as _corrupt_record.
Provides optional validation of XML rows against XSD schemas using rowValidationXSDPath and includes a utility to extract schemas from XSD files, enhancing data quality checks.
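The reading features above can be sketched in Python as follows. This is a minimal illustration, not the library's own code: the helper names build_xml_read_options and read_xml are hypothetical, and the sketch assumes the spark-xml package is on the Spark classpath and that spark is an existing SparkSession. The option keys (rowTag, mode, samplingRatio, columnNameOfCorruptRecord, rowValidationXSDPath) are the ones the library documents.

```python
def build_xml_read_options(row_tag, mode="PERMISSIVE", sampling_ratio=1.0,
                           corrupt_column="_corrupt_record", xsd_path=None):
    """Collect spark-xml reader options into a plain dict (hypothetical helper)."""
    allowed_modes = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}
    if mode not in allowed_modes:
        raise ValueError(f"mode must be one of {sorted(allowed_modes)}")
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")

    options = {
        "rowTag": row_tag,                            # element treated as one row
        "mode": mode,                                 # how corrupt records are handled
        "samplingRatio": str(sampling_ratio),         # fraction of rows sampled for inference
        "columnNameOfCorruptRecord": corrupt_column,  # where malformed data is stored
    }
    if xsd_path is not None:
        # Optional per-row validation against an XSD file.
        options["rowValidationXSDPath"] = xsd_path
    return options


def read_xml(spark, path, options, schema=None):
    """Read XML into a DataFrame; `spark` is assumed to be a SparkSession."""
    reader = spark.read.format("xml").options(**options)
    if schema is not None:
        # An explicit schema replaces schema inference.
        reader = reader.schema(schema)
    return reader.load(path)
```

With a live session, a call might look like read_xml(spark, "books.xml", build_xml_read_options("book", mode="FAILFAST")), where books.xml is a placeholder path. Keeping the options in a plain dict makes the configuration easy to inspect or log before any Spark job runs.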
The XSDToSchema utility supports only simple, complex, and sequence types and only basic XSD functionality, making it inadequate for XML schemas that rely on advanced features.
XML parsing is not fully namespace-aware, and the ignoreNamespace option cannot be applied to the rowTag element, which can cause parsing issues in documents with namespace prefixes.
Advanced functions like from_xml are only available in the Scala API, requiring Python developers to write custom helper functions to access these features, as detailed in the README's Pyspark notes.
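One way to work around the namespace limitation noted above is to keep the prefix in the rowTag value while ignoring namespaces on child elements. The sketch below is illustrative only: namespaced_row_options is a hypothetical helper, and it assumes the row element appears in the document with a literal prefix (for example ns:book) that rowTag can match as written.

```python
def namespaced_row_options(local_name, prefix=None, ignore_child_namespaces=True):
    """Build reader options for a row element that may carry a namespace prefix.

    Hypothetical helper. ignoreNamespace does not apply to the rowTag element,
    so the prefix (if any) is kept in the rowTag value itself.
    """
    row_tag = f"{prefix}:{local_name}" if prefix else local_name
    return {
        "rowTag": row_tag,
        # Strips prefixes from child elements and attributes only.
        "ignoreNamespace": "true" if ignore_child_namespaces else "false",
    }
```

The resulting dict can be passed to a reader in the same way as any other set of spark-xml options.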