An R package for joining data frames on inexact matches, using string distance, regular expressions, numeric tolerance, and other fuzzy criteria.
fuzzyjoin is an R package that provides functions for joining data frames based on inexact matching criteria rather than exact equality. It solves the common data cleaning problem where identifiers across datasets have slight variations, such as misspellings, formatting differences, or numeric approximations. The package integrates with dplyr and supports multiple matching methods including string distance, regular expressions, and geographic proximity.
fuzzyjoin is aimed at R data scientists and analysts who need to combine datasets whose keys do not match exactly, particularly those working with text data, survey responses, geographic data, or other data that calls for fuzzy matching logic.
Developers choose fuzzyjoin because it extends the familiar dplyr join syntax to handle real-world data inconsistencies, offers several matching methods behind one consistent interface, and fits into existing tidyverse data manipulation workflows.
Join tables together on inexact matching
Functions such as stringdist_inner_join mirror dplyr's join syntax and work inside pipes, so they slot into existing tidyverse workflows without a new paradigm to learn.
Supports multiple fuzzy criteria including string distances, regex patterns, and geographic proximity, covering common data cleaning scenarios like misspelling correction and text classification.
The fuzzy_join wrapper enables developers to define bespoke matching functions, offering flexibility for domain-specific joining needs beyond the built-in methods.
The optional distance_col argument adds a column holding the calculated distance for each matched pair, which helps when assessing match quality and tuning the distance threshold.
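As a minimal sketch of the dplyr-style syntax and the distance_col argument described above, the following joins two small tables on misspelled city names. The data frames, their column names, and the max_dist value are hypothetical and chosen only for illustration; the function and argument names are those of fuzzyjoin's stringdist_inner_join.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical survey data with misspelled city names
survey <- tibble(
  respondent = 1:3,
  city = c("Bostn", "Chicago", "Seatle")
)

# Hypothetical reference table with the canonical spellings
reference <- tibble(
  name = c("Boston", "Chicago", "Seattle"),
  state = c("MA", "IL", "WA")
)

# Join rows whose strings are within one edit of each other and
# record the computed distance in a new column called "dist"
matched <- survey %>%
  stringdist_inner_join(
    reference,
    by = c(city = "name"),
    max_dist = 1,
    distance_col = "dist"
  )

print(matched)
```

Inspecting the dist column for borderline pairs is one way to decide whether max_dist is too strict or too loose for a given dataset.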
Fuzzy matching, especially string distance calculation, is computationally intensive and can be slow on large datasets; the README notes that shortcuts for common cases, such as ruling out pairs by length difference, are not yet implemented.
Basic examples are provided, but guidance on optimizing joins or writing complex custom match functions is limited, so users may need trial and error or a deeper knowledge of the package.
Relies on packages like stringdist for metrics, introducing potential version conflicts or maintenance issues if dependencies change or lack specific features needed for custom joins.
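Since guidance on custom match functions is limited, here is a hedged sketch of how the fuzzy_join wrapper can take a user-defined predicate. It assumes the match function receives two equal-length vectors of candidate values and returns a logical vector; the tables, column names, and the within_tolerance helper are hypothetical.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical sensor readings and target values
readings <- tibble(
  sensor = c("a", "b", "c"),
  value = c(10.0, 20.4, 35.0)
)
targets <- tibble(
  label = c("low", "mid"),
  target = c(10.3, 20.0)
)

# Hypothetical predicate: TRUE where two numbers differ by at most 0.5.
# It is vectorized, returning one logical per candidate pair.
within_tolerance <- function(x, y) abs(x - y) <= 0.5

joined <- fuzzy_join(
  readings,
  targets,
  by = c(value = "target"),
  match_fun = within_tolerance,
  mode = "inner"
)

print(joined)
```

For plain numeric tolerance the built-in difference joins cover the same case; the point of the sketch is only the shape of a custom predicate, which could encode any domain-specific rule.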