Text Processing

Ambrosia (Go)

A cross-platform CLI tool for cleaning and improving text datasets for machine learning, with fast operations and LLM-based filtering.

#go-application #llm-filtering #cli-tool

Stars: 113

Forks: 2

Microsoft.PowerShell.UnixCompleters (C#)

A collection of PowerShell modules for remoting, secret management, and text utilities, published to PowerShellGallery.com.

#remoting #text-processing #cmdlets

Stars: 113

Forks: 24

utf8 (C)

An R package for robust UTF-8 text processing, fixing bugs in R's native Unicode handling.

#emoji #unicode #r-package

Stars: 112

Forks: 5

Last commit: 22 hours ago

lemmatizer (Ruby)

A Ruby gem for lemmatizing English text, converting inflected words to their base dictionary forms.

#text-analysis #nlp-tools #lemmatization

Stars: 112

Forks: 15

emojize (CSS)

A Node.js utility that converts Unicode emoji characters to HTML image tags with high-resolution sprites.

#unicode #html-generation #emoji-conversion

Stars: 110

Forks: 21

Last commit: 11 years ago

exmoji (Elixir)

An Elixir/Erlang library providing low-level operations for handling Emoji glyphs in the Unicode standard.

#functional-programming #emoji #elixir

Stars: 105

Forks: 29

Last commit: 1 year ago

Url highlight (PHP)

A PHP library for parsing, validating, and highlighting URLs in text strings, including HTML and Markdown conversion.

#html-highlighting #regex #linkify

Stars: 103

Forks: 1

Last commit: 2 days ago

FLRE (Pascal)

A fast, safe, and efficient regular expression library for Object Pascal with Unicode support and multiple optimized subengines.

#unicode #object-pascal #c-api

Stars: 103

Forks: 25

Last commit: 2 months ago

babel (Common Lisp)

A pure Common Lisp library for charset encoding and decoding, similar to GNU libiconv.

#unicode #pure-lisp #encoding-decoding

Stars: 102

Forks: 30

Last commit: 10 months ago

porter-stemmer (JavaScript)

A Node.js implementation of Martin Porter's stemming algorithm for removing morphological endings from English words.

#commonjs #information-retrieval #natural-language-processing

Stars: 102

Forks: 12

Rling (C)

A faster, multi-threaded, feature-rich alternative to the rli utility for line removal, deduplication, and frequency analysis on large text files.

#deduplication #command-line-tool #gzip-support

Stars: 101

Forks: 15

Last commit: 21 days ago

wildmatch (Rust)

A Rust library for simple string matching with single- and multiple-wildcard operators.

#globbing #rust-lang #matching-algorithm

Stars: 98

Forks: 19

Last commit: 22 days ago

cmark (C)

An Elixir NIF binding for cmark, a CommonMark-compliant Markdown parser library written in C.

#cmark #hex #elixir

Stars: 97

Forks: 13

go-slugify (Go)

A Go library and CLI tool for converting strings into URL-friendly slugs with Unicode support.

#url-friendly #open-source #slug-generation

Stars: 97

Forks: 9

Last commit: 6 years ago

Parsing With Haskell Parser Combinators (Haskell)

A step-by-step guide to parsing using Haskell parser combinators, with practical examples for version numbers and SRT subtitles.

#parsing #haskell-learning #haskell

A Go library providing utilities for Persian language text processing, including digit conversion, keyboard layout switching, and currency formatting.

#farsi #number #currency-formatting

Stars: 94

Forks: 14

pragmatic_tokenizer (Ruby)

A multilingual Ruby gem for splitting strings into tokens with extensive language support and configurable options.

#nlp-library #text-analysis #multilingual

Stars: 93

Forks: 11

Last commit: 1 year ago

ruby-nlp (Ruby)

Ruby bindings for Stanford NLP tools providing part-of-speech tagging and named entity recognition capabilities.

#part-of-speech-tagging #nlp-tools #natural-language-processing

Stars: 92

Forks: 14

Last commit: 12 years ago

goregen (Go)

A Go library for generating random strings that match a given regular expression pattern.

#developer-tools #regex #go-library

Stars: 92

Forks: 24

levenshtein (Go)

A Go package for calculating Levenshtein distance and similarity metrics with customizable edit costs and prefix bonuses.

Stars: 92

Forks: 8

punkt-segmenter (Ruby)

A Ruby port of the NLTK Punkt algorithm for unsupervised, language-independent sentence boundary detection.

#nlp-library #sentence-boundaries #nltk

Stars: 91

Forks: 9

Last commit: 8 years ago

chinese_translation (Elixir)

An Elixir module for translating between simplified and traditional Chinese, converting to pinyin, and slugifying Chinese text.

#pinyin #elixir #unicode

Stars: 91

Forks: 11

Last commit: 8 years ago

Verbal-Exprejon (Clojure)

A Clojure library for building complex regexes using a fluent, composable API without writing regex syntax.

#functional-programming #regex-builder #dsl

Stars: 90

Forks: 2

Last commit: 10 years ago

segment (Go)

A Go library for Unicode text segmentation at word boundaries as defined by Unicode Standard Annex #29.

#unicode #word-boundaries #ragel

Stars: 89

Forks: 15

crustache (Crystal)

A Crystal implementation of Mustache logic-less templates, compliant with the Mustache spec v1.1.2+λ.

#template #crystal-library #template-engine

Stars: 88

Forks: 13

Markdown (C)

An Elixir library that converts Markdown to HTML using a NIF binding to the Hoedown C library.

#elixir #hoedown #library

Stars: 88

Forks: 18

Last commit: 6 years ago

morse (Go)

A Go library for encoding and decoding text to and from Morse code.

#morse-code #library #morse

Stars: 87

Forks: 11

html_entities (Elixir)

An Elixir module for decoding and encoding HTML entities in strings.

#elixir #text-processing #decoding

Stars: 87

Forks: 22

Slugify (HTML)

A jQuery plugin that automatically generates URL slugs from input fields as users type, similar to Django's slugify function.

#web-forms #slug-generation #url-slug

Stars: 87

Forks: 40

Last commit: 10 years ago

align (Go)

A Go library and CLI tool for aligning delimited text with customizable justification, padding, and column filtering.

#open-source #library #alignment

Stars: 84

Forks: 7

veritaserum (Elixir)

Simple sentiment analysis for Elixir based on AFINN-165 with emoji, booster, and negator support.

#hex #emoji-analysis #elixir

Stars: 83

Forks: 10

the_fuzz (Elixir)

A collection of fuzzy string matching algorithms and phonetic metrics for Elixir, including Levenshtein, Jaro-Winkler, Soundex, and more.

#phonetic-algorithms #metaphone #similarity-measurement

Stars: 82

Forks: 10

Last commit: 1 year ago

gounidecode (Go)

A Go library for transliterating Unicode text to ASCII equivalents, similar to Python's unidecode.

#ascii-conversion #unicode #internationalization

Stars: 80

Forks: 21

Last commit: 10 years ago

hotwater (Ruby)

A fast Ruby FFI gem providing C implementations of string edit distance algorithms like Levenshtein, Jaro-Winkler, and N-Gram.

#n-gram #ffi #ruby-gem

Stars: 80

Forks: 1

Last commit: 13 years ago

stopwords-filter (Ruby)

A Ruby gem for filtering stopwords from text with built-in support for multiple languages via Snowball lists.

#stopwords #ruby-gem #natural-language-processing

Stars: 80

Forks: 54