A Clojure/ClojureScript library for building self-contained natural language parsers using part-of-speech tagging and semantic rules.
Postagga is a natural language processing library written in pure Clojure and ClojureScript. It allows developers to build custom parsers that convert free-form text into structured data using part-of-speech tagging and semantic rules. The library solves the problem of understanding unstructured user input in applications like chatbots or command-line interfaces without relying on external NLP services.
Postagga is aimed at Clojure and ClojureScript developers building chatbots, command interpreters, or other applications that require natural language understanding. It is particularly useful for those who need lightweight, embeddable parsers that run on both servers and browsers.
Developers choose Postagga because it offers a pure Clojure/ClojureScript solution with no external dependencies, enabling self-contained parsers that are portable and easy to deploy. Its rule-based approach provides fine-grained control over language understanding, unlike black-box NLP APIs.
A Library to parse natural language in pure Clojure and ClojureScript
Postagga produces self-contained parsers with zero external dependencies that run on both JVM Clojure and browser ClojureScript, a portability the README emphasizes.
Developers can define precise semantic rules as state machines to extract structured data from tagged sentences, offering fine-grained control over language understanding, evidenced by the detailed rule examples in the parser section.
Includes ready-to-use models for English and French derived from annotated corpora such as FrameNet and the Free French Treebank, accessible via namespaces for easy embedding in ClojureScript projects.
Enhances part-of-speech tagging by patching unknown words with custom dictionaries (e.g., for proper nouns), improving reliability as described in the patching workflow section.
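The rule-based extraction and dictionary patching described above can be illustrated with a minimal sketch in plain Clojure. It does not use Postagga's API or its rule syntax (keywords such as :get-value and :!OR! are not modelled here). The function names `patch-tags` and `run-rule`, the rule format, and the tags "VB", "NNP", "NN", "RB" and "UNK" are all assumptions made for illustration.

```clojure
;; Illustrative sketch only: NOT Postagga's API or rule format.

(defn patch-tags
  "Overrides the tag of every word found in `dictionary` (a map of word -> tag).
   Words absent from the dictionary keep the tag assigned by the tagger."
  [dictionary tagged]
  (mapv (fn [[word tag]] [word (get dictionary word tag)]) tagged))

(defn run-rule
  "Walks `steps` over `tagged` tokens, consuming one token per step.
   Each step is a map with a set of accepted :tags and an optional :capture key.
   Returns a map of captured words, or nil as soon as a step fails to match."
  [steps tagged]
  (loop [steps steps, tokens tagged, captured {}]
    (cond
      (empty? steps)  captured   ; every step matched; trailing tokens are ignored
      (empty? tokens) nil        ; ran out of words before the rule finished
      :else
      (let [{:keys [tags capture]} (first steps)
            [word tag]             (first tokens)]
        (when (contains? tags tag)
          (recur (rest steps)
                 (rest tokens)
                 (if capture (assoc captured capture word) captured)))))))

;; Hypothetical tagger output in which a proper noun was not recognised.
(def tagged [["Call" "VB"] ["Rafik" "UNK"] ["tomorrow" "NN"]])

;; Hypothetical custom dictionary used to patch the unknown word.
(def dictionary {"Rafik" "NNP"})

;; Hypothetical rule: a verb, then a proper noun (contact), then a noun or adverb (when).
(def call-rule
  [{:tags #{"VB"}}
   {:tags #{"NNP"}     :capture :contact}
   {:tags #{"NN" "RB"} :capture :when}])

(run-rule call-rule (patch-tags dictionary tagged))
;; => {:contact "Rafik", :when "tomorrow"}

(run-rule call-rule tagged)
;; => nil (the unpatched "UNK" tag does not satisfy the second step)
```

The second call shows why patching matters in this sketch: without the dictionary entry, the unknown word's tag never satisfies the proper-noun step, so the rule extracts nothing.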
Defining parser rules involves complex state-machine concepts with steps, states, and keywords like :get-value and :!OR!, which the README admits can be confusing and error-prone for developers.
Only provides pre-trained models for English and French from specific corpora with licensing restrictions, and training new models requires annotated data, limiting out-of-the-box usability for other languages.
Models are large data structures held in vars, so fully realizing them risks memory issues, as the README warns ('avoid realizing all of them like printing in your REPL!!!'). The tokenizers may also lack optimization for production-scale text processing.