A curated collection of datasets, corpora, and resources for Indonesian natural language processing tasks.
Indonesian NLP resources is a curated collection of datasets, corpora, and benchmarks for natural language processing tasks in the Indonesian language. It aggregates resources for language modeling, sentiment analysis, machine translation, speech recognition, and other NLP domains. The project serves as a centralized reference to support research and development in Indonesian language technologies.
It is aimed at researchers, data scientists, and developers working on Indonesian natural language processing, machine learning, or linguistics projects. It is particularly useful for those building or evaluating Indonesian language models, translation systems, or speech recognition tools.
It saves significant time by aggregating scattered Indonesian NLP resources into one organized list, complete with descriptions and citations. Unlike generic NLP resource lists, it focuses specifically on Indonesian, addressing the unique challenges of low-resource language processing.
A list of Indonesian NLP resources.
Organizes resources by NLP task, such as language modeling, sentiment analysis, and speech recognition, with a dedicated section for each task in the README.
Each entry includes academic citations and source links, such as paper links for datasets like IDENTICv1.0, ensuring proper attribution and facilitating deeper research.
Aggregates Indonesian-specific resources, addressing gaps for a language with fewer datasets than English, including benchmarks like IndoNLU for standardized evaluation.
Features extensive datasets like OSCAR and CC-100 with billions of tokens, providing substantial data for training robust language models.
It is purely a curated list with no code examples, tools, or integration guides, so users must handle data downloading, preprocessing, and model training themselves.
Some resources have restrictive licenses, such as the TITML-IDN speech corpus requiring formal academic requests, complicating commercial or quick usage.
Datasets range from professionally curated ones to small, disclaimer-laden projects like the 'Indonesian Speech Recognition' school project, which may not be reliable for production tasks.
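Because the list ships no code, the preprocessing it leaves to users might start with something like the sketch below. It is not taken from the list or from any of its datasets: the function name `clean_corpus_lines`, the `min_chars` threshold, and the sample sentences are all hypothetical, and it assumes a corpus that has already been downloaded as plain text with one sentence or paragraph per line.

```python
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def clean_corpus_lines(lines: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield normalized, de-duplicated lines from a raw text corpus.

    Hypothetical helper: the 20-character cutoff is an arbitrary
    illustrative default, not a recommendation from any dataset author.
    """
    seen = set()
    for raw in lines:
        # NFKC folds compatibility characters (e.g. full-width forms).
        text = unicodedata.normalize("NFKC", raw)
        # Collapse runs of whitespace and trim the ends.
        text = _WHITESPACE.sub(" ", text).strip()
        # Drop fragments too short to be useful training text.
        if len(text) < min_chars:
            continue
        # Case-insensitive exact-match de-duplication, keeping first occurrence.
        key = text.casefold()
        if key in seen:
            continue
        seen.add(key)
        yield text


# Made-up sample lines standing in for a downloaded corpus file.
sample = [
    "Saya   sedang belajar pemrosesan bahasa alami.\n",
    "saya sedang belajar pemrosesan bahasa alami.",
    "Halo",
    "Cuaca di Bandung hari ini cukup cerah.",
]
cleaned = list(clean_corpus_lines(sample))
```

Here the second line is dropped as a case-insensitive duplicate and "Halo" is dropped as too short. Web-crawled or hobby-project data would likely need more than this, for example language identification or near-duplicate detection.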