Question 1

How do I get started with Indonesian NLP using these resources?

Accepted Answer

Start by identifying your task, for example sentiment analysis, then look in the matching section of the list for a dataset such as Aspect and Opinion Terms Extraction for Hotel Reviews. Download the data, load it with a framework such as Hugging Face, and consult the paper cited alongside each dataset for implementation guidance.
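As a rough sketch of the step right after downloading, the snippet below reads a tab-separated sentiment file and counts the labels before any modeling. The file layout (a `text` column and a `label` column) and the sample rows are assumptions made for illustration; each dataset in the list uses its own format, so check its documentation first.

```python
import csv
import io
from collections import Counter


def read_labeled_tsv(handle):
    """Read (text, label) pairs from a TSV with 'text' and 'label' columns (assumed layout)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["text"], row["label"]) for row in reader]


def label_counts(rows):
    """Count examples per label to spot class imbalance early."""
    return Counter(label for _, label in rows)


# Hypothetical sample standing in for a downloaded file.
SAMPLE = (
    "text\tlabel\n"
    "kamarnya bersih dan nyaman\tpositive\n"
    "pelayanannya lambat\tnegative\n"
    "sarapannya enak\tpositive\n"
)

rows = read_labeled_tsv(io.StringIO(SAMPLE))
counts = label_counts(rows)
print(counts)
```

With a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample, where `path` is wherever you saved the download.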

Question 2

What's the best dataset for training an Indonesian BERT model?

Accepted Answer

The IndoNLU Benchmark corpus is a strong choice: it is large (about 4B words), built specifically for Indonesian, and comes with pre-trained models available on Hugging Face. OSCAR and CC-100 are alternatives that offer billions of tokens for broader coverage.
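If you would rather start from an existing checkpoint than pre-train from scratch, a minimal sketch with the `transformers` library could look like this. The checkpoint name `indobenchmark/indobert-base-p1` is an assumption about which IndoNLU model you want, so confirm it on the Hugging Face Hub. The helper `embed` is a hypothetical name, and it needs `torch` and `transformers` installed plus network access on first use.

```python
# Assumed checkpoint name; verify it on the Hugging Face Hub before relying on it.
MODEL_ID = "indobenchmark/indobert-base-p1"


def embed(text, model_id=MODEL_ID):
    """Return the encoder's last hidden states for a single piece of text."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic inference

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

Calling `embed("Saya suka belajar bahasa Indonesia.")` downloads the weights the first time and returns one vector per token. For a downstream task you would typically fine-tune the model rather than use these raw states.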

Question 3

Indonesian NLP resources vs scraping my own data: which is better?

Accepted Answer

The list saves time because its datasets are already vetted and cited, but domain-specific needs may still require scraping. Curated resources help with quality and reproducibility, while scraping offers customization at the cost of extra collection effort and validation work.
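To give a sense of the validation work that scraping involves, here is a minimal cleaning pass, assuming the scraped text arrives as a list of raw lines. The `min_words` threshold and the sample lines are illustrative assumptions, not recommendations from the list.

```python
import re


def clean_scraped_lines(lines, min_words=3):
    """Normalize whitespace, drop very short lines, and remove case-insensitive duplicates."""
    seen = set()
    kept = []
    for line in lines:
        text = re.sub(r"\s+", " ", line).strip()
        if len(text.split()) < min_words:
            continue  # likely a menu item, button label, or fragment
        key = text.casefold()
        if key in seen:
            continue  # same sentence already kept, ignoring case
        seen.add(key)
        kept.append(text)
    return kept


# Hypothetical scraped lines, including a duplicate and a navigation label.
raw = [
    "Hotel ini   sangat bersih.",
    "hotel ini sangat bersih.",
    "Beranda",
    "Pelayanan cepat dan ramah sekali.\n",
]
cleaned = clean_scraped_lines(raw)
print(cleaned)
```

Real pipelines usually add language identification and near-duplicate detection on top of this, which is part of the effort a curated dataset spares you.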

Question 4

How can I contribute new datasets to this list?

Accepted Answer

Fork the GitHub repository, add your resource with proper descriptions and citations following the existing format, and submit a pull request. Ensure it fits the categorization, such as under 'Language modeling' or 'Sentiment analysis'.

Question 5

Are there pre-trained models for Indonesian speech recognition?

Accepted Answer

Not in this list. It provides speech datasets only, such as TITML-IDN and CMU Wilderness, so you would need to train your own models on these audio corpora. Check the individual links for access procedures and licensing.

Question 6

What are common licensing issues with these datasets?

Accepted Answer

Licenses vary widely: OSCAR and CC-100 are open, while others, such as TITML-IDN, require permission for academic use. Always review the source links; for example, the README notes that TITML-IDN requires a formal request by email for non-commercial use.
