Question 1

How do I train Wiki2Vec on a newer Wikipedia dump?

Accepted Answer

Use the prepare.sh script with your locale and output folder, but ensure you have the latest dump URL or modify the script. The process involves downloading, cleaning, and tokenizing, but be aware of potential changes in dump formats over time.

Question 2

Wiki2Vec vs spaCy for entity embeddings?

Accepted Answer

Wiki2Vec generates custom Word2Vec vectors with DBpedia IDs from Wikipedia, ideal for knowledge graph integration. spaCy offers pre-trained word vectors but not entity-specific embeddings from structured sources, making Wiki2Vec better for research on entity semantics.

Question 3

Can Wiki2Vec handle entities not in DBpedia?

Accepted Answer

No, it's limited to DBpedia entities derived from Wikipedia articles. The corpus replaces links with DBpedia IDs, so entities outside this scope won't have embeddings without manual extension.

Question 4

What are the memory requirements for running Wiki2Vec with Spark?

Accepted Answer

It depends on dump size; the README suggests using '--executor-memory 1g' in Spark commands, but for large dumps like English Wikipedia, you may need significant RAM and storage, as highlighted in the Java heap settings.

Question 5

Is Wiki2Vec compatible with BERT or transformers?

Accepted Answer

Not directly; Wiki2Vec outputs Word2Vec-style embeddings. You'd need to convert or integrate these into transformer models separately, as it focuses on traditional skip-gram/CBOW training, not contextual embeddings.

Question 6

How to customize stemming or tokenization in Wiki2Vec?

Accepted Answer

Pass 'None' as an argument to disable stemming in scripts or Spark commands, but tokenization is naive per the README. For advanced customization, you'd need to modify the Scala code, which adds complexity.

Wiki2Vec. Getting Word2vec vectors for entities and word from Wikipedia Dumps

What is Wiki2Vec. Getting Word2vec vectors for entities and word from Wikipedia Dumps?

Overview

Use Cases

Best For

Related Projects

Found a gem we're missing?

Not Ideal For

Pros & Cons

Pros

Cons

Frequently Asked Questions