Generate Word2Vec vectors for DBpedia entities from Wikipedia dumps, linking words and topics to structured knowledge.
Wiki2Vec is a toolset for generating Word2Vec vectors from Wikipedia dumps, designed specifically to produce embeddings for DBpedia entities. It transforms Wikipedia articles into a tokenized corpus in which links are replaced with DBpedia IDs, so a single trained model captures semantic relationships between plain words and knowledge-base entities. This enables applications such as entity linking, semantic search, and knowledge-graph-enhanced NLP.
NLP researchers, data scientists, and developers working on projects that require entity-aware word embeddings, such as knowledge graph integration, semantic similarity, or entity disambiguation tasks.
It provides an open-source, reproducible pipeline for creating custom entity embeddings from any Wikipedia dump; unlike static pre-trained models, it lets you choose the language, stemming behavior, and model parameters. Integration with Apache Spark keeps corpus generation scalable for large dumps.
Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby
Transforms Wikipedia links into DBpedia IDs (e.g., 'dbpedia/Barack_Obama'), enabling Word2Vec vectors for both words and knowledge base entities, as shown in the corpus example in the README.
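To make the transformation concrete, here is a minimal sketch of replacing MediaWiki link markup with entity tokens. The regex and the `dbpedia/` token format follow the example given above, not Wiki2Vec's actual implementation, which handles many more cases (templates, redirects, nested markup):

```python
import re

# Matches [[Target]] or [[Target|surface text]] in MediaWiki markup.
LINK = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def linkify(wikitext: str) -> str:
    """Replace each wiki link with a DBpedia entity token plus its surface text,
    so Word2Vec sees the entity ID in the same context window as nearby words.
    The 'dbpedia/' prefix mirrors the example above; the real corpus format may differ."""
    def repl(m: re.Match) -> str:
        target = m.group(1).strip().replace(" ", "_")
        surface = m.group(2) or m.group(1)
        return f"dbpedia/{target} {surface}"
    return LINK.sub(repl, wikitext)

print(linkify("[[Barack Obama|Obama]] was born in [[Hawaii]]."))
# dbpedia/Barack_Obama Obama was born in dbpedia/Hawaii Hawaii.
```

Keeping both the entity token and the surface form in the output means the trained model places `dbpedia/Barack_Obama` near the words that surround its mentions.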
Supports Wikipedia dumps in various languages like English, Spanish, and German with configurable stemming, allowing custom model creation for diverse NLP tasks.
Includes scripts like prepare.sh to download, clean, stem, and tokenize dumps end-to-end, providing a ready corpus for training without proprietary datasets.
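Conceptually, the clean/tokenize stage produces lines of lowercase word tokens with entity IDs preserved verbatim. A hypothetical sketch of that step (the real prepare.sh drives external tools and optional stemming; this code is only illustrative):

```python
import re

def tokenize(line: str) -> list[str]:
    """Sketch of the clean/tokenize stage: lowercase and strip punctuation
    from plain words, but keep entity tokens such as dbpedia/Foo intact
    so they survive as single vocabulary items for Word2Vec training."""
    tokens = []
    for tok in line.split():
        if tok.startswith("dbpedia/"):
            tokens.append(tok)  # preserve entity IDs verbatim
        else:
            tok = re.sub(r"[^\w]", "", tok).lower()  # strip punctuation
            if tok:
                tokens.append(tok)
    return tokens

print(tokenize("dbpedia/Barack_Obama Obama was born in dbpedia/Hawaii Hawaii."))
# ['dbpedia/Barack_Obama', 'obama', 'was', 'born', 'in', 'dbpedia/Hawaii', 'hawaii']
```

The resulting token lists can be fed directly to any Word2Vec trainer, since entity IDs and ordinary words share one vocabulary.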
Leverages Apache Spark for corpus generation, efficiently handling large Wikipedia dumps, as noted in the Spark submission commands for distributed processing.
Pre-trained models are from 2015 and may not reflect current Wikipedia content, requiring manual regeneration for up-to-date embeddings.
The automated scripts install Java, sbt, Spark, and numerous Python dependencies; this process can be error-prone and, as the 'Quick usage' section itself notes, is only supported on Ubuntu 14.04.
The README's ToDo list highlights missing features like handling Wikipedia redirections and intra-article coreference, which can impact embedding accuracy for linked entities.
Designed for batch corpus generation with Spark, not real-time inference, and the README notes performance issues with alternative Word2Vec tools on large corpora.