How to install Somoclu with GPU support on Mac?

The default macOS wheel binaries are not parallelized, so use the conda-forge version or compile from source with an OpenMP-friendly compiler. For the GPU kernel, follow the CUDA setup instructions in the Somoclu Python interface documentation.

Somoclu vs scikit-learn for SOMs: which is better?

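scikit-learn itself does not include a SOM estimator, so the realistic comparison is with lightweight SOM packages that plug into scikit-learn workflows; these are easier to set up for small-scale use. Somoclu excels in parallel performance for large datasets with GPU and cluster support. Choose Somoclu for speed on big data, but expect more setup effort.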

Why do CPU and GPU results differ in Somoclu?

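Somoclu uses single-precision floats for efficiency, which occasionally makes a data instance equidistant from several neurons. The CPU kernel resolves such ties sequentially, while the GPU's parallel reduction kernel cannot guarantee the same order, so the two kernels may produce different maps. This is documented as a computational efficiency trade-off, not a bug.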
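The effect can be illustrated with a small NumPy sketch. It is not Somoclu's kernel code: the helper `argmin_in_order`, the distance values, and the two visiting orders are made up for illustration. It only shows how single precision can turn two distinct distances into an exact tie, after which the winner depends on the order in which candidates are compared.

```python
import numpy as np

def argmin_in_order(distances, order):
    """Return the index of the smallest distance, visiting candidates in `order`.

    A strict `<` keeps whichever tied candidate is seen first, so the visiting
    order decides the winner when two distances are exactly equal.
    """
    best = order[0]
    for idx in order[1:]:
        if distances[idx] < distances[best]:
            best = idx
    return best

# Three neuron distances that are distinct in double precision...
d64 = np.array([4.00000001, 4.00000002, 7.5], dtype=np.float64)
# ...but the first two collapse to the same value in single precision.
d32 = d64.astype(np.float32)

sequential = [0, 1, 2]   # CPU-style scan starting from the lowest index
reordered = [2, 1, 0]    # stand-in for an unspecified parallel reduction order

bmu64_seq = argmin_in_order(d64, sequential)
bmu64_alt = argmin_in_order(d64, reordered)
bmu32_seq = argmin_in_order(d32, sequential)
bmu32_alt = argmin_in_order(d32, reordered)

print("float64 winners:", bmu64_seq, bmu64_alt)  # same neuron either way
print("float32 winners:", bmu32_seq, bmu32_alt)  # tie broken differently
```

In double precision both orders agree on the best matching unit; in single precision the tie makes the result order-dependent, which is the kind of divergence the answer above describes.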

How to visualize SOMs from Somoclu in Python?

Plot the trained codebook and U-matrix with the Python interface's matplotlib-based helpers, such as view_umatrix and view_component_planes, or work with the arrays directly in matplotlib. The maps can also be exported to Databionic ESOM Tools for further visualization.
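A minimal sketch of the matplotlib route, assuming the somoclu and matplotlib packages are installed. The helper names `make_toy_data` and `train_and_plot`, the map size, and the random data are illustrative assumptions; the training call is left commented out so the block runs without either optional package.

```python
import numpy as np

def make_toy_data(n_samples=200, n_dim=3, seed=0):
    """Random single-precision data, used here only as a placeholder dataset."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, n_dim)).astype(np.float32)

def train_and_plot(data, n_columns=30, n_rows=20):
    """Train a SOM on the dense CPU kernel and draw its U-matrix as a heat map."""
    import somoclu                     # imported lazily: optional dependency
    import matplotlib.pyplot as plt    # imported lazily: optional dependency

    som = somoclu.Somoclu(n_columns, n_rows, kerneltype=0)  # 0 = dense CPU kernel
    som.train(data, epochs=10)

    # som.umatrix holds one value per neuron, shaped (n_rows, n_columns).
    plt.imshow(som.umatrix, cmap="viridis")
    plt.colorbar(label="U-matrix height")
    plt.title("U-matrix")
    plt.show()
    return som

data = make_toy_data()
# som = train_and_plot(data)  # uncomment once somoclu and matplotlib are installed
```

After training, `som.codebook` can be plotted the same way, one component plane per input dimension.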

Can Somoclu handle online learning with streaming data?

No, it uses batch training for parallel efficiency, so it's unsuitable for real-time or incremental learning scenarios. It's designed for static, large datasets processed in full batches.
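To make the batch formulation concrete, here is a textbook batch SOM epoch in NumPy. It is a sketch under stated assumptions, not Somoclu's implementation: the function `batch_som_epoch`, the Gaussian neighborhood formula, and the 3x3 toy map are illustrative choices.

```python
import numpy as np

def batch_som_epoch(data, codebook, grid, radius):
    """One batch update: every neuron becomes a neighborhood-weighted mean of the data.

    data:     (n_samples, n_dim) input vectors
    codebook: (n_neurons, n_dim) current neuron weights
    grid:     (n_neurons, 2) map coordinates of each neuron
    radius:   width of the Gaussian neighborhood, in grid units
    """
    # Squared distance from every sample to every neuron, then the BMU per sample.
    dist2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    bmus = dist2.argmin(axis=1)

    # Neighborhood weight between each sample's BMU and every neuron on the map.
    grid_dist2 = ((grid[bmus][:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-grid_dist2 / (2.0 * radius ** 2))  # (n_samples, n_neurons)

    # The whole dataset feeds a single update; there is no per-sample step.
    return (weights.T @ data) / weights.sum(axis=0)[:, None]

# Toy run on a 3x3 map with random 2-D data.
rng = np.random.default_rng(0)
data = rng.random((50, 2))
grid = np.array([(r, c) for r in range(3) for c in range(3)], dtype=float)
codebook = rng.random((9, 2))
for radius in (2.0, 1.0, 0.5):  # shrink the neighborhood over the epochs
    codebook = batch_som_epoch(data, codebook, grid, radius)
```

Because each epoch needs the BMUs of the full dataset before any weight changes, the work splits cleanly across threads or nodes, but a newly arriving sample cannot be folded in without another full pass.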

Is sparse data supported in the R interface of Somoclu?

No, the sparse kernel is only available in the command-line version; the R interface supports only dense CPU and GPU kernels, limiting its use for text mining directly in R.

Open-Awesome

somoclu

MIT · C · 1.7.6

A massively parallel library for training self-organizing maps on multicore CPUs, GPUs, and clusters with support for dense and sparse data.

279 stars · 74 forks · 0 contributors

What is somoclu?

Somoclu is a massively parallel library for training self-organizing maps (SOMs), which are unsupervised neural networks used for clustering, visualization, and dimensionality reduction of high-dimensional data. It solves the problem of slow SOM training by parallelizing computations across multicore CPUs, GPUs, and distributed clusters, supporting both dense and sparse data formats.

Target Audience

Data scientists, researchers, and machine learning practitioners working with large datasets who need efficient SOM training for tasks like exploratory data analysis, feature reduction, or pattern discovery in text mining and other domains.

Value Proposition

Developers choose Somoclu for its exceptional speed and scalability, leveraging OpenMP, CUDA, and MPI to handle large maps and datasets that would be infeasible with sequential implementations. Its multi-language interfaces and support for sparse data make it versatile for various research and production workflows.

Overview

Massively parallel self-organizing maps: accelerate training on multicore CPUs, GPUs, and clusters

Use Cases

Best For

Training large self-organizing maps with hundreds of thousands of neurons
Accelerating SOM computations using GPU hardware
Distributed SOM training across compute clusters with MPI
Processing high-dimensional sparse data from text mining applications
Integrating SOMs into Python, R, Julia, or MATLAB data science pipelines
Visualizing complex datasets with compatible tools like Databionic ESOM

Not Ideal For

Projects requiring built-in visualization without external tool dependencies
Small datasets where parallelization overhead outweighs performance gains
Environments without multicore CPUs, CUDA GPUs, or MPI clusters to accelerate training
Applications needing identical, deterministic results from CPU and GPU executions

Pros & Cons

Pros

Massively Parallel Execution

Exploits OpenMP for multicore CPUs, CUDA for GPU acceleration, and MPI for cluster computing, drastically reducing training time on large-scale datasets.

Multi-Platform and Language Support

Runs on Linux, macOS, and Windows, with interfaces for Python, R, Julia, and MATLAB that integrate easily into diverse data science workflows.

Sparse Data Optimization

Includes a specialized sparse kernel for text mining and other high-dimensional vector spaces, so training stays efficient when the data is mostly zeros.

Scalable to Large Maps

Capable of training maps with hundreds of thousands of neurons, supporting detailed representations of complex datasets.

Cons

Kernel Result Inconsistency

GPU and CPU kernels can produce different maps: single-precision floats occasionally create distance ties, and the GPU's non-sequential reduction may resolve them differently than the CPU. The README acknowledges this as known behavior rather than a bug.

Limited Interface Features

MPI and the sparse kernel are not available through the Python, R, Julia, and MATLAB interfaces, which restricts cluster-scale and sparse-data training to the command-line tool.

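Complex Installation

On macOS, the default wheel binaries are not parallelized, so multicore support requires the conda-forge version or a source build with an OpenMP-friendly compiler, and the GPU kernel needs additional CUDA setup. On Windows, missing DLLs such as vcomp90.dll can cause import errors. Both add setup hurdles, as detailed in the installation notes.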
