A horizontally scalable, highly available, multi-tenant, long-term storage solution for Prometheus and OpenTelemetry Metrics.
Cortex is a horizontally scalable, highly available, long-term storage solution for Prometheus and OpenTelemetry Metrics. It solves the problem of running Prometheus at massive scale by providing multi-tenancy, durable storage, and cluster-wide scalability while maintaining full Prometheus compatibility.
Site reliability engineers (SREs), DevOps teams, and platform engineers who need to operate Prometheus at scale in multi-tenant, cloud-native environments.
Developers choose Cortex because it extends Prometheus to enterprise-scale deployments without breaking compatibility, offering built-in multi-tenancy, horizontal scalability, and long-term storage integration with cloud object stores.
A horizontally scalable, highly available, multi-tenant, long-term Prometheus.
Cortex can run across multiple machines in a cluster, exceeding the throughput and storage limits of a single Prometheus instance, as the README's feature list states.
Cortex isolates data and queries from multiple independent Prometheus sources within a single cluster, so different teams or customers can share centralized monitoring without interfering with one another.
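Isolation of this kind is typically driven by a tenant identifier attached to every request. As a minimal sketch, assuming Cortex's `X-Scope-OrgID` HTTP header and a Prometheus-compatible query endpoint under the default `/prometheus` prefix, the following builds (but does not send) a query scoped to one tenant. The host `cortex.example.internal`, the tenant names, and the helper `build_tenant_query` are hypothetical.

```python
from urllib.parse import urlencode


def build_tenant_query(base_url: str, tenant_id: str, promql: str):
    """Return (url, headers) for an instant query scoped to one tenant.

    Assumptions: the cluster identifies tenants via the X-Scope-OrgID
    header and serves the query API under /prometheus/api/v1/query.
    """
    if not tenant_id:
        raise ValueError("tenant_id must be non-empty")
    query_string = urlencode({"query": promql})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{query_string}"
    headers = {"X-Scope-OrgID": tenant_id}
    return url, headers


# Hypothetical host and tenants: the same PromQL, separated only by the header.
for tenant in ("team-a", "team-b"):
    url, headers = build_tenant_query(
        "http://cortex.example.internal:9009", tenant, "up"
    )
    print(tenant, url, headers)
```

Both requests hit the same URL; only the header differs, which is why a shared cluster can serve several teams without exposing one tenant's series to another.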
Cortex supports S3, GCS, Swift, and Azure as long-term storage backends, providing durable, cost-effective retention for metrics.
When deployed in a cluster, Cortex replicates data between machines to ensure reliability and fault tolerance, making it suitable for production environments.
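The fault tolerance that replication buys can be worked out with simple arithmetic. This is a minimal sketch, assuming quorum-style replication in which a write succeeds once a majority of replicas acknowledge it. The replication factors shown are illustrative, and `write_quorum` and `tolerated_failures` are hypothetical helper names, not part of any Cortex API.

```python
def write_quorum(replication_factor: int) -> int:
    """Smallest majority of replicas that must acknowledge a write."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be unavailable while a majority still exists."""
    return replication_factor - write_quorum(replication_factor)


# Illustrative replication factors (assumed values, not taken from the source).
for rf in (1, 3, 5):
    print(f"rf={rf} quorum={write_quorum(rf)} tolerated={tolerated_failures(rf)}")
```

Under this assumption, three replicas survive the loss of one machine and five survive the loss of two, while a single replica offers no tolerance at all.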
As a distributed system, Cortex requires significant setup and management effort: there are multiple components to configure and monitor, as its detailed architecture and configuration documentation makes clear.
Long-term storage depends on a supported object store (S3, GCS, Swift, or Azure), which can tie users to those platforms and adds a dependency on external services.
The distributed architecture can add query latency compared with a local Prometheus, which may affect real-time monitoring scenarios.