Minor nits to agent blog post. (#2064)
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
commit ad095b8a67 (parent 1e9dfa57c4), committed via GitHub
@@ -13,7 +13,7 @@ What I personally love in the Prometheus project, and one of the many reasons wh
* Most people nowadays expect cloud-native software to have an HTTP/HTTPS `/metrics` endpoint that Prometheus can scrape. A concept developed in secret within Google and pioneered globally by the Prometheus project.
* The observability paradigm shifted. We see SREs and developers rely heavily on metrics from day one, which improves software resiliency, debuggability, and data-driven decisions!
-In the end, we hardly see Kubernetes clusters without Prometheus running there. Partially because Kubernetes chose to only support Prometheus.
+In the end, we hardly see Kubernetes clusters without Prometheus running there.
The strong focus of the Prometheus community allowed other open-source projects to grow too to extend the Prometheus deployment model beyond single nodes (e.g. [Cortex](https://cortexmetrics.io/), [Thanos](https://thanos.io/) and more). Not mentioning cloud vendors adopting Prometheus' API and data model (e.g. [Amazon Managed Prometheus](https://aws.amazon.com/prometheus/), [Google Cloud Managed Prometheus](https://cloud.google.com/stackdriver/docs/managed-prometheus), [Grafana Cloud](https://grafana.com/products/cloud/) and more). If you are looking for a single reason why the Prometheus project is so successful, it is this: **Focusing the monitoring community on what matters**.
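To make the `/metrics` convention mentioned above concrete, here is a minimal sketch of such an endpoint in Go using the official [client_golang](https://github.com/prometheus/client_golang) library; the application and the metric name (`myapp_handled_requests_total`) are made up for illustration:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// handledRequests is a hypothetical counter; any application metric is exposed the same way.
var handledRequests = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_handled_requests_total",
	Help: "Total number of requests handled by the application.",
})

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		handledRequests.Inc()
		w.Write([]byte("hello"))
	})

	// Expose the current value of all registered metrics in the Prometheus/OpenMetrics
	// text format; this is the endpoint a Prometheus server scrapes.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

Visiting `http://localhost:8080/metrics` then returns a plain-text page with the current value of `myapp_handled_requests_total` (typically plus the default Go runtime metrics), which is exactly what a Prometheus server pulls on every scrape.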
@@ -21,11 +21,11 @@ In this (lengthy) blog post, I would love to introduce a new operational mode of
## History of the Forwarding Use Case
-The core design of Prometheus has been unchanged for the project's entire lifetime. Inspired by [Google's Borgmon monitoring system](https://sre.google/sre-book/practical-alerting/#the-rise-of-borgmon), you can deploy a Prometheus server alongside the applications you want to monitor, tell Prometheus how to reach them, and allow Prometheus to scrape the current values of their metrics at regular intervals. Such a collection method, which is often referred to as the "pull model", is the core principle that [allows Prometheus to be lightweight and reliable](https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/). Furthermore, it enables application instrumentation and exporters to be dead simple, as they only need to provide a simple human-readable HTTP endpoint with the current value of all tracked metrics (in OpenMetrics format), without complex push infrastructure and non-trivial client libraries. Overall, a simplified typical Prometheus monitoring deployment looks as below:
+The core design of Prometheus has been unchanged for the project's entire lifetime. Inspired by [Google's Borgmon monitoring system](https://sre.google/sre-book/practical-alerting/#the-rise-of-borgmon), you can deploy a Prometheus server alongside the applications you want to monitor, tell Prometheus how to reach them, and allow to scrape the current values of their metrics at regular intervals. Such a collection method, which is often referred to as the "pull model", is the core principle that [allows Prometheus to be lightweight and reliable](https://prometheus.io/blog/2016/07/23/pull-does-not-scale-or-does-it/). Furthermore, it enables application instrumentation and exporters to be dead simple, as they only need to provide a simple human-readable HTTP endpoint with the current value of all tracked metrics (in OpenMetrics format). All without complex push infrastructure and non-trivial client libraries. Overall, a simplified typical Prometheus monitoring deployment looks as below:

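Stripped of service discovery, relabelling, and local storage, the pull model shown in the diagram boils down to a loop that periodically fetches each target's `/metrics` endpoint. The following Go sketch only illustrates that shape and is not Prometheus code; the hard-coded target list and 15s interval stand in for what `scrape_configs` and `scrape_interval` normally define:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// targets would normally come from static configuration or service discovery.
var targets = []string{"http://localhost:8080/metrics"}

func main() {
	// Prometheus scrapes on a configurable interval (scrape_interval); 15s is a common choice.
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		for _, target := range targets {
			scrape(target)
		}
	}
}

// scrape pulls the current value of all metrics exposed by one target.
func scrape(target string) {
	resp, err := http.Get(target)
	if err != nil {
		// The scraper, not the target, decides the target is unhealthy
		// (roughly what Prometheus records as up == 0).
		fmt.Printf("target %s is down: %v\n", target, err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("scraped %d bytes from %s at %s\n", len(body), target, time.Now().Format(time.RFC3339))
}
```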
-This works great, and we have seen millions of successful deployments like this over the years that process dozens of millions of active series. Some of them for longer time retention, like two years so. All allow to query, alert, and record metrics useful for both cluster admins and developers.
+This works great, and we have seen millions of successful deployments like this over the years that process dozens of millions of active series. Some of them for longer time retention, like two years or so. All allow to query, alert, and record metrics useful for both cluster admins and developers.
However, the cloud-native world is constantly growing and evolving. With the growth of managed Kubernetes solutions and clusters created on-demand within seconds, we are now finally able to treat clusters as "cattle", not as "pets" (in other words, we care less about individual instances of those). In some cases, solutions do not even have the cluster notion anymore, e.g. [kcp](https://github.com/kcp-dev/kcp), [Fargate](https://aws.amazon.com/fargate/) and other platforms.
@@ -37,8 +37,9 @@ What does that mean? That means monitoring data has to be somehow aggregated, pr
Naively, we could think about implementing this by either putting Prometheus on that global level and scraping metrics across remote networks or pushing metrics directly from the application to the central location for monitoring purposes. Let me explain why both are generally *very* bad ideas:
-* Scraping across network boundaries can be a challenge if it adds new unknowns in a monitoring pipeline. The local pull model allows Prometheus to know why exactly the metric target has problems and when. Maybe it's down, misconfigured, restarted, too slow to give us metrics (e.g. CPU saturated), not discoverable by service discovery, we don't have credentials to access or just DNS, network, or the whole cluster is down. By putting our scraper outside of the network, we risk losing some of this information by introducing unreliability into scrapes that is unrelated to an individual target. On top of that, we risk losing important visibility completely if the network is temporarily down. Please don't do it. It's not worth it. (:
-* Pushing metrics directly from the application to some central location is equally bad. Especially when you monitor a larger fleet, you know literally nothing when you don't see metrics from remote applications. Is the application down? Is my receiver pipeline down? Maybe the application failed to authorize? Maybe it failed to get the IP address of my remote cluster? Maybe it's too slow? Maybe the network is down? Worse, you may not even know that the data from some application targets is missing. And you don't even gain a lot as you need to track the state and status of everything that should be sending data. Such a design needs careful analysis as it can be a recipe for a failure too easily.
+🔥 Scraping across network boundaries can be a challenge if it adds new unknowns in a monitoring pipeline. The local pull model allows Prometheus to know why exactly the metric target has problems and when. Maybe it's down, misconfigured, restarted, too slow to give us metrics (e.g. CPU saturated), not discoverable by service discovery, we don't have credentials to access or just DNS, network, or the whole cluster is down. By putting our scraper outside of the network, we risk losing some of this information by introducing unreliability into scrapes that is unrelated to an individual target. On top of that, we risk losing important visibility completely if the network is temporarily down. Please don't do it. It's not worth it. (:
+
+🔥 Pushing metrics directly from the application to some central location is equally bad. Especially when you monitor a larger fleet, you know literally nothing when you don't see metrics from remote applications. Is the application down? Is my receiver pipeline down? Maybe the application failed to authorize? Maybe it failed to get the IP address of my remote cluster? Maybe it's too slow? Maybe the network is down? Worse, you may not even know that the data from some application targets is missing. And you don't even gain a lot as you need to track the state and status of everything that should be sending data. Such a design needs careful analysis as it can be a recipe for a failure too easily.
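To make the first point above concrete: because the scraper sits next to the target, a failed scrape comes with a reason, which Prometheus surfaces through the synthetic `up` metric and the scrape error shown on its targets page. The rough Go sketch below illustrates that idea only; it is not Prometheus code, and the error classification is simplified for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"
)

// checkTarget shows why a scraper close to the target can tell *why* metrics
// are missing, instead of merely noticing their absence.
func checkTarget(target string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		return "invalid target URL: " + err.Error()
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		var dnsErr *net.DNSError
		switch {
		case errors.As(err, &dnsErr):
			return "service discovery/DNS problem: " + dnsErr.Error()
		case errors.Is(err, context.DeadlineExceeded):
			return "target too slow to respond (e.g. CPU saturated)"
		default:
			return "target unreachable (network down, connection refused, ...): " + err.Error()
		}
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusUnauthorized || resp.StatusCode == http.StatusForbidden {
		return "missing or wrong credentials"
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Sprintf("target misbehaving: HTTP %d", resp.StatusCode)
	}
	return "target up"
}

func main() {
	target := "http://localhost:8080/metrics"
	if len(os.Args) > 1 {
		target = os.Args[1]
	}
	fmt.Println(checkTarget(target))
}
```

A central push-receiving pipeline cannot make these distinctions: every one of those failure modes looks the same from its side, namely "no data arrived".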
> NOTE: Serverless functions and short-living containers are often cases where we think about push from application as the rescue. At this point however we talk about events or pieces of metrics we might want to aggregate to longer living time series. This topic is discussed [here](https://groups.google.com/g/prometheus-developers/c/FPe0LsTfo2E/m/yS7up2YzAwAJ), feel free to contribute and help us support those cases better!