Why is container memory usage doubled in cAdvisor metrics?

That’s because cAdvisor takes these values from cgroups. The cgroup hierarchy is structured as a tree: there is a branch for each pod, and every pod has child cgroups for each container in it. This is how it looks (systemd-cgls):

├─kubepods
│ ├─podb0c98680-4c6d-4788-95ef-0ea8b43121d4
│ │ ├─799e2d3f0afe0e43d8657a245fe1e97edfdcdd00a10f8a57277d310a7ecf4364
│ │ │ └─5479 /bin/node_exporter --path.rootfs=/host --web.listen-address=0.0.0.0:9100
│ │ └─09ce1040f746fb497d5398ad0b2fabed1e4b55cde7fb30202373e26537ac750a
│ │   └─5054 /pause

The resource value for each cgroup is cumulative over all its children. That’s how you ended up with doubled memory utilization: you summed the total pod consumption together with the consumption of each container in it.
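You can see this directly in the cgroup filesystem. A minimal sketch, assuming cgroup v1 with the memory controller mounted under /sys/fs/cgroup/memory; the exact paths depend on your cgroup version and driver, and the pod UID below is taken from the tree above purely for illustration:

POD=/sys/fs/cgroup/memory/kubepods/podb0c98680-4c6d-4788-95ef-0ea8b43121d4

# Usage of the parent (pod-level) cgroup -- the series without a container label
cat "$POD/memory.usage_in_bytes"

# Usage of each child cgroup (one per container, including the pause container)
for c in "$POD"/*/; do
  echo "$c: $(cat "$c/memory.usage_in_bytes")"
done

The parent value is the hierarchical total of its children, which is why summing the pod-level series with the per-container series double-counts the memory.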

If you run the query in Prometheus, you will notice the duplicated values:

{pod="cluster-autoscaler-58b9c77456-krl5m"} 59076608
{container="POD",pod="cluster-autoscaler-58b9c77456-krl5m"} 708608
{container="cluster-autoscaler",pod="cluster-autoscaler-58b9c77456-krl5m"}  58368000

The first one is the parent cgroup. As you can see, it has no container label. The other two in this example are the pause container and the actual application. Adding their values together gives the value of the parent cgroup:

>>> 708608 + 58368000 == 59076608
True

There are multiple ways to fix the problem. For example, you can exclude metrics without a container name by using the container!="" label filter.
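For instance, assuming the metric in question is container_memory_usage_bytes, a query like the following keeps only the per-container series (add container!="POD" as well if you also want to exclude the pause container):

# Per-container memory usage only; the pod-level cumulative series is filtered out
container_memory_usage_bytes{container!=""}

# Per-pod totals without double counting
sum by (pod) (container_memory_usage_bytes{container!=""})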

Another (more involved) way to solve this is to drop the cumulative metrics via metric_relabel_configs in prometheus.yml, i.e. write a relabeling rule that drops metrics without a container name. Be careful with this one: if you apply it to every job, you may accidentally drop all non-cAdvisor metrics, since those have no container label either.
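A sketch of such a rule, assuming a cAdvisor scrape job named kubernetes-cadvisor and that anchoring on the container_ metric name prefix is enough to limit the rule to cAdvisor series:

scrape_configs:
  - job_name: kubernetes-cadvisor
    # ... kubernetes_sd_configs, scheme, tls_config, etc.
    metric_relabel_configs:
      # Drop container_* series whose container label is empty, i.e. the
      # pod-level cumulative ones. Metrics that don't start with container_
      # (machine_*, exporter metrics, etc.) are not matched and stay untouched.
      - source_labels: [__name__, container]
        regex: 'container_.+;'
        action: drop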
