
Great investigation work and a very educational post.

Thanks for sharing!!!

Ory Band • 7 years ago

Yet another great post by Brendan Gregg. Keep up the good work, educating developers worldwide.

Jon Currey • 7 years ago

Great stuff. Thank you!
Some people might enjoy Neil Gunther's follow-up articles, where he digs in deeper and relates the exponential decay approach to queuing theory (https://www.teamquest.com/f... and analog circuits (https://www.teamquest.com/f.... The Multics paper acknowledges taking the lightweight method for computing the exponential average from signal processing. Specifically, 1950s work from Bell Labs, co-authored by JW Tukey, who also gave the world the Fast Fourier Transform and boxplots.

hmijail • 7 years ago

Interesting links. Note that they have an extra closing paren, which causes the links to fail!

Steve Mushero • 7 years ago

I'm confused by the latency you mentioned. I recently did a lot of research as part of articles I wrote on Golden Signals / tools (runqstat), and I'm very confused by the CPU run queue latency in /proc/schedstat - it does not seem to really measure what you describe.

You wrote, and I expected, a sum of runnable processes' time spent waiting for a CPU, but in my measurements it vastly exceeds this number even on very lightly loaded systems.

Instead it seems to measure tasks in ANY non-running state, i.e. how long since a task has run, even if it has no work to do. It appears to be dozens of seconds for every second it's sampled (and of course subtracted from the prior count value). It simply cannot be higher than 1 second per second on a non-saturated system, but it was dozens or hundreds in my tests (I can easily be doing something wrong, of course; and this doesn't really seem to be in jiffies, either, or the results are even higher).

It appears this is used as part of the scheduling criteria to avoid starvation in the CFS scheduler, i.e. the sum across all tasks not running, so if there are 100 non-running tasks/processes, this will go up by about 100 seconds per second.

I hope I'm missing something, as I'd love to find a way to measure this, ideally in /proc and without BPF.
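
For reference, here's the kind of sampling I was doing - a minimal sketch of my own (not from the article), assuming the schedstat version 15 layout where the 8th value after the cpuN label is the cumulative time tasks spent waiting on a run queue, in nanoseconds on recent kernels; field positions vary across schedstat versions, so check yours first:

```python
# Rough sketch: sample the per-CPU run queue wait time from /proc/schedstat
# and print the delta per second. Assumes schedstat version 15, where the
# 8th value after "cpuN" is the cumulative time tasks spent waiting to run
# (nanoseconds on recent kernels); verify against your kernel's docs.
import time

def rq_wait_ns():
    total = 0
    with open("/proc/schedstat") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0].startswith("cpu"):
                total += int(fields[8])   # cumulative run queue wait time
    return total

prev = rq_wait_ns()
while True:
    time.sleep(1)
    cur = rq_wait_ns()
    print("run queue wait: %.3f s/s" % ((cur - prev) / 1e9))
    prev = cur
```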

finnich • 7 years ago

Thanks for adding something new to my knowledge...really loved the post.

Hayden James • 7 years ago

Your level of detail, while yet being able to deliver uncomplicated guidance on often misunderstood topics, is simply amazing. Thanks!

Thomas Phifer • 7 years ago

This is good stuff. Thanks.

Noah Friedman • 7 years ago

Load average has always included processes in iowait, at least in BSD. Both the 4.3BSD and 4.4BSD books by McKusick et al were explicit about this.

The addition of other uninterruptible conditions may be new, but not disk io.

ZhangChao • 7 years ago

Really nice article, thanks for sharing! Now I understand what the load average is; it really used to confuse me!
I have a question:
From my earlier graphs, at the one minute mark, 62% of the metric was from that minute, and the rest was older. So 0.62 x 1.15 + 0.38 x 1.25 = 1.18. That's pretty close to the 1.19 reported.

Which are your earlier graphs?
What do the 0.62 and 0.38 mean? The time?

oh, I get it:
The so-called "one minute average" only reaches about 0.62 by the one minute mark.
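
For anyone else puzzling over it, here's a small sketch of the idealized exponentially damped average (not the kernel's fixed-point code, which is why the post's figure is ~0.62 rather than exactly 0.63), showing where that weight at the one minute mark comes from:

```python
# Sketch of the exponentially damped moving average behind the "1 minute"
# load average: every 5 seconds the instantaneous count of active tasks (n)
# is folded in with weight (1 - e^(-5/60)).
import math

def damped_average(samples, period_s=60.0, tick_s=5.0):
    decay = math.exp(-tick_s / period_s)
    load = 0.0
    for n in samples:
        load = load * decay + n * (1.0 - decay)
    return load

# Start at load 0 and hold 1 active task for one minute (12 five-second ticks):
print(damped_average([1.0] * 12))    # ~0.63, not 1.0
print(1 - math.exp(-60.0 / 60.0))    # 0.632..., the weight of the most recent minute
```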

vb2 • 7 years ago

I'm impressed by the persistence you displayed in tracking down the root cause. It is a skill that is lacking so much in IT. Thank you for educating us.
Regarding better metrics for CPU run queue length, you mention the vmstat r column. There are different interpretations of the meaning of this column, with man vmstat stating "r: The number of processes waiting for run time." while others state something like "cpu + wait on cpu".
If man is correct, then r > 0 implies CPU saturation. Is that interpretation correct?
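
For reference, here's a quick sketch of sampling what I believe vmstat reads for r and b (procs_running and procs_blocked in /proc/stat); if I have that right, procs_running counts tasks that are on-CPU as well as those waiting for one:

```python
# Sketch: sample procs_running / procs_blocked from /proc/stat once a second.
# procs_running includes currently-running tasks, so (if that's correct)
# saturation would show as r staying above the CPU count, not just r > 0.
import os, time

def proc_stat_counts():
    running = blocked = 0
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("procs_running"):
                running = int(line.split()[1])
            elif line.startswith("procs_blocked"):
                blocked = int(line.split()[1])
    return running, blocked

ncpu = os.cpu_count()
while True:
    r, b = proc_stat_counts()
    print("r=%d b=%d ncpu=%d" % (r, b, ncpu))
    time.sleep(1)
```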

Rituraj • 7 years ago

Great work on this. While I agree with Matthias' POV, i.e.:

quote - The point of "load average" [in Linux] is to arrive at a number relating how busy the system is from a human point of view. - unquote

I must add that it's more like "Waiting for Godot" (ref: uninterruptible), which basically IMO is not a positive and practical way to live a 'human' life. :)

Vijay Ram • 7 years ago

Another great post

Matt Fleming • 7 years ago

Fantastic article Brendan. You kept digging through the kernel source history long after most of us would have given up! This is a great detective story.

Sergey Senozhatsky • 7 years ago

Yeah, one of the reasons why I want to rework the `console_sem' semaphore is that TASK_UNINTERRUPTIBLE thing, which also adds to the LA.

-ss

Hagai Kariti • 7 years ago

Great post!

Shouldn't the time tar spent reading from disk be the one reported by iostat's r_await metric? Or at least dominated by it? Looking at the output, it seems that r_await*reads is a really small amount of time.

Jason Cai • 7 years ago

It's so informative and useful! I've investigated Linux's "load average" before, but I was never able to figure out the mystery of TASK_UNINTERRUPTIBLE.
Thank you so much!

Stefan Parvu • 7 years ago

Nice blog entry.

Some comments: LA is not really an "industry-critical metric". People use it just because it has been there for decades, and it's better to have it around anyway. In fact it is a very confusing metric for system performance measurement and sizing. Run queue length and other metrics are much more useful and practical, with real value.

At Kronometrix we are researching and developing new ways to add useful metrics to our data recorders, which can help us better understand systems at the overall or component level.

https://github.com/kronomet...

Bhaskar Chowdhury • 7 years ago

Magnificent digging, Brendan! Learnt a lot, and thanks a bunch for the pains you took to make it count.

Valentine Sinitsyn • 7 years ago

Great article, thanks!

It's a shame to say, but I don't fully get the math behind tar's LA decomposition:
* 0.32 from pidstat seems evident (%CPU is 32.3)
* 0.69 is supposedly 0.8576 * 0.8139 (tar share * vfs_read share). But why does one need to multiply these shares? These values are global, not relative to the parent.
* 0.04 is hard to get as there is no mpstat in the preceding figure, but there is the formula in the text.
* 0.11: the two leftmost towers on the off-cpu flame graph sum up to 0.14, not 0.11 - where does that number come from?

Many thanks

brendangregg • 7 years ago

tar: the off-CPU flame graph has 41,464 ms, and that's a sum over a 60 second trace. Normalizing that to 1 second = 41.464 / 60 = 0.69. The pidstat output has tar taking 32.62% average CPU (not a sum), and I know all its off-CPU time is in uninterruptible (by generating off-CPU graphs for the other states), so I can infer that 67.38% of its time is in uninterruptible. 0.67. I used that number instead, as the pidstat interval closely matched the other tools I was running.

The kworker threads have 3,684 and 3,102 ms. Normalizing to 1 second is 6.786 / 60 = 0.11.
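
Spelled out as a quick sketch (numbers from the outputs above):

```python
# The normalization above: off-CPU sums from a 60 second trace, divided by
# the trace duration to express them as tasks' worth of load.
trace_s = 60.0

tar_offcpu_ms = 41464.0               # tar's uninterruptible off-CPU time
kworker_offcpu_ms = 3684.0 + 3102.0   # the two kworker threads

print(tar_offcpu_ms / 1000.0 / trace_s)       # ~0.69
print(kworker_offcpu_ms / 1000.0 / trace_s)   # ~0.11
```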

fuyou • 7 years ago

Where does the 0.04 come from? Thank you very much!

brendangregg • 7 years ago

by mpstat I meant iostat, sorry (I updated the text), but it's the same CPU summary. It's 0.54% + 4.03% for user + sys. That's 4.57% average across 8 CPUs, and 4.57 x 8 = 36.56% in terms of one CPU. pidstat says that tar consumed 32.62%, so the remainder is 36.56% - 32.62% = 3.94% of one CPU, which was used by things that weren't tar (other processes). That's the 0.04 added to load average.
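
Or as a quick sketch:

```python
# The 0.04 above: system-wide CPU time (from iostat) minus tar's share
# (from pidstat), expressed in terms of one CPU.
ncpu = 8
user_pct, sys_pct = 0.54, 4.03                 # iostat %user + %sys, averaged across CPUs

total_one_cpu = (user_pct + sys_pct) * ncpu    # 36.56% of one CPU
tar_one_cpu = 32.62                            # pidstat %CPU for tar
other = (total_one_cpu - tar_one_cpu) / 100.0  # ~0.04 added to load average
print(other)
```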

Valentine Sinitsyn • 7 years ago

Thanks a lot, Brendan. It's crystal clear now. Shame on me again :)

Les Stroud • 7 years ago

Don't forget that, in a virtualized world, load average can also represent supply. The time to get on CPU, underlying I/O bus contention, etc. can all show up as higher than normal loads.

I ran into this, more than once, when server load averages were changing over time with no change to the task executions themselves.

Vince Poore • 7 years ago

I think I might have run into this. Are there any articles or blogs you are aware of that would show the methodology to pin the problem to a hypervisor supply issue?

Matt Saladna • 7 years ago

That supply/demand response is conveyed through steal%, so long as the kernel supports it as a compile-time option. "iostat" will tell you what's going on, but unless you have 2 identical workloads on two different nodes, steal% is nothing conclusive apart from "hey, x% of the workload isn't getting serviced by the hypervisor", and not necessarily a sign of oversubscription.
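
If it helps, a rough sketch of where that number comes from, assuming a kernel built with steal-time accounting (steal is the 8th value on the aggregate cpu line in /proc/stat, which is what iostat/mpstat summarize):

```python
# Sketch: compute steal% over a 1 second interval from the aggregate "cpu"
# line in /proc/stat. The steal counter only exists on kernels with
# paravirt steal-time accounting enabled.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()   # cpu user nice system idle iowait irq softirq steal ...
    values = list(map(int, fields[1:]))
    steal = values[7] if len(values) > 7 else 0
    return steal, sum(values)

s0, t0 = cpu_times()
time.sleep(1)
s1, t1 = cpu_times()
print("steal: %.2f%%" % (100.0 * (s1 - s0) / max(t1 - t0, 1)))
```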

Michael Pinnegar • 7 years ago

Really nice, very deep article!

MPW • 7 years ago

Thanks for this awesome article and the dive through Linux's history! Very interesting and fun to read.

AllenB • 7 years ago

Another great post, Brendan! I've always "known" that Linux load average included "stuff waiting on IO" but never had the details before. Hoorah!

David VomLehn • 7 years ago

Uninterruptible waits predate Linux. In UNIX they were used when waiting for a disk block to be read. Since the disk read was something that happened really fast (well, it seemed like it at the time), the process would start again very soon and so should still be considered runnable. In the Linux era, people didn't necessarily think about the consequences of unkillable processes and used uninterruptible waits as an easy way to avoid accommodating signals in their implementations. This skewed the load average as well as causing reboots to get rid of unkillable processes. Over the past year or so, I think Linus Torvalds has become a bit more aware of the issue, but there is a lot of education needed.

In terms of actual programming, you should only use an uninterruptible wait if you *know* an operation will complete quickly. Otherwise, handle the signal in an interruptible wait or use a timeout.

MM • 7 years ago

Fascinating article. Thank you for sharing.

I noticed that something seems awry with the terminal output in "Decomposing Linux load averages". The 'termc' output for the second `uptime` command shows an earlier time of day but a longer uptime than the first command. Not critical but maybe some of the other output got mangled somehow also?

brendangregg • 7 years ago

Thanks, fixed, was supposed to be 22:17:14 -- I'd manually flipped them to 10:17:14 to match the timezone from the other command, then changed my mind and flipped them back (with a typo).

John Piers Cilliers • 7 years ago

Very nice! Informative to say the least!