As a freelance Linux systems administrator, I see a lot of odd problems come up. Some I can easily solve using just the things I know (making it appear to be magic), and others require a lot of sleuthing work, which is the part I usually enjoy!
Recently one of my client’s Linux servers was performing poorly – very slow webpage load times, and a load average that had crept to over 20.00! There was nothing obvious slowing it down; CPU use was nominal, there weren’t too too many processes, and memory usage wasn’t unreasonable. However, this wasn’t my server (it was a dedicated server from a hosting company) and their custom Linux kernel did not have some key components I usually use when diagnosing a problem (namely per process IO accounting). It took some clever detective work to figure it out: I eventually traced the problem to a WordPress plugin being used by someone else on the server. I thought I would document my detective process, in case any other systems administrators run into similar problems.
There weren’t any extra processes running, apache seemed normal enough, no DOS attacks were present according to the log files, no SYN attacks, nor even slow loris that I could determine. Plus, there was a significant load present–that shouldn’t happen with SYN or slow loris attacks. There was enough load that apache eventually stopped responding to requests, and required a restart to serve pages again. After the restart, the load average dropped drastically for a few minutes, but then started to climb again.
One of the problems I had seen with this server in the past was apache running out of IPCs, so during one attempted restart, apache didn’t start with error:
No space left on device: Cannot create SSLMutex
To clean up the leftover IPCs, I ran:
ipcs -s | grep apache | awk '{ print $2 }' | xargs -n 1 ipcrm sem
but unfortunately that didn’t help.
I looked at the already running muninmonitoring agent to see that spikes started the day previously at about noon. I saw that in addition to a slightly higher number of apache processes than normal, there were a great many more interrupts, and, ta da!: IO load.
Next came trying to find where the IO load was coming from.
The custom kernel that the hosting company provided did not have the necessary per-process IO accounting support compiled into their kernel, so running the really handy iotop wasn’t an option.
I found a few posts detailing alternatives; the most promising was having the kernel log all the IO work it does to the dmesg log. I needed to first clear out the dmesg log:
dmesg -c > /dev/null
Then I enabled the IO logging and recorded all the logs to a file (as dmesg only has a limited size):
echo 1 > /proc/sys/vm/block_dump
while true; do sleep 1; dmesg -c >> /var/tmp/diskio; done
After letting the kernel log all the IOs for a few minutes, I disabled the IO reporting
echo 0 > /proc/sys/vm/block_dump
and sifted through the findings:
cat /var/tmp/diskio | \
awk -F"[() \t]" '/(READ|WRITE|dirtied)/ {activity[$1]++} END {for (x in activity) print x, activity[x]}'| \
sort -nr -k2
After some initially misleading results pointing at the too little memory (the processes and xfsbufd being responsible for all the IO) that were causing the problems (leading me to tune the swappiness (how much the system aggressively tries to cache files versus keeping programs in memory) of the system to try and lower the memory usage for cached files so things wouldn’t be paged out) I found that it was apache that was responsible for the extra IO.
That’s great, except without per process IO accounting, I still couldn’t follow in real time where the problem was.
Eventually, I stumbled upon htop, which I found while searching real-time IO monitoring. Because of the aforementioned problem with the kernel, I could not use that part of it, but it does have very convenient strace capabilities built in. When I ran into an httpd process that was with a particularly high CPU usage, I would quickly select it & run strace on it (by pressing ‘s’). By doing this often enough, I was able to watch what files the process was opening and thus find one particular web client that generated a much larger number of file open calls than other clients. After a quick look at the logs & backup files, I found that he had recently installed a couple new WordPress plugins, including W3 Total Cache. I have used this plugin to great success several times before, but the authors of W3 Total Cache do say that you need to be careful when configuring their plugin! In this case, the CDN portion of it was not configured correctly and it was constantly checking every single file in his large site and that was the cause of the huge IO load.