The other night I was on the computer later than usual, late enough that my WHS backup started. As what I was doing was low-impact, I decided to let the backup run instead of postponing it. After a couple hours, I noticed that not only was the backup not done, but the progress bar was crawling along far more slowly than I remembered it being originally. It was time to investigate.
As a first step, I remoted into the server and pulled up taskmgr. On the Processes tab, no process (not even whsbackup.exe, which is usually worked heavily during normal WHS activities) appeared to be using more than a few percent of the CPU at a time, but the aggregate CPU Usage listed at the bottom appeared pegged at 50%. As the server is dual-core, that meant an entire core was being sucked up by something not listed under Processes.
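The arithmetic behind that conclusion can be sketched in a few lines of Python. This is only an illustration: the function name and the per-process readings are made up, and Task Manager's aggregate figure is assumed to be a percentage of all cores combined.

```python
def unattributed_cores(total_cpu_percent, per_process_percent, core_count):
    """Estimate how many cores' worth of CPU time no listed process accounts for.

    total_cpu_percent is the aggregate figure (0-100 across all cores);
    per_process_percent holds the per-process readings on the same scale.
    """
    attributed = sum(per_process_percent)
    # Clamp at zero in case sampling noise makes the parts exceed the total.
    unattributed_percent = max(0.0, total_cpu_percent - attributed)
    return unattributed_percent / 100.0 * core_count


# Hypothetical readings: three processes at about 1% each, 50% aggregate, 2 cores.
missing = unattributed_cores(50.0, [1.0, 1.0, 1.0], 2)
print(f"Roughly {missing:.2f} cores are busy with work no process owns")
```

With numbers like these, nearly a full core is unaccounted for, which is the pattern that points toward interrupt handling rather than any user-mode process.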
Experience suggested this was almost certainly due to hardware interrupts, but Task Manager doesn’t show CPU usage due to hardware interrupts explicitly. Process Explorer, however, does, and downloading and running ProcExp confirmed the abnormal hardware interrupts.
At this point I could have tracked down the precise source of the interrupts with a tool like kernrate. (Adi Oltean provides a good guide on how to use it.) However, I decided to use a little intuitive problem solving first. Hardware interrupt issues tend to be caused by two things:
- Broken drivers
- Hardware failures
While the drivers haven’t been updated in a while on that server, and may have bugs that have been fixed in newer versions, my usage patterns with the server have been pretty static. It was unlikely that I had only recently uncovered a bug in the driver if I hadn’t seen it up to now. Further, with the number of hard drives in the box, I wouldn’t have been surprised if one of them were failing.
I checked the System event log, but found no recent errors that would suggest a failing disk (e.g. atapi errors). I did find errors from a few weeks ago that correlated with when I thought one of my drives was failing; I had been able to correct that by reseating the loose SATA cable, and no errors appeared after that point.
Might my problem still be related to that, though? Windows Home Server, because it’s based on Windows Server 2003, generally doesn’t allow one to run in AHCI mode for the disks; I had my BIOS configured to run the disk controllers in IDE mode. The Windows IDE driver (atapi.sys) automatically adjusts the transfer mode for a channel when it runs into errors talking to the disk on that channel, using slower and slower DMA modes until it gives up and switches to PIO mode. PIO mode requires a lot of hardware interrupts to transfer data; that’s why DMA mode was introduced in the first place.
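The step-down behavior described above can be modeled with a small Python sketch. This is a simplified toy, not the driver's actual logic: the class and method names are hypothetical, the list of modes is abbreviated to Ultra DMA plus a single PIO fallback, and the error threshold of six per step is an assumption chosen for illustration.

```python
# Ordered fastest to slowest; the last entry is the PIO fallback.
TRANSFER_MODES = [f"Ultra DMA Mode {n}" for n in range(6, -1, -1)] + ["PIO Mode"]

# Assumed number of errors tolerated before stepping down one mode.
ERROR_THRESHOLD = 6


class IdeChannel:
    """Toy model of a channel whose transfer mode only ever ratchets downward."""

    def __init__(self):
        self.mode_index = 0
        self.error_count = 0

    @property
    def mode(self):
        return TRANSFER_MODES[self.mode_index]

    def record_error(self):
        """Count a timeout/CRC error and step down once the threshold is hit."""
        self.error_count += 1
        at_slowest = self.mode_index == len(TRANSFER_MODES) - 1
        if self.error_count >= ERROR_THRESHOLD and not at_slowest:
            self.mode_index += 1
            self.error_count = 0

    def record_success(self):
        """Successful transfers do NOT raise the mode again in this model."""

    def redetect(self):
        """Model uninstalling and redetecting the channel: start from the top."""
        self.mode_index = 0
        self.error_count = 0


channel = IdeChannel()
for _ in range(ERROR_THRESHOLD * (len(TRANSFER_MODES) - 1)):
    channel.record_error()      # e.g. a loose cable producing repeated errors
print(channel.mode)             # the channel has fallen all the way to PIO

for _ in range(1000):
    channel.record_success()    # cable fixed, but the mode stays where it is
print(channel.mode)
```

The important property is that the ratchet only moves one way: once the errors stop, nothing in the model promotes the channel back to a faster mode until it is explicitly redetected.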
At that point, the problem was obvious. When the ATAPI driver detected the drive errors a few weeks ago, it put the channel into PIO mode in an attempt to eliminate the errors. I fixed the errors by reseating the cable, but never reset the IDE channel to which the drive was connected, so it was still in PIO mode, and thus generating a plethora of hardware interrupts. I confirmed this hypothesis in Device Manager; the Secondary IDE Channel’s Advanced Settings dialog showed the Current Transfer Mode as PIO mode.
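To get a feel for why a channel stuck in PIO mode generates so many more interrupts than one running DMA, here is a back-of-the-envelope estimate in Python. Every number in it is an assumption for illustration: 512-byte sectors, one interrupt per sector in PIO mode, one interrupt per 64 KiB request in DMA mode, and a hypothetical 20 MiB/s of backup traffic. Real hardware and drivers will differ.

```python
SECTOR_BYTES = 512              # assumed sector size
DMA_REQUEST_BYTES = 64 * 1024   # assumed size of one DMA transfer request


def interrupts_per_second(throughput_bytes_per_s, bytes_per_interrupt):
    """Interrupt rate needed to sustain a throughput at a given transfer grain."""
    return throughput_bytes_per_s / bytes_per_interrupt


throughput = 20 * 1024 * 1024   # hypothetical 20 MiB/s of disk traffic

pio_rate = interrupts_per_second(throughput, SECTOR_BYTES)
dma_rate = interrupts_per_second(throughput, DMA_REQUEST_BYTES)

print(f"PIO: {pio_rate:.0f} interrupts/s")
print(f"DMA: {dma_rate:.0f} interrupts/s")
print(f"PIO needs about {pio_rate / dma_rate:.0f}x as many interrupts")
```

Under these assumptions the gap is two orders of magnitude, and in PIO mode the CPU also has to move the data itself, so a saturated core is not surprising.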
To fix it, I simply uninstalled the Secondary IDE Channel and rebooted the server to let it redetect the channel. After reboot and redetection, the Current Transfer Mode was restored to ‘Ultra DMA Mode 6’. Sure enough, hardware interrupts dropped to next-to-nothing, and backup speeds were restored to their former glory.