Irony

My story begins with slowness, like many IT stories before it. Let me give some context. It concerns a custom application hosted at a managed service provider (MSP). The problem started about a week after production launch, when staff and the broader customer base began experiencing slowness and timeouts.

The usual answer ensues. "The servers look good," says the MSP. "CPU doesn't go above 5%, tons of free memory, disk time is less than 1%, network utilization is low. It isn't anything on our side."

Now with the onus successfully shifted, the problem lands internally.

Pinging the application host at the MSP (over an MPLS circuit) produces a 1 ms response time, impressive latency for sure. Next up, I was able to capture a "slow" event in Wireshark.

In the capture, the first line is a query from my client to the server. The next two packets are from the server to my client. Strange result for a machine "doing nothing," as the MSP said: it took over half a second to start returning the data from my query. This may not seem like a big deal, but as the week progressed the half a second turned into 5 seconds, then 10 seconds, then user revolt.
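The gap between a query and the first byte of the reply can also be timed from the client side without a packet capture. Here is a minimal sketch of that idea in Python, assuming the application speaks plain TCP and that you know a valid request to send. The function name `time_to_first_byte` and its parameters are my own illustration, not part of the application or any tool mentioned here.

```python
import socket
import time


def time_to_first_byte(host, port, payload, timeout=30.0):
    """Return the seconds between sending payload and receiving the first reply byte.

    host, port and payload are placeholders: substitute whatever request
    the application under test actually expects.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # Start the clock only once the request has been handed to the OS,
        # so connection setup time is not counted.
        start = time.perf_counter()
        first = sock.recv(1)
        if not first:
            raise ConnectionError("server closed the connection without replying")
        return time.perf_counter() - start


# Hypothetical usage (host, port and request are made up):
#   delay = time_to_first_byte("app.example.internal", 8080, b"PING\r\n")
#   print(f"first byte after {delay:.3f}s")
```

Run on a schedule, a check like this would show the drift from half a second to several seconds as a trend, while ping keeps reporting 1 ms because it never touches the application.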

All the while, the MSP stood by the claim that the server was simply not busy. After much coercing (higher-ups "talking" to the MSP), I was granted access to see for myself. To my dismay, they were right. The server was not terribly busy.

Or was it? That kernel memory usage number seems a tad askew. Look at the handles and threads. That is reason for concern.

What the? Why does the Windows Virtual Disk Service (vds.exe) have 45K open handles and 907 threads? BMC Patrol (server monitoring software) is also running on this server. On a hunch, I googled Patrol and vds.exe. The result: http://www.sentrysoftware.com/support/Patches/showPatch.asp?patchID=P1574
Sure enough, "echo list volume|diskpart" mimics the behavior of the agent!
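A runaway handle or thread count like this is easy to miss when everyone is watching CPU, memory, disk and network. Here is a minimal sketch of an automated check, assuming a Windows host with PowerShell available. The thresholds are arbitrary numbers I picked for illustration, and the function names are hypothetical; this is not how Patrol or the MSP's tooling works.

```python
import json
import subprocess

# Arbitrary illustrative thresholds, not vendor guidance.
HANDLE_LIMIT = 10_000
THREAD_LIMIT = 500

# PowerShell pipeline that emits per-process handle and thread counts as JSON.
PS_COMMAND = (
    "Get-Process | Select-Object ProcessName, HandleCount, "
    "@{Name='ThreadCount';Expression={$_.Threads.Count}} | ConvertTo-Json"
)


def collect_process_counts():
    """Windows only: ask PowerShell for handle and thread counts per process."""
    out = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    ).stdout
    data = json.loads(out)
    # ConvertTo-Json emits a single object rather than a list for one process.
    return data if isinstance(data, list) else [data]


def find_suspects(processes, handle_limit=HANDLE_LIMIT, thread_limit=THREAD_LIMIT):
    """Return processes over either limit, highest handle count first."""
    suspects = [
        p for p in processes
        if (p.get("HandleCount") or 0) > handle_limit
        or (p.get("ThreadCount") or 0) > thread_limit
    ]
    return sorted(suspects, key=lambda p: p.get("HandleCount") or 0, reverse=True)


# Hypothetical usage on the affected server:
#   for p in find_suspects(collect_process_counts()):
#       print(p["ProcessName"], p["HandleCount"], p["ThreadCount"])
```

With limits like these, a process holding roughly 45,000 handles and 907 threads would be flagged on both counts.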

The result of stopping the Patrol Agent and restarting the Virtual Disk Service was drastic. Mr. Kernel can breathe again!

The irony of server monitoring software slowly killing a server is mildly amusing. Not while the issue is going on of course, or when 10k users are yelling, but in retrospect. Maybe it's just me, but I love irony.
