These days, I've been traipsing around North America visiting various client sites to work on an extremely wide variety of projects. At one site, upon my arrival, I was asked if I knew anything about storage (I do) and whether I could help figure out if they were having performance issues with their NetApp. The organization was beginning to experience problems with both its vSphere environment and a production Microsoft cluster, but had not been able to identify the root cause.
We had some small pieces of information to use as a starting point:
- The servers hosting a Microsoft cluster were failing, and errors in their logs pointed to storage as a likely root cause.
- vSphere's built-in reporting tools were showing occasional latency spikes, but nothing comprehensive (a quick way to spot-check datastore latency yourself is sketched after this list).
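Before bringing in a dedicated storage tool, it's worth spot-checking datastore latency straight from vCenter. Below is a minimal sketch of that kind of check using pyVmomi (VMware's Python SDK); the vCenter address, credentials, and the way results are summarized are placeholders for illustration, though the latency counters themselves are standard vSphere performance counters.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- substitute your own vCenter and credentials.
ctx = ssl._create_unverified_context()  # acceptable for a quick lab spot-check only
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
content = si.RetrieveContent()
perf = content.perfManager

# Build a "group.counter.rollup" -> counter ID map, then pick the datastore latency counters.
counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
            for c in perf.perfCounter}
metric_ids = [vim.PerformanceManager.MetricId(counterId=counters[name], instance="*")
              for name in ("datastore.totalReadLatency.average",
                           "datastore.totalWriteLatency.average")]

# Pull the most recent real-time (20-second interval) samples for every host.
hosts = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in hosts.view:
    spec = vim.PerformanceManager.QuerySpec(entity=host, metricId=metric_ids,
                                            intervalId=20, maxSample=15)
    for result in perf.QueryPerf(querySpec=[spec]):
        for series in result.value:
            if series.value:
                print(f"{host.name}  datastore {series.id.instance}: "
                      f"peak {max(series.value)} ms over the last {len(series.value)} samples")

Disconnect(si)
```

A check like this will confirm (or rule out) latency on the vSphere side, but it says nothing about the non-virtualized workloads hitting the same SAN, which is exactly the gap we ran into here.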
The clues we had were helpful, but each offered only a partial view. The client's NetApp SAN serves multiple needs:
- Most of the client's vSphere environment is supported by the SAN.
- Many of the client's non-virtualized systems use the SAN for storage.
As such, relying on vSphere alone for storage information told only part of the story. It was a good starting point, but hardly a comprehensive one.
That's when I turned to SolarWinds Storage Manager. I let it run for a few days, and we began to get a much more complete picture of what was going on across the entire storage environment, not just the slice visible from vSphere.
Armed with this information, we confirmed that there was indeed a major latency issue. The worst of it occurred at specific times throughout the day, and we could track exactly which volumes were affected, which let us look for a commonality. All of the affected volumes turned out to reside on the same aggregate, which gave us a place to focus. Digging into that aggregate, we eventually discovered that someone had created a snapshot schedule for the entire aggregate, and those snapshots were firing even during the heaviest periods of activity, driving latency to service-disrupting levels.
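If you'd rather reason about this kind of pattern from raw data than from a GUI, the idea behind the analysis is simple enough to sketch. The snippet below assumes a hypothetical CSV export of per-volume latency samples (the column names and the 30 ms threshold are made up for illustration): it averages latency by hour of day to show when the spikes happen, then counts problem samples per aggregate to surface a shared component.

```python
import csv
from collections import defaultdict
from statistics import mean

# Hypothetical export of per-volume latency samples; columns are illustrative only:
#   timestamp (ISO 8601), volume, aggregate, latency_ms
SAMPLES_CSV = "volume_latency_samples.csv"
LATENCY_THRESHOLD_MS = 30  # anything above this counts as a "problem" sample

by_hour = defaultdict(list)       # hour of day -> all latency samples in that hour
by_aggregate = defaultdict(int)   # aggregate   -> count of problem samples

with open(SAMPLES_CSV, newline="") as f:
    for row in csv.DictReader(f):
        latency = float(row["latency_ms"])
        hour = row["timestamp"][11:13]   # "2013-05-01T14:05:00" -> "14"
        by_hour[hour].append(latency)
        if latency > LATENCY_THRESHOLD_MS:
            by_aggregate[row["aggregate"]] += 1

print("Average latency by hour of day:")
for hour in sorted(by_hour):
    print(f"  {hour}:00   {mean(by_hour[hour]):6.1f} ms")

print("\nProblem samples by aggregate:")
for aggr, count in sorted(by_aggregate.items(), key=lambda kv: -kv[1]):
    print(f"  {aggr}: {count}")
```

In our case, Storage Manager surfaced the time-of-day pattern and the shared aggregate directly, but the reasoning that led us to the snapshot schedule was exactly this: find when it hurts, then find what the hurting volumes have in common.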
The moral of the story: having the right tools to monitor the environment is critical. Without them, administrators are left guessing and may never pin down the true culprit. With Storage Manager, we were able to quickly identify the scope of the problem, correct it, and ensure that applications were getting the storage resources they need.
What are your thoughts? How do you start troubleshooting potential storage-related performance issues?