There are so many monitoring products that it’s hard to keep track of them all, and the volume of information these tools generate can be just as hard to keep up with.
A few months ago, BigPanda*, a data-science platform that sorts through and correlates the data thrown off by many of these tools, published a “MonitoringScape” that tries to list and categorize as many of the products as possible, and provides a mechanism for crowdsourcing corrections and updates.
Here’s the summary view:
It all raises the question: Why are there so many tools in this area, and how can they be combined to provide an effective monitoring system for organizations struggling to manage all their data and IT alerts?
In my view, monitoring tools fit into a generic workflow:
- Measure – there are many sources of raw measurements, from hardware all the way up to applications;
- Analyze – raw measurements are turned into information;
- Store – raw measurements and analytical results need to be kept over time;
- Visualize – users need to see what’s going on at every level;
- Events – create and deliver notifications and actions to correct problems.
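To make the workflow concrete, here is a minimal Python sketch of the five stages wired together. It is an illustration only, not how any particular product works: the `Pipeline` class, its method names and the 90 percent threshold are all hypothetical, and the "source" is just a list of numbers standing in for real measurements.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Pipeline:
    """Toy monitoring pipeline; every name here is a hypothetical illustration."""
    threshold: float = 90.0                      # assumed alert threshold, in percent
    store: list = field(default_factory=list)    # Store: raw samples plus analysis
    events: list = field(default_factory=list)   # Events: notifications raised so far

    def measure(self, source):
        """Measure: pull raw samples from a source (any iterable of numbers)."""
        return list(source)

    def analyze(self, samples):
        """Analyze: turn raw samples into summary information."""
        return {"mean": mean(samples), "max": max(samples)}

    def visualize(self):
        """Visualize: a one-line text rendering of each stored result."""
        return [f"mean={r['mean']:.1f} max={r['max']:.1f}" for r in self.store]

    def run(self, name, source):
        """Push one batch of measurements through every stage."""
        samples = self.measure(source)
        summary = self.analyze(samples)
        self.store.append({"name": name, "samples": samples, **summary})
        if summary["max"] > self.threshold:
            self.events.append(
                f"{name}: max {summary['max']:.1f} exceeds {self.threshold:.1f}"
            )
        return summary


pipeline = Pipeline()
pipeline.run("cpu.web-1", [20, 35, 95])   # crosses the threshold, raises an event
pipeline.run("cpu.web-2", [10, 20])       # stays quiet
print(pipeline.visualize())
print(pipeline.events)
```

Real tools usually specialize in one or two of these stages, which is part of why so many of them exist.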
A lot of the diversity of monitoring tools comes from the many possible sources of measurements. Even common measures like CPU utilization can be collected in many ways, such as for bare-metal hardware, virtual machines, containers, processes and even inside applications. Network monitoring includes network switches but also all the protocols, connections and flows between systems.
Another problem limiting the effectiveness of today’s monitoring tools is that the scale and complexity of installations are growing rapidly. Tools have trouble coping with the flood of data and understanding the rapidly changing relationships between components. I summarized this conundrum in my Cloud Trends presentation, delivered earlier this summer, in the following “Tragic Quadrant” graphic. It’s tragic because most of the monitoring tools I’ve looked at fall into the bottom left corner of the chart, although some of the latest tools are trying to break out.
Almost every tool can handle hundreds of monitored systems in a slowly changing datacenter environment, where machines tend to have stable configurations for months or years. But this obviously isn’t the case at many fast-growing, Web-scale companies.
A new generation of cloud-aware tools has emerged that can handle self-service provisioning of thousands of cloud instances that exist for hours to weeks.
Companies like Netflix have had to build their own monitoring tools that can handle tens to hundreds of thousands of instances, as well as high rates of change as thousands of immutable instances are created and destroyed by daily auto-scaling and each code push. Google has to manage millions of containers, coming and going in seconds, and Amazon has introduced AWS Lambda, in which a container only exists for a fraction of a second to process a single request.
We are seeing tools become Docker-aware, so they can see containers. And the latest generation of tools has moved from collecting data every minute to gathering it up every second, so they can capture the rate of change. A few tools are also architected to scale up to tens of thousands of monitored instances, but it’s early days and few proof points exist.
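A small Python sketch shows why the move from one-minute to one-second collection matters. The data is synthetic and the `downsample` helper is a hypothetical name: a five-second spike to 100 percent is plainly visible in per-second samples but nearly disappears once the same minute is averaged into a single data point.

```python
def downsample(samples, window):
    """Average consecutive groups of `window` samples into one value each."""
    groups = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(group) / len(group) for group in groups]


# Synthetic data: one minute of CPU utilization at 10%, with a 5-second spike.
per_second = [10.0] * 60
per_second[30:35] = [100.0] * 5

per_minute = downsample(per_second, 60)

print(max(per_second))   # the spike is visible at one-second resolution
print(per_minute)        # a single averaged point for the whole minute
```

With instances that may live for only minutes or seconds, a one-minute average can miss most of what happened during their lifetime.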
It’s clear that no one tool is going to provide everything you need to monitor applications, infrastructure and networks, and at any point in time there may be migrations from older to newer tools going on. With so many tools, there’s a need to integrate events across all monitoring sources, in order to identify patterns, connections and anomalies. Failure or misconfiguration of a single hardware device or web service can ripple out and cause many far-flung problems that are hard to trace back to the root cause.
This is why it is important to have higher-level event processing capabilities, such as those BigPanda provides, that take event feeds from a wide range of tools and applications and find the signal in the noise using data science algorithms. Systems in normal use already produce a huge number of events, and serious outages produce an even bigger flood. Correlating these alerts and making sense of what is going on is a key capability for keeping services running and avoiding “firefighting overload” for site reliability engineering (SRE) teams.
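As a rough illustration of what correlation means here, the Python sketch below groups alerts from different sources into incidents when they arrive close together in time. This is a deliberately simple stand-in and not BigPanda’s algorithm; the `Alert` type, the `correlate` function, the 30-second gap and the sample alerts are all assumptions made for the example.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    timestamp: float   # seconds since some reference point
    source: str        # which monitored system raised the alert
    message: str


def correlate(alerts, gap=30.0):
    """Group alerts into incidents: an alert joins the current incident
    if it arrives within `gap` seconds of that incident's latest alert."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= gap:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


alerts = [
    Alert(5, "web-1", "request timeouts"),
    Alert(0, "switch-7", "link down"),
    Alert(12, "db-2", "replication lag"),
    Alert(600, "web-9", "disk full"),
]

for incident in correlate(alerts):
    print([f"{a.source}: {a.message}" for a in incident])
```

Here the switch failure and the two alerts that follow it collapse into one incident, while the unrelated disk alert ten minutes later stands alone. A production system would also need to use topology, alert content and history, since time proximity by itself will merge unrelated problems during a busy outage.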
Today, we are very happy to announce that BigPanda has joined the Battery Ventures portfolio. Our team is impressed with the company’s capabilities, leaders and progress to date, and we look forward to a long relationship with them.
*For a full list of all Battery investments and exits, please click here.