Make no mistake, distributed application architectures can bring tremendous benefits for business. For example, actually using Cloud the way it’s designed can achieve the cost savings promised years ago.
But for enterprise IT, distributed architectures can create a problem—many of the monitoring tools we’ve relied on for years no longer work. In other words, traditional infrastructure monitoring approaches must be rebooted to restore comprehensive visibility.
Fortunately, by adopting a few best practices, your team can maintain deep application performance visibility, even as component services become more and more distributed.
Differences in Environment
As engineers, we recommend technology and technical solutions based on several factors—suitability, ease of maintenance, cost, reliability, risk, and more. And monitoring often plays a big role, especially when it comes to platforms.
A vendor might have a new quantum computing-based CRM app that doubles sales, but if we can’t monitor it, we’ll try to avoid it. That has driven a symbiosis of tools and platforms, resulting in the traditional “app stack” of packaged applications with compute and storage, usually virtualized. Monitoring tools for these systems are also mature—generally polling-based: install them and see data.
With distributed applications, IT should embrace two new partners as equals to polling-based infrastructure metrics: log event data and transaction tracing. Without them, admins face distributed-architecture issues that simply can’t be solved any other way.
For example, distributed applications are often managed by orchestrators that autonomously reconfigure and deprovision resources based on changing demand. That can create two new issues not found in a traditional app stack.
First, it makes both logging sources and log data ephemeral, creating an additional operations task to configure log aggregation and storage requirements. Second, it requires ongoing observation of transactions—plus something akin to time travel—to accurately understand the actual vs. intended interaction of all the application’s components. Visio might work for mapping the campus LAN, but not for visualizing a bottleneck between a Redis cluster and MongoDB.
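One common way to handle the first problem is to stop relying on local log files and instead emit structured events to stdout, where the orchestrator’s log aggregator can pick them up before the container disappears. Here’s a minimal Python sketch; the service name and fields are purely illustrative, and in practice the orchestrator would inject identifiers such as pod or container IDs.

```python
import json
import logging
import sys
import time

# Hypothetical service name; an orchestrator would normally inject pod and
# container identifiers through environment variables.
SERVICE = "booking-api"

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line a log shipper can parse."""
    def format(self, record):
        return json.dumps({
            "ts": time.time(),
            "service": SERVICE,
            "level": record.levelname,
            "message": record.getMessage(),
        })

# Write to stdout so the aggregator captures events even after the container
# that produced them has been deprovisioned.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger(SERVICE)
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("reservation confirmed for room %s", "1204")
```

Because each record is a single JSON line, whichever aggregator you deploy can index it without custom parsing rules.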
DevOps’ relationship with distributed applications also adds another wrinkle in the form of accelerated or even continuous delivery. It’s one thing for a developer to spot a memory leak during a bench test, but something else to troubleshoot an issue that only appears when the app is spun up into 100 or 1,000 containers, automatically distributed across multiple Kubernetes pods.
Admins can work around this challenge by including log, event, and tracing data alongside infrastructure, network, user, and digital experience monitoring metrics. Visualizing rate of change and overlaying deployment events on performance charts makes quick work of fixing a bad push.
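The deployment-overlay half of that is simply making the pipeline announce every push to whatever system draws your performance charts. A minimal sketch, assuming a hypothetical events endpoint (substitute your monitoring platform’s own deployment-marker or annotations API):

```python
import json
import os
import urllib.request

# Hypothetical annotation endpoint; point this at your monitoring platform's
# deployment-marker or events API.
EVENTS_URL = os.environ.get("MONITORING_EVENTS_URL",
                            "https://monitoring.example.com/api/events")

def record_deployment(service: str, version: str) -> None:
    """Post a deployment event so it can be overlaid on performance charts."""
    payload = json.dumps({
        "type": "deployment",
        "service": service,
        "version": version,
    }).encode()
    req = urllib.request.Request(
        EVENTS_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# Called as the last step of the deploy pipeline, for example:
# record_deployment("booking-api", "2.4.1")
```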
Differences in Reliability Metrics
Among the chief advantages of Cloud is improved resiliency—not to be confused with availability. I’m guilty of perpetuating the myth that for each new “9” IT adds to an application’s availability, cost increases by an order of magnitude. That’s a bit of an exaggeration, but only just. Distributed applications allow us to adopt a different success metric: the user’s experience of our business.
The point of resiliency is to end the obsession with an internal KPI (availability) that management cares about and, instead, focus on what could have the most impact for the business. Are customers happy, even during app component failures? Can we recover from a major database outage in a few minutes? Can we quickly remediate novel issues that might only happen once in 10,000 years? Can we detect application issues with more subtle business effects like decreasing yield?
At this point you might be thinking, “Great. All I have to do is completely change my tools, approach, processes, and culture, and I’ll have distributed application management in hand.” Fortunately, managing distributed applications is similar to almost everything in IT—a small number of best practices usually address 80% of the pain. In this case, there are three.
Best Practice No. 1 – Automation
If you’re responsible for distributed applications, or anything in the Cloud, you need to code, period. Actually, if you plan on having a career in IT five years from now, you need to code. I’m not suggesting that admins must be software developers. We didn’t come here to do that.
But at a minimum, know enough of a language—Python is great—to convert routine manual tasks into something that runs when you’re not there. Resist the misconception that you can automate yourself out of a job. (If you haven’t seen Site Reliability Engineer (SRE) compensation, you should.)
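For example, the kind of routine task worth scripting might be a scheduled health check that emails the on-call alias when a service stops answering. This is a hedged sketch with placeholder endpoints and mail relay; cron or your CI scheduler runs it while you sleep.

```python
import smtplib
import urllib.request
from email.message import EmailMessage

# Placeholder endpoints and mail relay; replace with your own.
CHECKS = {
    "booking-api": "https://booking.example.com/healthz",
    "payments": "https://payments.example.com/healthz",
}

def is_healthy(url: str) -> bool:
    """Return True if the endpoint answers 200 within five seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def main() -> None:
    failures = [name for name, url in CHECKS.items() if not is_healthy(url)]
    if failures:
        msg = EmailMessage()
        msg["Subject"] = f"Health check failed: {', '.join(failures)}"
        msg["From"] = "monitor@example.com"
        msg["To"] = "oncall@example.com"
        msg.set_content("Scheduled health check found unreachable services.")
        with smtplib.SMTP("mail.example.com") as smtp:
            smtp.send_message(msg)

if __name__ == "__main__":
    main()
```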
Not only will using code to deploy and manage your distributed app monitoring elements simplify your life, you’ll also find the quality of your data is better. And if you’re clever, you can use your understanding of code to talk to application developers and help them bake monitoring into the app or deployment bundle.
If components your users care about automatically send the telemetry you specify, not simply what devs need, you can dramatically simplify troubleshooting. You might even establish policy that apps aren’t approved for deployment unless they provide a defined set of metrics and integrate into your monitoring system. And you should not attempt continuous delivery unless the pipeline itself is instrumented and included in app performance views.
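In practice, “baking monitoring in” can be as simple as a small helper the dev team adopts so every component emits the telemetry you agreed on by default. Below is a minimal Python sketch; the metric names are illustrative, and printing to stdout stands in for whatever transport your monitoring standard actually requires.

```python
import functools
import json
import sys
import time

def timed(metric_name: str):
    """Emit a latency metric for every call of the decorated function."""
    def wrapper(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                # Stdout stands in for your real metrics transport.
                print(json.dumps({"metric": metric_name,
                                  "value_ms": round(elapsed_ms, 2)}),
                      file=sys.stdout)
        return inner
    return wrapper

@timed("booking.search.latency")
def search_rooms(city: str):
    ...  # application logic
```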
Best Practice No. 2 – Introduce New Metrics
Because distributed application architectures are being driven primarily by new business requirements, you’re going to need some new metrics to give you the freedom to take full advantage of them.
For better or worse, we’ve conditioned managers to leave us alone if the NOC dashboard is all green. But if you’re using Netflix Chaos Monkey for resiliency testing, then red statuses are going to pop up all over at random. System metrics like error rate juxtaposed with business metrics are usually enough to calm most managers.
Start thinking of metrics to measure application performance beyond simple feeds and speeds. Many times, with distributed applications, the true measure of the application’s performance is actually a leading business indicator, not an infrastructure indicator.
A web-based application that provides a booking service for a hotel is a good example. Tracking per-visitor room upgrades and package add-ons would be a great way to measure performance, in addition to traditional latency and page load times. The more quickly data is fetched, transactions are processed, queues are cleared, and pages are served, the more visitors are likely to see and buy.
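To make that concrete, here is a small, hypothetical sketch that derives both kinds of numbers, an upsells-per-visitor business indicator and a page latency figure, from the same event stream. The event records are invented for illustration; in a real deployment they would come from your log aggregation or analytics pipeline.

```python
import statistics

# Invented event records; a real pipeline would supply these from log
# aggregation or analytics.
events = [
    {"visitor": "v1", "type": "page_view", "latency_ms": 180},
    {"visitor": "v1", "type": "room_upgrade"},
    {"visitor": "v2", "type": "page_view", "latency_ms": 95},
    {"visitor": "v3", "type": "page_view", "latency_ms": 410},
    {"visitor": "v3", "type": "package_addon"},
]

visitors = {e["visitor"] for e in events}
upsells = sum(1 for e in events if e["type"] in ("room_upgrade", "package_addon"))
latencies = [e["latency_ms"] for e in events if e["type"] == "page_view"]

print(f"upsells per visitor: {upsells / len(visitors):.2f}")
print(f"median page latency: {statistics.median(latencies):.0f} ms")
```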
For particularly critical applications with numerous services, there’s another metric you should also consider—digital experience monitoring. Where a traditional monolithic application can rely on human users to complain if it performs poorly, that’s not the case when the app is a hairball of API interconnections.
Any one of those service interfaces could bottleneck, or break the app entirely if the service becomes unavailable. Digital experience monitoring extends user experience monitoring to test like digital users—exercising all the downline processes that depend on service APIs.
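A lightweight way to test like a digital user is a synthetic probe that walks the same chain of service APIs a real booking would touch and records the status and latency of each hop. The endpoints below are hypothetical.

```python
import time
import urllib.request

# Hypothetical endpoints along one user journey.
JOURNEY = [
    ("search", "https://api.example.com/v1/search?city=SEA"),
    ("availability", "https://api.example.com/v1/availability/123"),
    ("quote", "https://api.example.com/v1/quote/123"),
]

def probe(name: str, url: str) -> None:
    """Time one API hop the way a digital user would experience it."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = resp.status == 200
    except OSError:
        ok = False
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {'ok' if ok else 'FAIL'} in {elapsed_ms:.0f} ms")

for step, endpoint in JOURNEY:
    probe(step, endpoint)
```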
Best Practice No. 3 – Accept Change
While recent data from the 2018 IT Trends Report: The Intersection of Hype & Performance reveals that IT is adopting new technology at an accelerated rate, many of the components of distributed applications, from topology to dependencies to operations requirements, aren’t coming from operations. Distributed applications come from dev. And you routinely see the phrases “DevOps” and “culture change” together for a reason.
Like it or not, you’re going to have to find a way to not just work with dev, but collaborate. Developers must learn to care more about what happens in operations and work with them to suggest what metrics would be most helpful. Conversely, operations must learn to provide monitoring back to dev, and complete the feedback loop that will allow dev to improve performance and stability.
When you get to the root of the traditional demarcation between dev and ops—where advanced techniques like continuous deployment break—it’s a lack of common goals. Many organizations have found that monitoring, and the shared conversations about performance, cost, and benefits, facilitates collaboration between dev and ops. It’s a good first step in achieving the wider goals driving Cloud and automation in the first place.
Managing distributed applications isn’t that different from any other type once you eliminate missing metrics and contentious anecdotal feedback. It’s much easier to mitigate the complexity of distributed application operations when you’re working as a team and sharing the same level of rich data you’ve always relied on. It’s less painful to troubleshoot, service quality is higher, and executives are happier. And, if you really get developers and ops on the same page, you might even find both teams sleep better at night.