What Architecture Can Teach Us About Self-Healing Systems

By Phil Tee

DevOps teams and site reliability engineers (SREs) deal with code daily. Doing so teaches them to scrutinize their world, make astute observations, and draw unexpected connections. After all, although highly logical and mathematical in nature, software development is, at least in part, an art form.

Unconvinced by that statement? Consider the parallels between history’s most remarkable architectural feats and modern software engineering. It is an apt comparison: Just like software engineering, architecture employs complex mathematical calculations to create something beautiful. And in both disciplines, a slight miscalculation can lead to significant consequences. Fascinatingly, many famous architectural mistakes are analogous to issues we find in code.

Remember, inspiration is everywhere – as long as you know where to look. Here are a few lessons software engineers can learn from architectural epiphanies through the centuries, especially regarding the future of self-healing systems.

Lesson 1: Edge cases will always exploit system vulnerabilities

Citicorp Tower – now called 601 Lexington – finished construction in New York City in 1977, at which time it was the seventh tallest building in the world. The skyscraper’s state-of-the-art design included three 100-plus-foot stilts. It was a marvel at completion. However, an undergraduate student soon discovered something jarring: Strong winds could jeopardize the building’s integrity. Specifically, if powerful quartering winds hit the corners of Citicorp Tower, the structure was subject to collapse – a literal edge case.

The tower had a one-in-16 chance of collapsing each year. These odds may entice someone sitting at a gambling table, but the outlook was grim for the architects and structural engineers behind Citicorp Tower. Thankfully, technicians were able to reinforce the building’s bolted joints. Disaster was avoided.

Structural engineers knew Citicorp Tower would eventually face a wind strong enough to compromise its bearings. Similarly, seasoned software engineers know robust application performance monitoring (APM) and event management are not enough to protect a system from the inevitable edge cases. That is because static systems without machine learning (ML) capabilities cannot handle unexpected and unplanned new situations, such as quartering winds. When relying solely on monitoring tools, a human administrator must decipher errors and escalate the incident management process.

To reduce mean time to detect (MTTD) and mean time to recover (MTTR), DevOps teams must accept the high probability of edge cases and work to deploy self-learning solutions preemptively. This lesson goes a long way, as foresight is critical in engineering.

Lesson 2: “Building the plane as it flies” creates a never-ending cycle

Tragic events have delivered several of the most important lessons in aviation history. When a plane suffered immense decompression mid-flight and crashed in 1954, engineers ascertained that square passenger windows were an unnecessary stress point. Henceforth, planes were outfitted with rounded windows. Onboard fires led to new seating arrangements prioritizing ease of evacuation. These changes have saved countless lives.

In many industries – aviation included – there is no way to exhaustively stress-test a product. As mentioned earlier, edge cases are unavoidable. The biggest takeaway here is that software engineers must heed their system’s vulnerabilities when they present themselves. From there, they must address them expediently. Doing that requires two things: (1) identifying and tracking the right key performance indicators (KPIs) and (2) investing time and resources into improving systems based on relevant metrics.

The average engineering team invests in 16 to 40 monitoring tools, yet teams often miss the mark on which metrics demonstrate success. Fewer than 15% of teams track MTTD, so they miss 66% of the incident lifecycle. And one-fourth of teams report missing their service level agreements (SLAs) despite significant investment in availability tracking. This tells us that data collection alone does not cut it without thorough, systematic analysis – point solutions are no longer enough.
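As a rough illustration of what tracking these metrics involves, here is a minimal Python sketch that derives MTTD and MTTR from incident timestamps. The Incident record, its field names, and the sample timestamps are hypothetical and not drawn from any particular tool. MTTR is assumed here to run from the start of the fault to resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: average of (resolved - started), by assumption."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def detection_share(incidents: list[Incident]) -> float:
    """Fraction of the average incident spent before anyone noticed."""
    return mttd(incidents) / mttr(incidents)


# Made-up sample data, for illustration only.
EXAMPLE_INCIDENTS = [
    Incident(datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 40), datetime(2024, 1, 8, 10, 0)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 20), datetime(2024, 1, 9, 15, 0)),
]

if __name__ == "__main__":
    print("MTTD:", mttd(EXAMPLE_INCIDENTS))
    print("MTTR:", mttr(EXAMPLE_INCIDENTS))
    print("Share of lifecycle before detection:", detection_share(EXAMPLE_INCIDENTS))
```

A team that records only the resolution time cannot compute the first or third figure at all, which is the gap the statistics above describe.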

Software engineers, DevOps teams, and SREs must prioritize processes and tools that extract value from overwhelming amounts of information about availability. Instead of simply observing a critical error, they must take a page from an aviation engineer’s book and make critical decisions, fast. The secret to doing so lies in AI.

Lesson 3: AI is a fundamental building block for self-healing systems

A wholly autonomous, perfectly functioning, self-healing system is ideal for any software engineer. Systems that patch themselves are good for customer satisfaction, as they eliminate costly consumer-facing downtime. Moreover, they are incredibly beneficial for IT service management (ITSM) functions, as they significantly reduce the need for tedious ticket management. Building such a system requires several components, many of which are currently out of reach. But we are closer to a self-healing reality than some may realize.

The lack of widespread AI adoption remains the biggest hurdle that self-healing systems face today. Although many businesses have adopted rudimentary AI or ML-based tools, the integrity of these tools is questionable. That is to say, many engineers deal with artificial intelligence for IT operations (AIOps) technologies that follow rules-based automation logic instead of autonomous AI algorithms. The distinction may seem minute, but in practice, it is the difference between hours of lost productivity and millions in possible losses.

Rules-based AIOps tools analyze the interactions between disparate point solutions and can likely identify common data errors. But automation-based systems cannot process the evolution of entirely new errors over time, nor can they predict novel malfunctions in data. That is because the human administrators coding these functions ask the system to follow an "if this, then that" logic pattern. Genuinely efficient AIOps tools mitigate errors that arise at all four classic telemetry points – from detection to resolution – by classifying new and problematic patterns before human technicians are even aware of their existence.
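To make the distinction concrete, the following Python sketch contrasts a hard-coded rule with a detector that learns its baseline from recent data. It is a deliberately simple statistical stand-in, not a real AIOps algorithm and not any vendor's method. The function and class names, the 500 ms limit, the window size, and the sample latencies are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def static_rule(latency_ms: float, limit_ms: float = 500.0) -> bool:
    """If this, then that: alert only when a hand-picked limit is crossed."""
    return latency_ms > limit_ms


class RollingBaseline:
    """Flags values that deviate sharply from recently observed behavior."""

    def __init__(self, window: int = 20, z_limit: float = 3.0, min_sigma: float = 1.0):
        self.history = deque(maxlen=window)  # most recent normal values
        self.z_limit = z_limit               # allowed deviation, in standard deviations
        self.min_sigma = min_sigma           # floor so a flat history is not oversensitive

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 5:  # wait for a few samples before judging
            mu = mean(self.history)
            sigma = max(stdev(self.history), self.min_sigma)
            anomalous = abs(value - mu) > self.z_limit * sigma
        if not anomalous:
            # Learn only from normal values so a spike does not inflate the baseline.
            self.history.append(value)
        return anomalous


# Made-up latency samples (milliseconds) for a service that normally answers in ~100 ms.
STEADY = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0]

if __name__ == "__main__":
    detector = RollingBaseline()
    for sample in STEADY:
        detector.is_anomaly(sample)
    spike = 180.0
    print("static rule alerts:", static_rule(spike))
    print("learned baseline alerts:", detector.is_anomaly(spike))
```

A jump from roughly 100 ms to 180 ms never trips the 500 ms rule, because nobody anticipated that failure mode when writing it. The baseline detector flags it because the value is far outside what the service has recently done. One trade-off of this sketch is that it never absorbs flagged values, so a legitimate shift to a new normal would need separate handling.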

While we await the imminent third wave of AI, this version of AIOps is the closest we have to self-healing systems. It will be interesting to track how current AIOps applications bleed into the future of AI, which will include fully realized automation and independent thought possibilities. Maybe then structural engineers, too, will reap the rewards of an AI-based, self-healing system.