AIOps promises to help companies intelligently manage IT operations, but the road there isn’t always smooth.
IT operations teams have a lot to juggle. They manage servers, networks, cloud infrastructure, user experience, application performance, and cybersecurity, often working independently of one another. Staffers are more often than not overworked, burdened with excessive alerts, and struggling to solve problems that involve multiple domains.
Enter AIOps, a burgeoning field of technologies and strategies that inject artificial intelligence into IT operations. The aim is to solve the challenges faced by IT operations teams by reducing false positives, using machine learning to spot problems before they occur, automating remediation, and providing a holistic view of the enterprise.
According to an October survey of IT leaders conducted by ZK Research and Masergy, 65% of companies are already using AIOps, and 94% say that AIOps is “important or very important” for managing network and cloud application performance. In addition, 84% see AIOps as a path to a fully automated network environment and 86% expect to have a fully automated network within the next five years.
Although AIOps is still new, it is already proving its worth. According to a survey by Enterprise Management Associates (EMA) released this summer, 62% of companies see “very high” or “high” ROI from their AIOps investments, and the rest say they have broken even or that it is too early to tell.
But the path to AIOps isn’t always smooth. More than half of the respondents to the EMA survey also said that AIOps was “challenging” or “very difficult” to implement. The most common obstacles companies reported include cost, data quality, conflicts within IT, distrust of AI, lack of skills, and integration challenges.
No clear strategy before adoption
Today’s IT organizations are operating under high pressure, and it can feel like there’s not enough time for methodical preparation.
“Organizations are generally time-poor and resource-constrained,” says John Carey, managing director in the technology practice at AArete, a global management consulting firm.
Too often, AI projects start out as experiments that grow into opportunities. “You need a strategy,” Carey says. “AIOps needs to be thorough and planned.”
Rolling out a technology solution without first clearly defining the challenge you’re trying to solve is an age-old issue for IT, agrees Donncha Carroll, partner in the revenue growth practice at Axiom Consulting Partners. Carroll recommends companies take time to detail the nature of the problem they’re going to solve and how it’s going to impact the business.
“And confirm that a more conventional solution is not appropriate or effective,” he says. “Otherwise, you can invest a lot of dollars in implementing a solution that doesn’t deliver the vision that you have set up for it.”
In fact, according to the EMA survey, even though companies were universally positive about their AIOps investments, a staggering 80% are looking for a new platform — and half of them are planning to switch within the coming year.
The biggest reasons? They’re looking for more flexibility, greater scalability, and more advanced AI, ML, and analytics. Such drastic switches underscore that companies often fail to consider the broader picture when picking a solution that has to serve the business for the long term, Carroll says.
“It’s important to think about developing a comprehensive strategy, and then implement on a use-case basis,” he says.
Poor or incomplete data
According to the EMA survey, data issues are the second biggest hurdle to successful AIOps deployments, after cost.
AI and ML live and die on training data. But a company’s legacy operations systems might not be collecting performance data in a consistent manner. The data may also be missing critical aspects or contain contradictory information.
“The market today is in its first-generation phase,” says Gregory Murray, senior research director at Gartner. “We’re analyzing the data that we have because it’s the data that we have.”
Something similar happened with hard drives, he says. For years now, hard drives have had instrumentation and analytics that predict drive failure, and they’re instrumented with exactly the telemetry they need to make those predictions.
“Outside that use case, you don’t need that data,” says Murray.
The same thing will happen with AIOps. As the industry deploys AIOps technology, we will learn more about what data actually needs to be collected.
“The promise is there for improved accuracy and precision once we start to generate data sets that are fit for the purpose,” he says.
When data is available, it might not necessarily be in a format that makes for a good training data set. For example, companies might want to know whether a particular change will cause problems based on the servers and applications impacted, says Jorge Machado, partner at McKinsey & Co. To do this analysis, the written description of the change is a critical factor.
“If it was poorly written, running natural language processing on that text wouldn’t give you any interesting insights,” says Machado. Similarly, an AI wouldn’t be able to pick out patterns in descriptions of open tickets if they’re poorly written, he adds.
More importantly, critical data sets are often incomplete. For example, a company might want to link an event to relevant applications, networks, or servers. “But no client has a perfect change management database,” Machado says, adding that these issues take significant work to resolve.
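One way to see how incomplete such a data set is before feeding it to an AIOps tool is to measure how many events can actually be linked to a known system. The following is a minimal Python sketch of that check, not a description of any vendor’s method. The `Event` record, the `inventory` mapping, and the host names are all hypothetical and stand in for whatever a company’s monitoring tools and asset database actually provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    host: str  # host name as reported by a monitoring tool (hypothetical field)


def linkage_coverage(events, inventory):
    """Return (share of events linked to a known host, list of unlinked events)."""
    unlinked = [e for e in events if e.host not in inventory]
    total = len(events)
    ratio = 1.0 if total == 0 else (total - len(unlinked)) / total
    return ratio, unlinked


# Hypothetical inventory: host name -> application it supports.
inventory = {"web-01": "storefront", "db-01": "orders-db"}

# Hypothetical events; "cache-07" is deliberately absent from the inventory.
events = [
    Event("e1", "web-01"),
    Event("e2", "db-01"),
    Event("e3", "cache-07"),
    Event("e4", "web-01"),
]

ratio, unlinked = linkage_coverage(events, inventory)
print(f"{ratio:.0%} of events map to a known system")
print("Unlinked hosts:", sorted({e.host for e in unlinked}))
```

A low ratio, or a long list of unlinked hosts, is a signal that the inventory needs work before any model can reliably tie an event to the applications, networks, or servers it affects.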
Inadequate coverage
To get the full benefits of AIOps, companies need to bring as many systems as possible under its umbrella, given that a problem in one part of the environment can have cascading effects somewhere else. A network problem might actually be a cybersecurity issue, or a user experience problem could be caused by a slow database server.
“As more companies migrate to digital, there are more interdependencies in applications,” says Machado. “If an application is underperforming, it’s likely to cause issues in other systems.”
But there are many obstacles to getting there. One is the cost of such a system. Another is the integration challenge of getting all the relevant data sources to work together. And there are organizational aspects that need to be addressed, says Machado. “Ultimately, your organizational fragmentation dictates your tool fragmentation.”
And it’s not just IT silos, he adds. AIOps needs inputs from other areas of the business to be effective. For example, if a company has a big product launch, starts a new marketing campaign, or offers a large discount, it could cause a spike in calls to a data center or traffic to a website and crash the system.
“You need to connect not just the application performance and server performance but events coming from the business side,” he says.
“The most successful AIOps implementations that we’ve seen have multi-departmental use cases,” agrees Will McKeon-White, an analyst at Forrester Research. Not just IT-related ones, like cybersecurity, but connections outside IT, such as to marketing, he says.
An AIOps system that collects real-time user monitoring data can become a shared business service, McKeon-White says, not just something that helps automate IT. “Those are the most successful use cases that we’ve seen.”
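The idea of connecting business-side events to operational data can be illustrated with a small Python sketch. It flags traffic samples that are well above the median and then looks for a business event that started shortly before each spike. Everything here is an assumption made for illustration: the sample data, the event calendar, the `factor` threshold, and the two-hour `window` are hypothetical, and a real AIOps platform would use far more sophisticated detection.

```python
from datetime import datetime, timedelta
from statistics import median


def find_spikes(samples, factor=2.0):
    """Return the (timestamp, value) samples that exceed factor x the median value."""
    baseline = median(value for _, value in samples)
    return [(ts, value) for ts, value in samples if value > factor * baseline]


def business_context(spike_time, business_events, window=timedelta(hours=2)):
    """Return names of business events that started within `window` before the spike."""
    return [
        name
        for name, start in business_events
        if start <= spike_time <= start + window
    ]


# Hypothetical hourly request counts for a website.
base = datetime(2024, 3, 1, 10, 0)
samples = [
    (base + timedelta(hours=i), value)
    for i, value in enumerate([100, 110, 105, 400, 95])
]

# Hypothetical calendar supplied by the marketing team.
business_events = [("discount campaign", datetime(2024, 3, 1, 12, 30))]

annotated = [
    (ts, value, business_context(ts, business_events))
    for ts, value in find_spikes(samples)
]
for ts, value, causes in annotated:
    print(ts.isoformat(), value, causes or "no known business event")
```

With a shared calendar like this, a spike that lines up with a campaign can be treated as expected load rather than an unexplained anomaly.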
Paying double
Another issue that can cause internal organizational conflicts is when individual teams or departments have their own preferred toolsets and don’t want to give them up.
“Getting rid of other monitoring solutions can be a political nightmare in a lot of organizations,” says McKeon-White.
Companies often compromise, keeping their existing systems and adding an AIOps platform on top of that. But this can create duplication of functionality and increase integration challenges, he says, in addition to increasing expenses. “Organizations are paying a significant amount for these tools and not getting the value they need.”
To solve this dilemma, some companies are turning to AIOps built into domain-specific systems. Application performance monitoring systems, for example, increasingly use AI and ML to spot problems. The big cloud vendors are also adding intelligent monitoring and automation solutions, as are database vendors and cybersecurity platform vendors.
It’s a relatively easy way to get some AIOps features, but at the expense of being able to get a multidomain, multicloud view of operations.
Using built-in features is also faster than building or deploying a full AIOps platform, a project that typically takes 16 months or longer, says Bradley Shimmin, chief analyst for AI platforms, analytics, and data management at Omdia.
“Pulling together all those sources of information, all those signals coming from so many diverse sources — cloud, application sensor APIs, sensors on physical devices — all of that takes integration,” he says. “That is a challenge enterprises have been facing for decades now.”
Missing the big picture
Domain-specific platforms can provide native automation of their functionality and make the AI tools transparent to the users. But while maintaining silos does avoid the integration challenges, companies won’t see the full potential of AIOps.
“If you’re trying to do something like root cause analysis for a rise in latency, you need to be able to talk to the networking system, to the application server, to see across all the different domains,” Shimmin says. “Nobody wants to stand up a Jupyter notebook in order to check their network logs to see what happened with their latency.”
Eventually, a cloud provider might be able to offer a full range of AIOps functionality, which can be useful for companies that are all-in on a single cloud provider. “Then you can see the nirvana of AIOps being realized for you,” he says. “But it’s not something that you’re going to get today.”
Moreover, most companies are multicloud, Shimmin says. In fact, the EMA survey shows a dramatic preference for having a single cross-domain AIOps platform. Of the companies that said their AIOps efforts were “extremely successful,” 80% were using a single platform. Of companies that were not using an AIOps platform, 57% were “marginally successful.”
So it’s not surprising that, while only 46% of companies overall use a single AIOps platform, the rest either plan to adopt one or are using more than one platform.
Culture change
Finally, many companies are finding their employees have a distrust of AI systems, or are reluctant to embrace change.
In the EMA survey, even at companies that reported the highest level of success with AIOps, 22% of respondents said that “fear or distrust of AI” was a top challenge to their AIOps initiatives, tying with “lack of skills” for fourth place on the list.
“There’s a fundamental distrust of a black-box approach, the one that says, ‘Don’t ask me why I came to the conclusion, but there’s the answer,’” says Sanjay Srivastava, chief digital officer at Genpact, a global digital transformation consultancy. “We try to break that with the explainable AI but in some ways, it works, and in some ways, it doesn’t.”
Managing AIOps also requires a different set of skills than traditional IT management, he says. “AI-oriented skills require more data engineering and being able to model AI algorithms.”
AIOps platforms are evolving rapidly to a point where they can automatically make operational decisions for companies, such as rerouting traffic, reallocating resources, and spinning up new instances. When not set up thoughtfully and carefully, however, things can easily go wrong, says AArete’s Carey.
“When you’re actually programming it to make decisions like shutting down systems, it can turn off your business,” he says. “That’s probably the worst-case scenario.”
More commonly, it may make expensive mistakes.
“A more usual outcome is that it will step in and keep adding servers and all of a sudden your cloud compute bill has gone from $20,000 an hour to $100,000 an hour,” he says. “That homework needs to be done.”
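The kind of homework Carey describes can include putting hard limits around what an automated system is allowed to do on its own. The Python sketch below shows one possible guardrail: a scale-up request is capped both by a maximum step size and by an hourly spending ceiling, and anything beyond that is left for a human to approve. The `ScalingPolicy` fields, the prices, and the limits are hypothetical and are not taken from any particular AIOps product or cloud provider.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    cost_per_server_hour: float  # hypothetical price of one server for one hour
    max_hourly_spend: float      # spending ceiling the business has agreed to
    max_step: int = 2            # most servers the automation may add at once


def approve_scale_up(current_servers, requested, policy):
    """Return how many servers may be added automatically.

    A result of 0 means the request exceeds the guardrails and should be
    escalated to a person instead of being executed.
    """
    step = min(requested, policy.max_step)
    # Largest fleet size that still fits under the spending ceiling.
    max_fleet = int(policy.max_hourly_spend // policy.cost_per_server_hour)
    headroom = max_fleet - current_servers
    return max(0, min(step, headroom))


policy = ScalingPolicy(cost_per_server_hour=50.0, max_hourly_spend=1000.0)

print(approve_scale_up(10, 5, policy))  # limited by max_step
print(approve_scale_up(19, 5, policy))  # limited by the spending ceiling
print(approve_scale_up(20, 5, policy))  # ceiling reached, escalate to a human
```

The design choice is deliberately conservative: the automation can still react quickly to small changes in demand, but it cannot keep adding servers until the bill runs away.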