Almost every company is using at least some cloud services today, and they’re not just using packaged SaaS apps, PaaS services and IaaS virtual machines. Websites and custom apps are built using application programming interfaces (APIs) for everything from mapping and messaging to analytics, fraud detection and speech recognition.
Software-as-a-service (SaaS) offerings often provide APIs that let you work with them through third-party apps and services, or even build your own. For example, more than 50 percent of Salesforce’s traffic — and revenue — comes through its APIs rather than directly from its own Web-based service. For eBay, it’s 60 percent, and for Expedia it’s 90 percent. If you use Twilio to send text messages for customer support, or MasterCard’s fraud detection services, you’re relying on those APIs for your own key business processes. How do you measure and monitor them to find out whether you’re getting an acceptable level of service?
Just because you’re using a cloud service doesn’t mean you can assume it will perform well. Outages are rare, and you can use geo-redundancy to make sure your cloud app fails over to another region if service in one region goes down. But while SLAs with vendors of hosted systems cover uptime, they rarely make any mention of latency. Too much latency, and your API calls will time out, creating a poor experience for your users and customers.
Microsoft encountered a problem like that when it started moving its Bing search engine to continuous deployment and started measuring some external dependencies for the first time, says Craig Miller, a Bing technical adviser at Microsoft. “We found services we were using where we didn’t realize the reliability wasn’t more than three nines,” he says.
You may not even know how many APIs and SaaS apps are in use across your company, now that many business units are buying their own cloud services, so you might want to start using tools that tell you what’s active in your network. Such offerings include Microsoft’s Azure Cloud App Discovery or Imperva’s Skyfence. This is particularly important if you have services that you inherited from a predecessor or acquired in a merger — you need to find out what you’re taking on.
But once you know, you need to do more than keep an eye on the status alerts for the services and APIs you’re relying on, and you can’t assume that SLAs will cover every conceivable problem.
API gateways — which can be cloud services like the Amazon API Gateway, Apigee API Gateway, Tibco’s Mashery API manager and Microsoft’s Azure API Management system, or on-premises API managers like those available from Akana, Layer7 and 3scale — let you control access to APIs, manage resources by policy, rate-limit clients and see usage analytics. Some, like Azure API Management, function as proxies that let you control internal usage of external APIs. But mostly, they’re about packaging up and offering your own APIs and Web services, helping you scale, monitor and distribute them. That doesn’t help you much when it comes to understanding the external APIs and microservices that you’re probably using, often without measuring. And if you use a mix of internal and public APIs, you’ll want to think about monitoring those explicitly as well.
Latency and location
The question, according to David O’Neill, CEO of API analytics service APImetrics, is, “Who tells you about problems caused by APIs? Your dev and ops team, or your customers? Do you know what your infrastructure is doing for your customers end to end? You probably don’t, and if you did you’d be horrified.”
Working with APIs is more complex than many people understand, O’Neill maintains. “There are some really weird things people need to take into account that the industry is only just realizing,” he says. “We’ve created a new world of Web apps and services where companies have no idea about what they’re paying for, if they’re working and what they’re actually doing. People [rely on SLAs], but they have SLAs that don’t have the right measures. SLAs talk about uptime; they don’t talk about latency or responsiveness.”
APImetrics’ Insight score works like a credit score for an API, giving you a quick overview of performance and variability.
You can’t just look in your own logs, or run your own tests, to see how responsive APIs are, because the problems may not be visible there: your logs only tell you what performance you’ll see when you access the API from your own network. Nor can you rely on a single cloud location to test from. For example, “if you call the Foursquare API from the Azure region in Virginia, it will either fail or it can take up to 20 minutes to return any response,” O’Neill says. “But that’s only from Azure in Virginia; from Azure in California it works just fine.”
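One way to surface region-specific failures like that is to run the same timed probe from each cloud region you depend on. The sketch below is a minimal illustration, not APImetrics’ tooling; the region labels and the example endpoint in the comment are hypothetical:

```python
import time

def probe(call, region):
    """Time a single API call and record which vantage point made it."""
    start = time.perf_counter()
    try:
        call()
        ok = True
    except Exception:
        ok = False  # timeouts and errors count as failures, but still get timed
    latency_ms = (time.perf_counter() - start) * 1000
    return {"region": region, "ok": ok, "latency_ms": latency_ms}

# Deployed in a real probe, `call` would be something like:
#   lambda: urllib.request.urlopen("https://api.example.com/v1/ping", timeout=10)
# Run the same probe from each region and compare the results side by side.
```

Collecting the per-region dictionaries centrally makes a Virginia-only failure like the Foursquare example stand out immediately.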
Simple logs and measurements won’t tell you everything, and you certainly can’t rely on average latency readings. If the average latency you’re seeing is, say, 250 milliseconds, you have to check to see if it includes some unusually fast responses masking the ones that take 5 or 10 seconds. Remember to think about the scale of API calls: If you only make 1,000 calls, a 99.6 percent success rate means there are only four failures, and that level of performance isn’t likely to cause problems. But if you make 1 million calls, that’s 4,000 failures. And when you start using multiple APIs from different services, you have to consider latency between them, too.
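The arithmetic is easy to check for yourself. A sketch with made-up latency samples shows how a mean can look healthy while a high percentile reveals the slow tail, and how the article’s 99.6 percent success rate scales with call volume:

```python
# Illustrative numbers only: 95 fast calls at 120 ms, 5 slow ones at 5 seconds.
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

latencies_ms = [120] * 95 + [5000] * 5
mean = sum(latencies_ms) / len(latencies_ms)   # 364 ms -- looks acceptable
p99 = percentile(latencies_ms, 99)             # 5000 ms -- the hidden tail

# The same 99.6% success rate at different call volumes:
failures_small = round(1_000 * (1 - 0.996))      # 4 failures
failures_large = round(1_000_000 * (1 - 0.996))  # 4,000 failures
```

The mean alone would pass most dashboards; the 99th percentile is where the 5-second calls show up.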
Looking at a range of measurements and analyses gives you a clearer picture of API performance.
It gets worse when you have international customers. “If you’re measuring performance on your server, how are you going to measure that for customers in Japan? You can’t stick everything into AWS in Virginia and run a global service, but a lot of companies do just that,” O’Neill warns. “You have to look at it end to end. If you have Japanese customers and you’re in the East Coast AWS data center, there’s going to be latency because it’s just down to the speed of light how fast they can route a query. And if you’ve got three API calls behind the public one and each of them requires a transatlantic hop, that could be adding lots of latency. You won’t see that on your server — and your customer will not experience what you do.”
You also need to look at how you treat API calls once the response reaches your network. “People assume that APIs are just Web pages; they’re not,” O’Neill explains. “Just because it’s an HTTP call doesn’t mean it works like or can be treated like a Web call.”
U.K. energy provider First Utility was using a range of testing and monitoring tools to manage the APIs behind its customer account services and the iOS and Android apps that let customers with smart meters monitor their energy usage in real time. But when First Utility tried the APImetrics service, it quickly discovered that its internal architecture didn’t treat API calls the same way as Web traffic and was actually throttling them, degrading the experience users were getting. Another APImetrics customer found that while Web traffic went through its enterprise service bus exactly as expected, API calls going through the same bus experienced significant delays.
In another case, the problems that seemed to be caused by an API were down to network logging software that was running out of disk space once a week: Every Tuesday afternoon while it was in maintenance mode deleting logs to make space, all the API calls it was logging were stuck in a task queue. That would trigger lots of complaints from customers, but when the engineers checked the service the next morning, the API and the system that called it would both be working normally again.
Network security, like deep packet inspection, might also affect how your network handles API calls and responses.
“It’s about taking an outside-in approach,” says O’Neill. “Until you focus on the end-to-end experience, you will not know how APIs are functioning for the apps you build, for your own users or for your customers.”
Is it slow or is it unusually slow?
Not all clouds are equal. According to measurements that APImetrics has made over the past few years, if you’re using DNS routing inside Azure, you’ll see worse performance. “What takes 10 milliseconds in an AWS cluster takes 60 milliseconds inside an Azure one,” O’Neill says. “When you switch to using IP addresses though, the Azure speed is the same as the AWS speed.”
Developers using APIs need to consider these issues and start choosing between different APIs and cloud services based on responsiveness as well as functionality and cost. And ops will need to monitor API responsiveness over time. That’s not necessarily happening today, O’Neill says, “because it’s hard to measure, and you need to think about where you’re measuring from, as well as what and how. It’s not getting measured, and it’s getting brushed under the carpet.”
“Outliers are items that would effectively have fallen so far outside the statistical parameters that there’s no point pretending they actually worked,” says APImetrics CEO David O’Neill.
For that to change, and to make API performance management more common, business leaders will have to start caring about those questions. “It’s all about taking issue-detection and performance monitoring out of DevOps and putting it in the hands of business managers and owners where it really belongs,” O’Neill argues. “Revenue-generating, customer-facing teams will be the ones who will be responsible for this stuff, and they’re going to call the shots.”
To serve that audience, APImetrics is building metrics that are designed to be more comprehensible for business leaders who aren’t networking experts. One approach is a number indicating the reliability of an API. O’Neill compares it to a credit score. “One API might be slow but consistently slow. You need to worry about the variation from what you expect rather than the speed,” he explains. “If 5 percent of the time it’s too slow to use, then that’s a problem.”
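A crude way to capture “consistently slow versus erratically slow” is to score variability relative to the mean. The coefficient-of-variation sketch below is an assumption for illustration only — it is not the formula behind APImetrics’ score:

```python
import statistics

def consistency(samples):
    """Std dev relative to mean latency: 0.0 means perfectly steady."""
    return statistics.pstdev(samples) / statistics.fmean(samples)

# Illustrative numbers: both APIs average 600 ms.
steady = [600] * 10            # slow, but consistently slow
jittery = [100] * 9 + [5100]   # usually fast, occasionally terrible
# steady scores 0.0; jittery scores 2.5 despite a faster typical call
```

Two APIs with identical average latency can behave very differently for users, which is exactly the distinction O’Neill draws: the variation from what you expect matters more than the raw speed.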
Another useful exercise is to look at outliers: How many calls have been exceptionally poor?
“What you need to know is not whether the API is up, but whether it’s functioning well enough to serve your customers. That sounds like the same thing, but they’re subtly different,” O’Neill cautions. “You need to be able to see what’s statistically normal and when you’re not getting that. You might not like what’s statistically normal for your API, but that’s a different problem.”
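Counting those exceptionally poor calls can be as simple as flagging anything far outside the distribution of recent samples. In this sketch the three-standard-deviation cutoff is an arbitrary illustrative choice, not a recommendation from the article:

```python
import statistics

def outliers(samples, n_sigma=3):
    """Return calls slower than mean + n_sigma standard deviations."""
    bound = statistics.fmean(samples) + n_sigma * statistics.pstdev(samples)
    return [s for s in samples if s > bound]

# Illustrative data: one pathological 10-second call among fast ones.
latencies_ms = [100] * 99 + [10_000]
# outliers(latencies_ms) flags only the 10,000 ms sample
```

Tracking the count of such calls over time shows whether you are drifting away from what is statistically normal for that API, which is the question O’Neill says matters.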
This article was written by Mary Branscombe from CIO and was legally licensed through the NewsCred publisher network.