Ageing machines which control the internet were designed with an arbitrary limit on the amount of data they could handle. Yesterday, we hit that limit, says Matthew Sparkes
You may have noticed yesterday that your internet connection was rather sluggish, or perhaps went down entirely. You were not alone: problems were reported around the world. We even had issues here at the Telegraph.
Auction site eBay, for instance, collapsed for much of the day. The company has not explained the exact nature of the problem, but admitted in a statement that “technical experts identified this was due to upstream Internet Service Provider (ISP) issues”. Password manager LastPass was also affected, leaving customers locked out of their accounts.
The issue, according to many experts, was with something called the Border Gateway Protocol (BGP). You may never have heard of it, but it is absolutely vital to the operation of the internet – and right now it is causing serious problems.
BGP is what tier-one ISPs, your last-mile ISP and various large networks use to route data from their own machines to others, and vice versa. When you visit a website, that data bounces all over the world, through machines belonging to all manner of companies and organisations. To make this work, machines called routers (large commercial versions of what you have at home) keep a table of known, trusted routes through the tangled web.
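The core job of that routing table can be sketched in a few lines: given a destination address, the router picks the most specific known prefix that contains it. This is a simplified illustration only – real routers do this in specialised hardware, and the prefixes and peer names below are invented:

```python
import ipaddress

# A toy routing table: each entry maps a CIDR prefix to a next hop.
# Real BGP tables hold hundreds of thousands of such entries.
routing_table = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gateway",  # catch-all route
    ipaddress.ip_network("203.0.113.0/24"): "peer-A",
    ipaddress.ip_network("203.0.113.128/25"): "peer-B",    # more specific
}

def next_hop(address: str) -> str:
    """Pick the most specific (longest) matching prefix for the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routing_table if ip in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("203.0.113.200"))  # "peer-B" – the /25 beats the /24
print(next_hop("8.8.8.8"))        # "default-gateway"
```

The "longest prefix wins" rule is why the table keeps growing: every time a network is split into smaller, more specific blocks, more entries are needed.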
This routing table has been constantly growing in size as the internet expands and becomes more complex – more information needs to be stored in order to allow the router to bounce data to the correct destination along a logical route. Until late 2001, the size of the table was growing exponentially, which was clearly unsustainable. A big effort was made to implement more efficient methods, such as aggregating many specific routes into fewer, broader ones, which temporarily slowed the expansion. But it didn’t last long.
Now we are at the point where some older routers are struggling to cope: their memory is too small and their processors not powerful enough. A full copy of the routing table now contains 512,000 routes, and older hardware was never designed with that in mind. Many devices have a strict 512,000-route limit, put in place by programmers many years ago who were forced to choose a number more or less arbitrarily; you don’t build hardware capable of operating for a hundred years, as the cost would be enormous, but you must still ensure a practical lifespan. The result is often little more than an educated guess.
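What a hard capacity limit looks like in practice can be sketched as follows. The 512,000 figure is from the article; the class, method names and failure behaviour here are an invented illustration, not the firmware of any real router:

```python
# Sketch of a routing table with a hard, compiled-in capacity limit.
MAX_ROUTES = 512_000  # the arbitrary ceiling chosen years in advance

class RoutingTable:
    def __init__(self, capacity: int = MAX_ROUTES):
        self.capacity = capacity
        self.routes: dict[str, str] = {}

    def add_route(self, prefix: str, next_hop: str) -> bool:
        """Install a route, refusing new entries once the table is full."""
        if prefix not in self.routes and len(self.routes) >= self.capacity:
            return False  # table full: the new route is simply dropped
        self.routes[prefix] = next_hop
        return True

# A tiny capacity stands in for 512,000 to show the failure mode.
table = RoutingTable(capacity=3)
assert table.add_route("10.0.0.0/8", "A")
assert table.add_route("10.1.0.0/16", "B")
assert table.add_route("10.2.0.0/16", "C")
assert not table.add_route("10.3.0.0/16", "D")  # rejected: over capacity
```

Once the real table crossed the ceiling, routes that would not fit were effectively invisible to the affected hardware – hence destinations that simply could not be reached.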
As these machines struggle, the effect is that ISPs experience outages, hosting companies have problems and websites either go down or slow down.
On Twitter, Tony Hughes (CP) wrote: “If you noticed anything strange this morning, e.g. inability to access websites, it was because the global routing table hit 512K at 9am BST”
The problem has been anticipated for years, but replacing these machines is an expensive, non-trivial task. It’ll happen – especially now that BGP issues are causing such large problems – but not overnight.
Many affected machines are already back up and running; others were replaced months or years ago. The problem should also be eased slightly by the gradual adoption of a newer protocol, IPv6, which will reduce the strain on BGP. But further outages cannot be ruled out.
Web hosting company Liquid Web tweeted yesterday that: “As ISPs have recovered from #512k active BGP routes being reached, many of our customers affected by these carrier issues have regained the ability to reach their sites. We are still currently up, working to get a timeframe when sites can be reached from all locations and from any ISP.”
Other problems with arbitrary numbers
9/9/99 – Years ago many computer programmers used this distant date to indicate that a list had come to an end; it was shorthand for “this list of dates is now finished”. That presented a problem, of course, when September 9, 1999 eventually rolled around.
Y2K – A similar problem struck when the year 2000 arrived. Back in the days when computer memory was an expensive luxury, not to be wasted lightly, programmers abbreviated years to their last two digits. In 2000 this made many programs think it was 1900. Many mock the scaremongering that took place in the years and months before this limit, but it sparked a flurry of activity that quite possibly averted disaster. A lack of chaos, in this case, was a sign of success, not inaccurate predictions.
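The Y2K pitfall can be shown in a couple of lines. This is a deliberately naive sketch of the old assumption, not any particular program:

```python
# Why two-digit years break: old software assumed every year was 19xx.
def year_from_two_digits(yy: int) -> int:
    """Expand a two-digit year the way much pre-2000 code did."""
    return 1900 + yy

print(year_from_two_digits(99))  # 1999 – fine
print(year_from_two_digits(0))   # 1900 – but the calendar says 2000
```

The saved two bytes per date mattered when memory was scarce; the cost only arrived decades later.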
Year 10,000 – Have we learned from Y2K? When the year 10,000 arrives – the first with five digits – will all of our computer systems cope? At least we have plenty of time to fix the issue.