URIs identify and URLs locate; however, locators are also identifiers, so every URL is also a URI, but there are URIs which are not URLs.
Examples
- Roger Pate
This is my name, which is an identifier.
It is like a URI, but cannot be a URL, as it tells you nothing about my location or how to contact me.
In this case it also happens to identify at least 5 other people in the USA alone.
- 4914 West Bay Street, Nassau, Bahamas
This is a locator, which is an identifier for that physical location.
It is like both a URL and a URI (since all URLs are URIs), and it also identifies me indirectly as "resident of...".
In this case it uniquely identifies me, but that would change if I get a roommate.
I say "like" because these examples do not follow the required syntax.
Popular confusion
From Wikipedia:
In computing, a Uniform Resource Locator (URL) is a subset of the Uniform Resource Identifier (URI) that specifies where an identified resource is available and the mechanism for retrieving it. In popular usage and in many technical documents and verbal discussions it is often incorrectly used as a synonym for URI, ... [emphasis mine]
Because of this common confusion, many products and documentation incorrectly use one term instead of the other, assign their own distinction, or use them synonymously.
URNs
My name, Roger Pate, could be like a URN (Uniform Resource Name), except those are much more regulated and intended to be unique across both space and time.
Because I currently share this name with other people, it's not globally unique, so it would not be appropriate as a URN. However, even if no other family used this name, I'm named after my paternal grandfather, so it still wouldn't be unique across time. And even if that weren't the case, the possibility of naming my descendants after me makes it unsuitable as a URN.
URNs are different from URLs in this rigid uniqueness constraint, even though they both share the syntax of URIs.
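To make the distinction concrete, here is a minimal sketch using Python's standard urllib.parse module (the example URL and ISBN URN are illustrative values, not anything from above):

    from urllib.parse import urlparse

    # Both strings are URIs; only the first is also a URL, because only
    # it says where the resource lives and how to retrieve it.
    url = "https://example.com/books/0451450523"  # locator (and identifier)
    urn = "urn:isbn:0451450523"                   # pure name, no location

    for uri in (url, urn):
        parts = urlparse(uri)
        print(parts.scheme, parts.netloc or "(no location)", parts.path)

The URL parses with a network location (example.com); the URN parses with a scheme of urn and no location at all, which is exactly the difference described above.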
Short answer - de facto limit of 2000 characters
If you keep URLs under 2000 characters, they'll work in virtually any combination of client and server software.
If you are targeting particular browsers, see below for more details on specific limits.
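If you want to enforce that rule of thumb in code, a trivial guard will do; a minimal sketch, where the 2000 figure is the de facto limit discussed here rather than anything mandated by a standard:

    MAX_SAFE_URL_LENGTH = 2000  # de facto limit, not from any RFC

    def is_safe_url_length(url: str) -> bool:
        """True if the URL should work in virtually any client/server stack."""
        return len(url) <= MAX_SAFE_URL_LENGTH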
Longer answer - first, the standards...
RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1), section 3.2.1, says:

The HTTP protocol does not place any a priori limit on the length of a URI. Servers MUST be able to handle the URI of any resource they serve, and SHOULD be able to handle URIs of unbounded length if they provide GET-based forms that could generate such URIs. A server SHOULD return 414 (Request-URI Too Long) status if a URI is longer than the server can handle (see section 10.4.15).
That RFC has been obsoleted by RFC 7230, which is a refresh of the HTTP/1.1 specification. It contains similar language, but also goes on to suggest this:
Various ad hoc limitations on request-line length are found in practice. It is RECOMMENDED that all HTTP senders and recipients support, at a minimum, request-line lengths of 8000 octets.
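You can observe that 414 behaviour yourself; here's a hedged sketch using Python's standard http.client (example.com is a placeholder host, and the length at which you actually get a 414 depends entirely on that server's configuration):

    import http.client

    def status_for_path_length(host: str, length: int) -> int:
        """Send a GET with a padded path of the given length; return the status."""
        conn = http.client.HTTPSConnection(host, timeout=10)
        try:
            conn.request("GET", "/" + "a" * (length - 1))
            return conn.getresponse().status
        finally:
            conn.close()

    # Many servers accept the 8000 octets RFC 7230 recommends,
    # but reject far longer request lines with 414.
    for length in (2000, 8000, 100_000):
        print(length, status_for_path_length("example.com", length))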
...and the reality
That's what the standards say. As for the reality, there was an article on boutell.com (the link goes to an Internet Archive backup) that discussed what individual browser and server implementations will support. The executive summary is:
Extremely long URLs are usually a mistake. URLs over 2,000 characters will not work in the most popular web browsers. Don't use them if you intend your site to work for the majority of Internet users.
(Note: this is a quote from an article written in 2006; by 2015, IE's declining usage meant that longer URLs worked for the majority of users. However, IE still has the limitation...)
Internet Explorer's limitations...
IE8's maximum URL length is 2083 chars, and it seems IE9 has a similar limit.
I've tested IE10 and the address bar will only accept 2083 chars. You can click a URL which is longer than this, but the address bar will still only show 2083 characters of this link.
There's a nice writeup on the IE Internals blog which goes into some of the background to this.
There are mixed reports that IE11 supports longer URLs; see the comments below. Given that some people report issues, the general advice still stands.
Search engines like URLs < 2048 chars...
Be aware that the sitemaps protocol, which allows a site to inform search engines about available pages, has a limit of 2048 characters in a URL. If you intend to use sitemaps, a limit has been decided for you! (see Calin-Andrei Burloiu's answer below)
There's also some research from 2010 into the maximum URL length that search engines will crawl and index. They found the limit was 2047 chars, which appears aligned with the sitemap protocol spec. However, they also found the Google SERP tool wouldn't cope with URLs longer than 1855 chars.
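If you generate sitemaps programmatically, it's easy to flag offending entries up front. A minimal sketch (the XML namespace is the real one from the sitemaps protocol; the filename is a placeholder):

    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    URL_LIMIT = 2048  # sitemap URLs must be shorter than 2,048 characters

    def overlong_locs(sitemap_path: str) -> list[str]:
        """Return every <loc> URL in the sitemap that breaks the limit."""
        tree = ET.parse(sitemap_path)
        return [loc.text for loc in tree.iter(SITEMAP_NS + "loc")
                if loc.text and len(loc.text) >= URL_LIMIT]

    print(overlong_locs("sitemap.xml"))  # placeholder filename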
CDNs have limits
CDNs also impose limits on URI length, and will return a 414 Request-URI Too Long when those limits are exceeded.
(credit to timrs2998 for providing that info in the comments)
Additional browser roundup
I tested the following against an Apache 2.4 server configured with a very large LimitRequestLine and LimitRequestFieldSize.
Browser     Address bar   document.location or anchor tag
----------------------------------------------------------
Chrome      32779         >64k
Android     8192          >64k
Firefox     >64k          >64k
Safari      >64k          >64k
IE11        2047          5120
Edge 16     2047          10240
See also this answer from Matas Vaitkevicius below.
Is this information up to date?
This is a popular question, and as the original research is ~14 years old, I'll try to keep it up to date: as of September 2020, the advice still stands. Even though IE11 may accept longer URLs, the ubiquity of older IE installations plus the search engine limitations mean staying under 2000 chars is the best general policy.
Best Answer
Hey, since I'm the author of this citation, I'll respond :-)
There are two big issues on large sites: concurrent connections and latency. Concurrent connections are caused by slow clients that take ages to download content, and by idle connection states. Those idle states are caused by connection reuse to fetch multiple objects, known as keep-alive, and are further increased by latency. When the client is very close to the server, it can make intensive use of the connection and ensure it is almost never idle. However, when the sequence ends, nobody cares to quickly close the channel, and the connection remains open and unused for a long time. That's the reason why many people suggest using a very low keep-alive timeout.

On some servers, like Apache, the lowest timeout you can set is one second, and that is often far too much to sustain high loads: if you have 20000 clients in front of you and they fetch on average one object every second, you'll have those 20000 connections permanently established. 20000 concurrent connections on a general-purpose server like Apache is huge, will require between 32 and 64 GB of RAM depending on what modules are loaded, and you probably cannot hope to go much higher even by adding RAM. In practice, for 20000 clients you may even see 40000 to 60000 concurrent connections on the server, because browsers will try to set up 2 to 3 connections if they have many objects to fetch.
If you close the connection after each object, the number of concurrent connections will dramatically drop. Indeed, it will drop to a fraction equal to the average time to download an object divided by the time between objects. If you need 50 ms to download an object (a miniature photo, a button, etc.), and you download on average 1 object per second as above, then you'll only have 0.05 connections per client, which is only 1000 concurrent connections for 20000 clients.
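That back-of-the-envelope figure is easy to sanity-check; a minimal sketch using only the numbers from this answer:

    # Concurrent connections when closing after each object, assuming each
    # client fetches one object per second on average.
    clients = 20_000
    download_time_s = 0.05          # 50 ms per object
    time_between_objects_s = 1.0

    per_client = download_time_s / time_between_objects_s  # 0.05
    print(clients * per_client)     # -> 1000.0 concurrent connections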
Now the time to establish new connections is going to count. Far-away clients will experience unpleasant latency. In the past, browsers used large numbers of concurrent connections when keep-alive was disabled. I remember figures of 4 on MSIE and 8 on Netscape. This really divided the average per-object latency by that much. Now that keep-alive is present everywhere, we're not seeing numbers that high anymore, because doing so further increases the load on remote servers, and browsers take care of protecting the Internet's infrastructure.
This means that with today's browsers, it's harder to make non-keep-alive services as responsive as keep-alive ones. Also, some browsers (e.g. Opera) use heuristics to try to use pipelining. Pipelining is an efficient way of using keep-alive, because it almost eliminates latency by sending multiple requests without waiting for a response. I have tried it on a page with 100 small photos, and the first access is about twice as fast as without keep-alive, but the next access is about 8 times as fast, because the responses are so small that only latency counts (only "304" responses).
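To get a feel for the latency difference, here's a hedged sketch timing N requests over one reused keep-alive connection versus a fresh connection per request, using Python's standard http.client (example.com is a placeholder; the server must actually honour keep-alive for the reuse to occur):

    import http.client, time

    HOST, PATH, N = "example.com", "/", 20  # placeholder target

    def fresh_connections() -> float:
        start = time.perf_counter()
        for _ in range(N):
            conn = http.client.HTTPSConnection(HOST, timeout=10)
            conn.request("GET", PATH)
            conn.getresponse().read()  # drain the body
            conn.close()               # new TCP + TLS handshake every time
        return time.perf_counter() - start

    def reused_connection() -> float:
        start = time.perf_counter()
        conn = http.client.HTTPSConnection(HOST, timeout=10)
        for _ in range(N):
            conn.request("GET", PATH)
            conn.getresponse().read()  # must drain before reusing
        conn.close()
        return time.perf_counter() - start

    print("fresh :", fresh_connections(), "s")
    print("reused:", reused_connection(), "s")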
I'd say that ideally we should have some tunables in the browsers to make them keep connections alive between fetched objects, and drop them immediately when the page is complete. But unfortunately we're not seeing that.
For this reason, some sites which need to install general-purpose servers such as Apache on the front side, and which have to support large numbers of clients, generally have to disable keep-alive. And to force browsers to increase the number of connections, they use multiple domain names so that downloads can be parallelized. It's particularly problematic on sites making intensive use of SSL, because the connection setup cost is even higher, as there is one additional round trip.
What is more commonly observed nowadays is that such sites prefer to install light frontends such as haproxy or nginx, which have no problem handling tens to hundreds of thousands of concurrent connections. They enable keep-alive on the client side, and disable it on the Apache side. On the Apache side, the cost of establishing a connection is almost null in terms of CPU, and not noticeable at all in terms of time. This provides the best of both worlds: low latency due to keep-alive with very low timeouts on the client side, and a low number of connections on the server side. Everyone is happy :-)
Some commercial products further improve on this by reusing connections between the front load balancer and the server, and multiplexing all client connections over them. When the servers are close to the LB, the gain is not much higher than with the previous solution, but it will often require adaptations of the application to ensure there is no risk of session crossing between users due to the unexpected sharing of a connection by multiple users. In theory this should never happen. Reality is much different :-)