
HTTP 1.1 Connection Timeouts, Load Balancers, and Site Creation

HTTP 1.1 [RFC2616] does some really amazing things for performance, but most of that’s lost on the average consumer. In HTTP 1.0 we made a request, and when the response was complete we tore down the HTTP connection along with the underlying TCP channel. That doesn’t sound like a big deal. Really, how much effort is it to form a TCP channel? Not a ton; however, it does require a three-way handshake (SYN, SYN-ACK, ACK) before any data can flow. Do that handshake enough times and it starts taking a non-trivial amount of time.
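To put a rough number on that, here is a back-of-the-envelope sketch. It assumes requests go out one after another and that each new TCP connection costs about one round trip (SYN out, SYN-ACK back) before the request can be sent. It deliberately ignores parallel browser connections, TLS, and DNS, and the 20-request, 50 ms figures are hypothetical.

```python
def handshake_overhead_ms(requests: int, rtt_ms: float, persistent: bool) -> float:
    """Estimate time spent on TCP handshakes for one page load.

    Assumes requests are sent one after another and that each new connection
    costs one round trip (SYN out, SYN-ACK back) before the client can send
    its request right after the final ACK. Ignores parallel connections,
    TLS, DNS, and slow start.
    """
    # One handshake total if the connection is reused (HTTP 1.1 style),
    # otherwise one handshake per request (HTTP 1.0 style).
    handshakes = 1 if persistent else requests
    return handshakes * rtt_ms


# A hypothetical page with 20 resources over a 50 ms round trip.
print(handshake_overhead_ms(20, 50.0, persistent=False))  # 1000.0
print(handshake_overhead_ms(20, 50.0, persistent=True))   # 50.0
```

Under those assumptions a full second of handshaking shrinks to a single round trip, which is the kind of saving persistent connections are after.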

With HTTP 1.1 we get persistent connections. After the request/response, the connection is left open and reused for the next request. When pages have 20 or more requests on them, this can make a real difference. Why 20 or more requests? Nearly every page pulls in CSS, JavaScript, and images, and each of those is a separate request.
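A minimal sketch using only Python’s standard library makes the difference visible: a local test server counts the TCP connections it accepts while a client fetches 20 resources, first over one reused HTTP 1.1 connection and then with a fresh connection per request, the way HTTP 1.0 behaved. The handler name `CountingHandler` and the `/resource/{i}` paths are made up for illustration.

```python
import http.client
import http.server
import threading


class CountingHandler(http.server.BaseHTTPRequestHandler):
    # Answer as HTTP/1.1 so connections stay open between requests by default.
    protocol_version = "HTTP/1.1"
    connections = 0  # TCP connections accepted so far
    lock = threading.Lock()

    def setup(self):
        # socketserver creates one handler instance per accepted TCP connection.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        # Content-Length tells the client where the response ends, so the
        # server does not have to close the connection to signal it.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def count_connections(requests: int, reuse: bool) -> int:
    """Fetch `requests` resources; return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        conn = http.client.HTTPConnection(host, port)
        for i in range(requests):
            if not reuse and i > 0:
                # Mimic HTTP 1.0: a brand-new connection for every request.
                conn.close()
                conn = http.client.HTTPConnection(host, port)
            conn.request("GET", f"/resource/{i}")
            conn.getresponse().read()  # drain the body so the socket can be reused
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


print(count_connections(20, reuse=True))   # 1
print(count_connections(20, reuse=False))  # 20
```

Reading each response body completely matters here: `http.client` only reuses the socket once the previous response has been fully consumed.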

Even better, we get RFC4559. Why do we care? Well, among other things, that RFC adds persistent authentication to the HTTP 1.1 connection. So we don’t have to authenticate each request; we can do it once, and that authentication remains valid while the HTTP 1.1 connection stays active. That reduces the load of validating the user’s credentials.
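The idea of authenticating the connection rather than each request can be sketched the same way. This is a toy, assuming a made-up `Token` scheme and secret in place of the real multi-step Negotiate (Kerberos/NTLM) exchange that RFC4559 describes: the server checks credentials once per connection, remembers the result for as long as that connection stays open, and a new connection has to start over. `ConnectionAuthHandler`, `EXPECTED`, and the paths are hypothetical names.

```python
import http.client
import http.server
import threading

EXPECTED = "Token demo-secret"  # hypothetical scheme and credential


class ConnectionAuthHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep connections open between requests
    credential_checks = 0  # how many times the server validated credentials

    def setup(self):
        super().setup()
        # One handler instance serves every request on a TCP connection, so
        # this flag lives exactly as long as the connection does.
        self.authenticated = False

    def _reply(self, status, body=b""):
        self.send_response(status)
        if status == 401:
            self.send_header("WWW-Authenticate", 'Token realm="demo"')
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if not self.authenticated:
            ConnectionAuthHandler.credential_checks += 1
            if self.headers.get("Authorization") != EXPECTED:
                self._reply(401)
                return
            self.authenticated = True  # remembered for the rest of this connection
        self._reply(200, b"ok")

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def run_demo(requests: int):
    """Return (statuses, credential_checks) for one authenticated connection
    followed by a fresh connection that sends no credentials."""
    ConnectionAuthHandler.credential_checks = 0
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), ConnectionAuthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    statuses = []
    try:
        conn = http.client.HTTPConnection(host, port)
        for i in range(requests):
            # Only the first request on this connection carries credentials.
            headers = {"Authorization": EXPECTED} if i == 0 else {}
            conn.request("GET", f"/page/{i}", headers=headers)
            resp = conn.getresponse()
            resp.read()
            statuses.append(resp.status)
        conn.close()

        # A new connection starts out unauthenticated.
        fresh = http.client.HTTPConnection(host, port)
        fresh.request("GET", "/page/again")
        resp = fresh.getresponse()
        resp.read()
        statuses.append(resp.status)
        fresh.close()
    finally:
        server.shutdown()
        server.server_close()
    return statuses, ConnectionAuthHandler.credential_checks


print(run_demo(3))  # ([200, 200, 200, 401], 2)
```

Three requests cost one credential check; the fourth request fails only because it arrived on a different connection.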


This is all goodness, but I ran into a problem where SharePoint showed some really odd behavior when creating sites and sub-sites. With Silverlight enabled, we would get either a generic error or an error saying the site being created was a duplicate. These errors seemed to occur while the Silverlight app was calling addgallery.aspx. If we disabled Silverlight and created the site from the web UI, the page would process and then, instead of redirecting to the new site, show a lovely less-than (<) sign.

We were looking at ways to control how themes are applied to the page, since this client allowed their users to select multiple themes and we needed to apply the default value. This takes a while. In our case, creating the site and applying the theme, master pages, etc. takes about 45 seconds.

The problem, as it turns out, was that the load balancer was timing out sessions at 30 seconds, well short of the roughly 45 seconds site creation needs. The result was that in the middle of a stream of requests, the rug was being pulled out from underneath the client. The error was recorded, but because the failure was so unexpected, what surfaced were generic errors, or, in one case, the expected issue of a duplicate site. The web UI simply didn’t know how to cope and stopped short of the redirect.
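The failure mode can be reproduced in miniature. In the sketch below, a local server takes 0.45 seconds to "create a site" and sends nothing back while it works, standing in for the 45-second operation. Because an intermediary can’t be simulated without a lot more code, the client’s own socket timeout plays the role of the load balancer’s timer: 0.30 seconds for the 30-second setting and 1.20 seconds for the 120-second IIS value (everything scaled down 100x). `SlowSiteHandler`, `WORK_SECONDS`, and the `/create-site` path are hypothetical.

```python
import http.client
import http.server
import socket
import threading
import time

# Scaled down 100x: 0.45 s stands in for the ~45 s site creation.
WORK_SECONDS = 0.45


class SlowSiteHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        time.sleep(WORK_SECONDS)  # nothing is sent back while the "site" is built
        body = b"created"
        try:
            self.send_response(200)
            # Sending "Connection: close" also tells the handler to stop after this request.
            self.send_header("Connection", "close")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the caller already gave up and closed its end

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def create_site(timeout_seconds: float) -> str:
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowSiteHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    # The socket timeout stands in for the load balancer's timeout.
    conn = http.client.HTTPConnection(host, port, timeout=timeout_seconds)
    try:
        conn.request("POST", "/create-site", body=b"title=demo")  # hypothetical endpoint
        return conn.getresponse().read().decode()
    except socket.timeout:
        return "timed out"
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


print(create_site(0.30))  # timed out (the "30 second" setting)
print(create_site(1.20))  # created   (the "120 second" setting)
```

The server finishes its work either way; only the caller’s view differs, which matches the symptoms above where the site could end up created even though the client saw an error.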

The solution was to match the load balancer’s timeout to the IIS HTTP 1.1 connection timeout of 120 seconds. Once we did that, the problem appears to have disappeared.