One more, using urlretrieve:
import urllib
urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
(For Python 3+ use import urllib.request and urllib.request.urlretrieve.)
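For Python 3, the same one-liner becomes:

import urllib.request
urllib.request.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")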
Yet another one, with a "progressbar":
import urllib2

url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib2.urlopen(url)
f = open(file_name, 'wb')
meta = u.info()
file_size = int(meta.getheaders("Content-Length")[0])
print "Downloading: %s Bytes: %s" % (file_name, file_size)

file_size_dl = 0
block_sz = 8192
while True:
    buffer = u.read(block_sz)
    if not buffer:
        break

    file_size_dl += len(buffer)
    f.write(buffer)
    status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
    status = status + chr(8)*(len(status)+1)
    print status,

f.close()
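The snippet above is Python 2. A rough Python 3 port might look like this (a sketch, assuming the same test URL and that the server sends a Content-Length header):

import urllib.request

url = "http://download.thinkbroadband.com/10MB.zip"
file_name = url.split('/')[-1]
u = urllib.request.urlopen(url)
# Content-Length may be missing for chunked responses.
file_size = int(u.getheader("Content-Length"))
print("Downloading: %s Bytes: %s" % (file_name, file_size))

file_size_dl = 0
block_sz = 8192
with open(file_name, 'wb') as f:
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        file_size_dl += len(buffer)
        f.write(buffer)
        status = "%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        print(status, end='\r')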
There's a very simple answer to this: Profile the performance of your web server to see what the performance penalty is for your particular situation. There are several tools out there to compare the performance of an HTTP vs HTTPS server (JMeter and Visual Studio come to mind) and they are quite easy to use.
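For a quick first impression before setting up a full load-testing tool, even a simple timing loop can be informative (a minimal Python sketch; the example.com URLs are placeholders for your own server):

import time
import urllib.request

def average_latency(url, n=50):
    # Fetch the URL n times and return the mean wall-clock latency.
    start = time.time()
    for _ in range(n):
        urllib.request.urlopen(url).read()
    return (time.time() - start) / n

# Placeholder URLs -- point both at the same page on your server.
print("HTTP:  %.4fs" % average_latency("http://www.example.com/"))
print("HTTPS: %.4fs" % average_latency("https://www.example.com/"))

Note this only measures end-to-end latency from a single client; a tool like JMeter will also show server-side CPU cost and behavior under concurrency.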
No one can give you a meaningful answer without some information about the nature of your web site, hardware, software, and network configuration.
As others have said, there will be some level of overhead due to encryption, but it is highly dependent on:
- Hardware
- Server software
- Ratio of dynamic vs static content
- Client distance to server
- Typical session length
- Caching behavior of clients
- Etc. (my personal favorite)
In my experience, servers that are heavy on dynamic content tend to be impacted less by HTTPS because the time spent encrypting (the SSL overhead) is insignificant compared to content generation time.
Servers that are heavy on serving a fairly small set of static pages that can easily be cached in memory suffer from a much higher overhead (in one case, throughput was halved on an "intranet").
Edit: One point that has been brought up by several others is that SSL handshaking is the major cost of HTTPS. That is correct, which is why "typical session length" and "caching behavior of clients" are important.
Many very short sessions mean that handshaking time will overwhelm any other performance factor. Longer sessions incur the handshaking cost at the start of the session, but subsequent requests will have relatively low overhead.
Client caching can happen at several levels, anywhere from a large-scale proxy server down to the individual browser cache. Generally, HTTPS content will not be cached in a shared cache (though a few proxy servers can exploit man-in-the-middle-style behavior to achieve this). Many browsers cache HTTPS content for the current session, and often across sessions. The effect of not caching, or caching less, is that clients retrieve the same content more frequently. This results in more requests and more bandwidth to serve the same number of users.
Best Answer
Your worst case scenario isn't as bad as you think.
You are already parsing the RSS feed, so you already have the image URLs. Say you have an image URL like http://otherdomain.com/someimage.jpg. You rewrite this URL as https://mydomain.com/imageserver?url=http://otherdomain.com/someimage.jpg&hash=abcdeafad. This way, the browser always makes requests over https, so you get rid of the problems. The next part is to create a proxy page or servlet that reads the url and hash parameters, verifies the hash, downloads the image from the original URL, and streams it back to the browser.
This solution has some advantages. You don't have to download the image at the time of creating the html. You don't have to store the images locally. Also, you are stateless; the url contains all the information necessary to serve the image.
Finally, the hash parameter is for security; you only want your servlet to serve images for urls you have constructed. So, when you create the url, compute
md5(image_url + secret_key)
and append it as the hash parameter. Before you serve the request, recompute the hash and compare it to what was passed to you. Since the secret_key is known only to you, nobody else can construct valid urls. If you are developing in Java, the servlet is just a few lines of code; you should be able to port the logic below to any other back-end technology.
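The original answer's Java servlet is not reproduced here. As a rough sketch of the same logic in Python (Flask, the SECRET_KEY name, and the route are illustrative choices, not the author's code):

import hashlib
import urllib.request
from flask import Flask, Response, abort, request

app = Flask(__name__)
SECRET_KEY = "change-me"  # known only to the server

def make_hash(image_url):
    # Same construction used when generating the rewritten URLs:
    # md5(image_url + secret_key)
    return hashlib.md5((image_url + SECRET_KEY).encode()).hexdigest()

@app.route("/imageserver")
def imageserver():
    image_url = request.args.get("url", "")
    supplied_hash = request.args.get("hash", "")
    # Only serve URLs we constructed ourselves.
    if make_hash(image_url) != supplied_hash:
        abort(403)
    upstream = urllib.request.urlopen(image_url)
    return Response(upstream.read(),
                    content_type=upstream.headers.get("Content-Type", "image/jpeg"))

When generating the page, you would call make_hash(image_url) and append the result as the hash parameter of the rewritten URL.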