HTTP Character Encoding – What Encoding Are HTTP Status and Header Lines?

character-encoding http

If I were to write a parser for HTTP, could I assume the encoding of the HTTP headers and status line? Until I read the charset or encoding header, how could I tell what the encoding is? I am under the impression that these lines will always be in ASCII.

I guess I am confused about how HTTP handles various encodings within the same stream of data. I get the impression that the status line and the headers can be in a different encoding than the body. Even in the case where the body is made up of multipart form-data, it sounds like the body has a single encoding. Some clarification/explanation would go a long way.

Best Answer

RFC 7230, the relevant part of the current version of the spec, is pretty clear and to the point:

3. Message Format

[…]
A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII*. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A).

This allows for (at minimum) using a conformant UTF-8 parser, because UTF-8 never uses octets from the ASCII range (%x00-7F) inside a multibyte sequence, so in valid UTF-8 an octet %x0A is always an actual LF character.
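The property is easy to check by hand. The following is only a small illustration with arbitrarily chosen sample characters: it confirms that no LF octet hides inside multibyte UTF-8 sequences, and contrasts that with UTF-16, which is not a superset of US-ASCII.

```python
# Every octet of a multibyte UTF-8 sequence has its high bit set (>= 0x80),
# so an ASCII octet such as LF (0x0A) can never appear inside one.
sample = "é€😀Ċ"  # arbitrary non-ASCII sample characters
utf8_bytes = sample.encode("utf-8")
utf8_has_stray_lf = 0x0A in utf8_bytes

# Contrast: U+010A ("Ċ") encodes to the octets 0x0A 0x01 in UTF-16-LE,
# which an octet-level parser would misread as containing a line feed.
utf16_bytes = "Ċ".encode("utf-16-le")
utf16_has_stray_lf = 0x0A in utf16_bytes

print(utf8_has_stray_lf, utf16_has_stray_lf)
```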

There's a further note that, once you've successfully parsed the basic message into its header key-value pairs plus the message body, you can begin parsing the pieces with a more relaxed or non-default approach, according to certain headers. This is especially useful with RFC 7231's Content-Type header:
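As a rough sketch of that two-stage approach, the head can be split off and parsed purely as octets while the body is left undecoded. The function name `parse_http_head` is made up for this example, and decoding field values as latin-1 is an assumption chosen because it cannot fail on any octet; it is not something the quoted text mandates.

```python
def parse_http_head(raw: bytes):
    """Split a raw HTTP/1.1 message into start line, headers, and body bytes."""
    # Work on octets: the head ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    # The start line is expected to be plain US-ASCII.
    start_line = lines[0].decode("ascii")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        # Assumption for this sketch: latin-1 maps every octet to exactly one
        # character, so decoding a field value never raises.
        # Repeated field names simply overwrite each other here.
        headers[name.decode("ascii").lower()] = value.strip().decode("latin-1")
    return start_line, headers, body  # the body stays as undecoded bytes


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    + "naïve".encode("utf-8")
)
start_line, headers, body = parse_http_head(raw)
print(start_line, headers, body)
```

Only after this step does the Content-Type value become available to decide how the body bytes should be interpreted.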

3.1.1.1. Media Type

HTTP uses Internet media types [RFC2046] in the Content-Type (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order to provide open and extensible data typing and type negotiation.

RFC 2046 is the part of the MIME specification that defines media types, and it has in turn a nice clear section on the charset parameter:

4.1.2. Charset Parameter

A critical parameter that may be specified in the Content-Type field for "text/plain" data is the character set. This is specified with a "charset" parameter, as in:
Content-type: text/plain; charset=iso-8859-1

It goes on to explain that other text/ media types should use the same charset semantics.
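To illustrate how the charset parameter might be applied, here is a minimal sketch that uses Python's standard `email.message.Message` to pull the parameter out of a Content-Type value. The helper name `decode_body` and the UTF-8 fallback are assumptions for this example; the right default depends on the media type and the application.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Decode body bytes using the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header carries no charset.
    charset = msg.get_content_charset() or default  # fallback is an assumption
    return body.decode(charset)


latin1_text = decode_body("text/plain; charset=iso-8859-1", b"caf\xe9")
fallback_text = decode_body("text/plain", "café".encode("utf-8"))
print(latin1_text, fallback_text)
```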

Note that Content-Encoding, Transfer-Encoding, and Content-Transfer-Encoding (obsolete) all refer to a very limited set of encodings for compression or chunking — not character sets.
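The distinction can be shown with a short sketch: a content coding such as gzip is undone first, on bytes, and only then is the charset applied to obtain text. The header values below are hypothetical stand-ins for what a parsed response might carry.

```python
import gzip

# Simulate a response body: text encoded with a charset, then gzip-compressed.
original = "Grüße"
wire_body = gzip.compress(original.encode("iso-8859-1"))

# Hypothetical values taken from already-parsed response headers.
content_encoding = "gzip"   # from Content-Encoding: a compression coding
charset = "iso-8859-1"      # from Content-Type: the character set

# Step 1: undo the content coding (bytes in, bytes out).
payload = gzip.decompress(wire_body) if content_encoding == "gzip" else wire_body
# Step 2: apply the charset (bytes in, text out).
decoded = payload.decode(charset)
print(decoded)
```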

* American National Standards Institute, "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4, 1986.

HTTP 202 Accepted (HTTP/1.1)

You are looking for the HTTP 202 Accepted status. See RFC 2616:

The request has been accepted for processing, but the processing has not been completed.

HTTP 102 Processing (WebDAV)

RFC 2518 suggests using HTTP 102 Processing:

The 102 (Processing) status code is an interim response used to inform the client that the server has accepted the complete request, but has not yet completed it.

but it has a caveat:

The server MUST send a final response after the request has been completed.

I'm not sure how to interpret the last sentence. Should the server avoid sending anything during the processing, and respond only after completion? Or does it only require the server to end the response once the processing terminates? The latter could be useful if you want to report progress: send HTTP 102 and flush the response byte by byte (or line by line).

For instance, for a long but linear process, you can send one hundred dots, flushing after each character. If the client side (such as a JavaScript application) knows that it should expect exactly 100 characters, it can match it with a progress bar to show to the user.

Another example concerns a process which consists of several non-linear steps. After each step, you can flush a log message which would eventually be displayed to the user, so that the end user could know how the process is going.

Issues with progressive flushing

Note that while this technique has its merits, I wouldn't recommend it. One of the reasons is that it forces the connection to remain open, which could hurt in terms of service availability and doesn't scale well.

A better approach is to respond with HTTP 202 Accepted and then either let the client check back later to find out whether the processing has ended (for instance by repeatedly calling a given URI such as /process/result, which would respond with HTTP 404 Not Found or HTTP 409 Conflict until the process finishes and the result is ready), or notify the client when the processing is done, if you're able to call the client back, for instance through a message queue service or WebSockets.

Practical example

Imagine a web service which converts videos. The entry point is:

POST /video/convert

which takes a video file from the HTTP request and does some magic with it. Let's imagine that the magic is CPU-intensive, so it cannot be done in real time during the transfer of the request. This means that once the file is transferred, the server will respond with an HTTP 202 Accepted with some JSON content, meaning “Yes, I got your video, and I'm working on it; it will be ready sometime in the future and will be available through the ID 123.”

The client can subscribe to a message queue to be notified when the processing finishes. Once it is finished, the client can download the processed video by going to:

GET /video/download/123

which leads to an HTTP 200.

What happens if the client queries this URI before receiving the notification? Well, the server will respond with HTTP 404 since, indeed, the video doesn't exist yet. It may be in preparation. It may never have been requested. It may have existed at some point in the past and been removed since. All that matters is that the resulting video is not available.

Now, what if the client cares not only about the final video, but also about the progress (which would be even more important if there is no message queue service or any similar mechanism)?

In this case, you can use another endpoint:

GET /video/status/123

which would return a response similar to this:

HTTP 200
{
    "id": 123,
    "status": "queued",
    "priority": 2,
    "progress-percent": 0,
    "submitted-utc-time": "2016-04-19T13:59:22"
}

Doing the request over and over will show the progress until it's:

HTTP 200
{
    "id": 123,
    "status": "done",
    "progress-percent": 100,
    "submitted-utc-time": "2016-04-19T13:59:22"
}
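A client-side polling loop along these lines could look like the sketch below. It is deliberately transport-agnostic: `fetch_status` is a hypothetical callable standing in for whatever performs GET /video/status/123 and returns the parsed JSON, and the polling interval, the poll limit, and the intermediate "processing" status value are assumptions made for the example.

```python
import time
from typing import Callable


def wait_until_done(
    fetch_status: Callable[[], dict],
    interval_s: float = 1.0,
    max_polls: int = 60,
) -> dict:
    """Poll the status resource until it reports "done" or the budget runs out."""
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "done":
            return status
        time.sleep(interval_s)  # wait before asking again
    raise TimeoutError("task did not finish within the polling budget")


# Stand-in for real HTTP calls: three canned responses in sequence.
responses = iter([
    {"id": 123, "status": "queued", "progress-percent": 0},
    {"id": 123, "status": "processing", "progress-percent": 50},
    {"id": 123, "status": "done", "progress-percent": 100},
])
final = wait_until_done(lambda: next(responses), interval_s=0)
print(final)
```

Capping the number of polls keeps a client from waiting forever on a task that never completes.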

It is crucial to distinguish between those three types of requests:

POST /video/convert queues a task. It should be called only once: calling it again would queue an additional task.
GET /video/download/123 concerns the result of the operation: the resource is the video. The processing (that is, what happened under the hood to prepare the result, prior to and independently of this request) is irrelevant here. It can be called once or several times.
GET /video/status/123 concerns the processing per se. It doesn't queue anything. It doesn't care about the resulting video. The resource is the processing itself. It can be called once or several times.
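The separation between the three requests can be modelled with a small in-memory sketch. The class `VideoService` and its methods are invented for illustration and return `(status_code, body)` tuples instead of real HTTP responses; `finish` is a hypothetical hook standing in for the background worker.

```python
import itertools


class VideoService:
    """Toy model of the three endpoints; no real HTTP or video processing."""

    def __init__(self):
        self._ids = itertools.count(123)
        self._tasks = {}    # task id -> status document
        self._results = {}  # task id -> converted video bytes

    def convert(self, video: bytes):
        # POST /video/convert: every call queues a new, separate task.
        task_id = next(self._ids)
        self._tasks[task_id] = {"id": task_id, "status": "queued", "progress-percent": 0}
        return 202, {"id": task_id}

    def status(self, task_id: int):
        # GET /video/status/<id>: the resource is the processing itself.
        task = self._tasks.get(task_id)
        return (200, dict(task)) if task else (404, None)

    def download(self, task_id: int):
        # GET /video/download/<id>: the resource is the finished video only.
        if task_id not in self._results:
            return 404, None
        return 200, self._results[task_id]

    def finish(self, task_id: int, result: bytes):
        # Hypothetical hook called by the background worker when it is done.
        self._tasks[task_id].update({"status": "done", "progress-percent": 100})
        self._results[task_id] = result


service = VideoService()
code, body = service.convert(b"raw video")
print(code, body, service.download(body["id"]), service.status(body["id"]))
```

Calling `status` or `download` repeatedly never changes anything, while each `convert` call creates another task, which mirrors the distinction drawn in the list above.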
