HTTP Character Encoding – What Encoding Are HTTP Status and Header Lines?

character encoding, http

If I were going to write a parser for HTTP, could I assume the encoding of the HTTP headers and status line? Until I read the charset or encoding header, how could I tell what the encoding is? My impression is that these lines will always be in ASCII.

I guess I am confused about how HTTP handles various encodings within the same stream of data. It seems the status line and the headers can be in a different encoding than the body. Even when the body is made up of multipart form-data, it sounds like the body has a single encoding. Some clarification/explanation would go a long way.

Best Answer

RFC 7230, the relevant part of the current version of the spec, is pretty clear and to the point:

3. Message Format

[…]
A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII*. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A).

This allows for (at minimum) using a conformant UTF-8 parser, because UTF-8 never uses octets in the ASCII range (%x00-7F) as part of a multibyte sequence: every octet of a multibyte character has its high bit set. So %x0A, for example, will always be correctly recognized as an actual LF character.
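To make the "sequence of octets" point concrete, here is a minimal sketch of a parser that frames the message on raw bytes and only decodes afterwards. The function name `split_head` is hypothetical, and decoding the head as latin-1 is an assumption made for illustration (it maps each octet to exactly one character, so it cannot fail or merge octets); a real parser would also validate the grammar of the start line and field names.

```python
def split_head(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, body).

    All framing is done on bytes, so CR/LF are found as octets and
    never as the by-product of a multibyte decoding step.
    """
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete message head")

    lines = head.split(b"\r\n")
    # Assumption: latin-1 is used only as a lossless octet-to-character
    # mapping; it says nothing about the encoding of the body.
    start_line = lines[0].decode("latin-1")

    headers = []
    for line in lines[1:]:
        name, colon, value = line.partition(b":")
        if not colon:
            raise ValueError("malformed header line")
        headers.append(
            (name.decode("latin-1"), value.strip(b" \t").decode("latin-1"))
        )
    # The body stays as bytes until a charset is known.
    return start_line, headers, body
```

The body is deliberately returned undecoded: its character set, if any, is only known once the headers have been read.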

There's a further note that, once you've successfully parsed the basic message into its header key-value pairs plus the message body, you can begin parsing the pieces with a more relaxed or non-default approach, according to certain headers. This is especially useful with RFC 7231's Content-Type header:

3.1.1.1. Media Type

HTTP uses Internet media types [RFC2046] in the Content-Type (Section 3.1.1.5) and Accept (Section 5.3.2) header fields in order to provide open and extensible data typing and type negotiation.

RFC 2046 (MIME Part Two: Media Types) defines the media types used to label message bodies, and in turn has a nice clear section on the charset parameter:

4.1.2. Charset Parameter

A critical parameter that may be specified in the Content-Type field for "text/plain" data is the character set. This is specified with a "charset" parameter, as in:

Content-type: text/plain; charset=iso-8859-1

It goes on to explain that other text/ media types should use the same charset semantics.
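As a rough sketch of how a client might apply the charset parameter, the snippet below uses the standard library's `email.message.Message` to pull `charset` out of a Content-Type value and decode the body with it. The helper name `decode_body` and the UTF-8 fallback are assumptions for illustration; the right default depends on the media type.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Decode body bytes using the charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lowercase,
    # or None when the header carries no charset.
    charset = msg.get_content_charset() or default  # fallback is an assumption
    return body.decode(charset)
```

For example, `decode_body("text/plain; charset=iso-8859-1", b"caf\xe9")` yields `"café"`. An unrecognized charset name raises `LookupError`, which a real client would need to handle.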

Note that Content-Encoding, Transfer-Encoding, and Content-Transfer-Encoding (a MIME field that HTTP does not use) all refer to a very limited set of encodings for compression or chunking, not to character sets.
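The difference between the two kinds of "encoding" can be sketched as two separate steps: undo the content coding on the bytes first, then apply the character set. The function name `body_text` is hypothetical, and only `gzip` is handled here for brevity.

```python
import gzip
from typing import Optional


def body_text(body: bytes, content_encoding: Optional[str], charset: str) -> str:
    """Undo the content coding (bytes to bytes), then the charset (bytes to str)."""
    if content_encoding == "gzip":
        # Content-Encoding describes a byte-level transformation such as
        # compression; it has nothing to do with characters.
        body = gzip.decompress(body)
    # Only the charset turns octets into text.
    return body.decode(charset)
```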

*American National Standards Institute, "Coded Character Set -- 7-bit American Standard Code for Information Interchange", ANSI X3.4, 1986.
