JavaScript – Null character in strings

Tags: google-chrome, javascript, null-terminated, string, unicode

Consider this string:

var s = "A\0Z";

Its length is 3, as given by s.length. Using console.log you can see that the string isn't cut: s[1] is the NUL character (which prints as nothing visible) and s.charCodeAt(1) is 0.
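A minimal sketch of the checks described above, runnable in a browser console or Node.js. The `info` object is only an illustrative name for collecting the results; JSON.stringify is used because it escapes the NUL and makes it visible.

```javascript
// A string with an embedded NUL (U+0000) between "A" and "Z".
const s = "A\0Z";

const info = {
  length: s.length,               // number of UTF-16 code units
  middleCode: s.charCodeAt(1),    // code unit at index 1
  middleIsNul: s[1] === "\u0000", // s[1] is a one-character string, not ""
  lastChar: s[2],                 // the text after the NUL is still there
};

console.log(info);
// JSON.stringify escapes control characters, so the NUL shows up as \u0000.
console.log(JSON.stringify(s));
```

Seeing the escaped form is usually more reliable than printing the raw string, since how a raw NUL is rendered depends on the console or dialog.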

When you alert it in Firefox, you see AZ. When you alert it in Chrome/Linux using alert(s), the \0 terminates the string and you see A.

My question is: what should browsers and JavaScript engines do? Is Chrome buggy here? Is there a document defining what should happen?

As this is a question about the standard, a reference is needed.

Best Answer

What the browser should do is keep track of the string and its length separately, since the standard has no notion of a null terminator. (A string is just an object with a length.)

What Chrome seems to do (I am taking your word for this) is use the standard C string functions, which terminate at a \0. To answer one of your questions: yes, to me this constitutes a bug in Chrome's handling of the alert() function.
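If a dialog or other display layer does truncate at the NUL, one possible workaround is to substitute a visible placeholder before displaying the string. This is only a sketch: `showNul` is a hypothetical helper name, and using U+2400 (the Unicode "SYMBOL FOR NULL" character) as the placeholder is an assumption, not something the spec or the answer prescribes.

```javascript
// Hypothetical helper: replace every embedded NUL with a visible placeholder.
function showNul(str) {
  // U+2400 is the Unicode "SYMBOL FOR NULL" character.
  return str.replace(/\u0000/g, "\u2400");
}

const original = "A\0Z";
const displayable = showNul(original);

// The placeholder keeps the length the same and the trailing "Z" reachable.
console.log(displayable.length);
console.log(displayable);
```

The original string is left untouched, so the substitution only affects what is shown to the user.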

Formally the spec says:

A string literal is zero or more characters enclosed in single or double quotes. Each character may be represented by an escape sequence. All characters may appear literally in a string literal except for the closing quote character, backslash, carriage return, line separator, paragraph separator, and line feed. Any character may appear in the form of an escape sequence.

Also:

A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of character values (CV) contributed by the various parts of the string literal.

And regarding the NUL byte:

The CV [Character Value] of EscapeSequence :: 0 [lookahead ∉ DecimalDigit] is a <NUL> character (Unicode value 0000).

Therefore, a NUL byte should simply be "yet another character value" and have no special meaning, as opposed to other languages where it might end an SV (String value).

For a reference on the valid "String Single Character Escape Sequences", see section 7.8.4 of the ECMAScript Language Specification. A table at the end of that section lists these escape sequences.
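To illustrate "yet another character value", here is a short sketch showing that ordinary string operations treat the NUL like any other character. The variable names are illustrative only.

```javascript
const text = "A\0Z";

// Searching and splitting work on the NUL just as on any other character.
const nulIndex = text.indexOf("\0");
const parts = text.split("\0");

// Concatenation keeps everything after the NUL.
const extended = text + "!";

// Comparison is over the whole value, so the string is not equal to "A".
const equalsPrefix = text === "A";

// Code units of each character, in order.
const codes = Array.from(text, (ch) => ch.charCodeAt(0));

console.log(nulIndex, parts, extended.length, equalsPrefix, codes);
```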

What someone aiming to write a JavaScript engine could probably learn from this: Don't use C/C++ string functions. :)

Related Solutions

JavaScript – How to check for an empty/undefined/null string in JavaScript

If you just want to check whether there's a truthy value, you can do:

if (strValue) {
    //do something
}

If you need to check specifically for an empty string over null, I would think checking against "" is your best bet, using the === operator (so that you know that it is, in fact, a string you're comparing against).

if (strValue === "") {
    //...
}
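The difference between the two checks can be sketched by running both over a few sample values. `isEmptyString` is a hypothetical helper name used only for this illustration; note that a string holding a single NUL character is non-empty and therefore truthy.

```javascript
// Hypothetical helper: true only for the empty string itself.
function isEmptyString(value) {
  return value === "";
}

const samples = ["", "\0", "0", null, undefined, 0];

// Truthiness check, as in `if (strValue) { ... }`.
const truthy = samples.map((v) => Boolean(v));

// Strict comparison against "", as in `if (strValue === "") { ... }`.
const empty = samples.map((v) => isEmptyString(v));

console.log(truthy);
console.log(empty);
```

The truthiness check lumps "", null, undefined, and 0 together, while the strict comparison singles out the empty string.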

C# – How to get a consistent byte representation of strings in C# without manually specifying an encoding

Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!

Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
