Is it possible to get GCC to compile UTF-8 with BOM source files?

byte-order-mark, g++, gcc, utf-8

I develop cross-platform C++ using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.

In Visual Studio I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).

For example:

// A = π.r²
double π = 3.14;

GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:

wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program

wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program

Which brings me to the question:

Is there a way to get GCC to compile UTF-8 files without first removing the BOM?

I'm using:

Windows 7
Visual Studio 2010

and:

Ubuntu Oneiric 11.10
GCC 4.6.1 (as provided by apt-get install gcc)

Edit:

As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
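To illustrate the distinction drawn in the edit, here is a minimal sketch that keeps the non-ASCII characters in a comment and a string literal only, and uses an ASCII identifier instead of π. The names pi, circle_area and area_formula are made up for this example, and it assumes the file is saved as UTF-8.

```cpp
#include <string>

// A = π.r²   (non-ASCII text inside a comment)
const double pi = 3.14;  // ASCII identifier in place of π

// Area of a circle with radius r.
double circle_area(double r) {
    return pi * r * r;
}

// Non-ASCII text inside a string constant.
std::string area_formula() {
    return "A = π.r²";
}
```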

Best Answer

According to the GCC Wiki, writing non-ASCII characters directly in identifiers isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCNs (universal character names). From the linked page:

perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'
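If it helps to see what that one-liner does spelled out, here is a rough C++ sketch of the same transformation: every non-ASCII code point becomes a \UXXXXXXXX name, and ASCII is copied through. The function name to_ucn is hypothetical, and the sketch assumes well-formed UTF-8 input (bytes that do not decode are simply copied unchanged). It is not taken from the GCC Wiki.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Replace each non-ASCII code point in UTF-8 text with a \UXXXXXXXX
// universal character name. Hypothetical helper, for illustration only.
std::string to_ucn(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char lead = static_cast<unsigned char>(src[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // ASCII, stray bytes, and truncated sequences are copied as-is.
        bool ok = len != 0 && i + len <= src.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out += src[i++]; continue; }

        char buf[11];  // backslash, 'U', 8 hex digits, terminating NUL
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
        out += buf;
        i += len;
    }
    return out;
}
```

With this sketch, the bytes 0xCF 0x80 (octal 317 200, the two stray bytes in the error messages above, which together encode π) come out as \U000003c0. Note that a leading BOM would be converted as well, so it would need to be stripped before calling the function.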

Related Solutions

C++ – How to get assembler output from C/C++ source in gcc

Use the -S option to gcc (or g++).

gcc -S helloworld.c

This will run the preprocessor (cpp) over helloworld.c, perform the initial compilation and then stop before the assembler is run.

By default this will output a file helloworld.s. The output file can still be set by using the -o option.

gcc -S -o my_asm_output.s helloworld.c

Of course this only works if you have the original source. An alternative if you only have the resultant object file is to use objdump, by setting the --disassemble option (or -d for the abbreviated form).

objdump -S --disassemble helloworld > helloworld.dump

This option works best if debugging information is enabled for the object file (-g at compilation time) and the file hasn't been stripped.

Running file helloworld will give you some indication as to the level of detail that you will get by using objdump.

What’s the difference between UTF-8 and UTF-8 without BOM

The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM for UTF-8 files is not recommended:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
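Since the signature is just the three bytes 0xEF, 0xBB, 0xBF at the start of the stream, checking for it or dropping it takes only a few lines. The following is a minimal sketch; has_utf8_bom and strip_utf8_bom are names invented for this example, not part of any library.

```cpp
#include <string>

// The UTF-8 BOM: U+FEFF encoded as the bytes EF BB BF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if text starts with the three BOM bytes.
bool has_utf8_bom(const std::string& text) {
    return text.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns a copy of text without a leading BOM; text that has no
// BOM is returned unchanged.
std::string strip_utf8_bom(const std::string& text) {
    return has_utf8_bom(text) ? text.substr(kUtf8Bom.size()) : text;
}
```

Only a BOM at the very start of the text is recognized; the same bytes appearing later in the text are left alone.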
