C – Why Are C String Literals Read-Only?

Tags: c, memory, strings

What advantage(s) of string literals being read-only justify (or historically justified) the following:

  1. Yet another way to shoot yourself in the foot

    char *foo = "bar";
    foo[0] = 'd'; /* SEGFAULT */
    
  2. Inability to elegantly initialize a read-write array of words in one line:

    char *foo[] = { "bar", "baz", "running out of traditional placeholder names" };
    foo[1][2] = 'n'; /* SEGFAULT */ 
    
  3. Complicating the language itself.

    char *foo = "bar";
    char var[] = "baz";
    some_func(foo); /* VERY DANGEROUS! */
    some_func(var); /* LESS DANGEROUS! */
    

Saving memory?
I've read somewhere (I couldn't find the source now) that a long time ago, when RAM was scarce, compilers tried to optimize memory usage by merging overlapping strings.

For example, "more" and "regex" could become "moregex". Is this still true today, in the era of Blu-ray-quality digital movies? I understand that embedded systems still operate in environments with restricted resources, but still, the amount of memory available has increased dramatically.

Compatibility issues?
I assume that a legacy program that tried to write to read-only memory would either crash or continue with an undiscovered bug. Thus no valid legacy program should be writing to string literals in the first place, and therefore allowing such writes would not harm valid, non-hackish, portable legacy programs.

Are there any other reasons? Is my reasoning incorrect? Would it be reasonable to consider a change to read-write string literals in new C standards or at least add an option to compiler? Was this considered before or are my "problems" too minor and insignificant to bother anyone?

Best Answer

Historically, it was the contrary. On the very first computers of the early 1970s (perhaps the PDP-11) running a prototypical, embryonic C (or its ancestor BCPL), there was no MMU and no memory protection (which did exist on most older IBM/360 mainframes). So every byte of memory, including the bytes holding literal strings or machine code, could be overwritten by an erroneous program, perhaps even rewriting parts of its own code (imagine a program changing some % to / in a printf(3) format string). Hence, literal strings and constants were writable.

As a teenager in 1975, I coded at the Palais de la Découverte museum in Paris on old 1960s-era computers without memory protection: the IBM/1620 had only core memory, which you could initialize through the keyboard, so you had to type several dozen digits to load the bootstrap that read the initial program from punched tape; the CAB/500 had magnetic drum memory, and you could disable writing to some tracks with mechanical switches near the drum.

Later, computers got some form of memory management unit (MMU) with memory protection: a device that forbids the CPU from overwriting certain kinds of memory. So some memory segments, notably the code segment (a.k.a. the .text segment), became read-only (except to the operating system, which loads them from disk). It was natural for the compiler and the linker to put literal strings into that code segment, so literal strings became read-only, and a program overwriting them became undefined behavior. And having a read-only code segment in virtual memory gives a significant advantage: several processes running the same program share the same RAM (physical memory pages) for that code segment (see mmap(2) on Linux).

Today, cheap microcontrollers keep their code (and their literal strings and other constants) in read-only memory (e.g. their Flash or ROM). And full-fledged microprocessors (like the one in your tablet, laptop or desktop) have a sophisticated memory management unit and cache machinery used for virtual memory and paging. So the code segment of the executable program (e.g. in ELF format) is memory-mapped as a read-only, shareable, and executable segment (by mmap(2) or execve(2) on Linux; BTW you could pass directives to ld to get a writable code segment if you really wanted one). Writing to it generally produces a segmentation fault.

So the C standard is baroque: legally (for historical reasons only), literal strings are not const char[] arrays, but plain char[] arrays that you are nonetheless forbidden to overwrite.

BTW, few current languages permit string literals to be overwritten (even OCaml, which historically, and regrettably, had writable string literals, changed that behavior in 4.02 and now has immutable strings).

Current C compilers can even optimize "ions" and "expressions" to share their last 5 bytes (the 4 characters of "ions" plus the terminating NUL byte).

Try compiling your C code in file foo.c with gcc -O -fverbose-asm -S foo.c and look inside the assembler file foo.s generated by GCC.

Finally, the semantics of C is complex enough (read about CompCert and Frama-C, which try to capture it formally), and adding writable string literals would make it even more arcane while making programs weaker and less secure (with even more undefined behavior), so it is very unlikely that future C standards will accept writable string literals. Perhaps, on the contrary, they will make them const char[] arrays, as they morally should be.

Notice also that, for many reasons, mutable data is harder for the computer to handle (cache coherency), harder to code for, and harder for the developer to reason about than constant data. So it is preferable for most of your data (notably literal strings) to stay immutable. Read more about the functional programming paradigm.

In the old Fortran77 days on the IBM/7094, a bug could even change a constant: if you wrote CALL FOO(1) and FOO happened to modify its argument (passed by reference) to 2, the implementation might have changed other occurrences of the literal 1 into 2, and that was a really nasty bug, quite hard to find.