This Stack Overflow post gives a fairly comprehensive list of situations that the C/C++ language specification declares to be 'undefined behaviour'. However, I want to understand why other modern languages, like C# or Java, don't have the concept of 'undefined behaviour'. Does it mean the compiler designers can control all possible scenarios in C# and Java, but not in C and C++?
Why C++ Has Undefined Behavior and C# or Java Don’t
c, java, programming-languages, undefined-behavior
Related Solutions
First, I'll note that although I only mention "C" here, the same applies about equally to C++ as well.
The comment mentioning Godel was partly (but only partly) on point.
When you get down to it, undefined behavior in the C standards is largely just pointing out the boundary between what the standard attempts to define, and what it doesn't.
Godel's theorems (there are two) basically say that it's impossible to define a mathematical system that can be proven (by its own rules) to be both complete and consistent. You can make your rules so it can be complete (the case he dealt with was the "normal" rules for natural numbers), or else you can make it possible to prove its consistency, but you can't have both.
In the case of something like C, that doesn't apply directly -- for the most part, "provability" of the completeness or consistency of the system isn't a high priority for most language designers. At the same time, yes, they probably were influenced (to at least some degree) by knowing that it's provably impossible to define a "perfect" system -- one that's provably complete and consistent. Knowing that such a thing is impossible may have made it a bit easier to step back, breathe a little, and decide on the bounds of what they would try to define.
At the risk of (yet again) being accused of arrogance, I'd characterize the C standard as being governed (in part) by two basic ideas:
- The language should support as wide a variety of hardware as possible (ideally, all "sane" hardware down to some reasonable lower limit).
- The language should support writing as wide a variety of software as possible for the given environment.
The first means that if somebody defines a new CPU, it should be possible to provide a good, solid, usable implementation of C for that, as long as the design falls at least reasonably close to a few simple guidelines -- basically, if it follows something on the general order of the Von Neumann model, and provides at least some reasonable minimum amount of memory, that should be enough to allow a C implementation. For a "hosted" implementation (one that runs on an OS) you need to support some notion that corresponds reasonably closely to files, and have a character set with a certain minimum set of characters (91 are required).
The second means it should be possible to write code that manipulates the hardware directly, so you can write things like boot loaders, operating systems, embedded software that runs without any OS, etc. There are ultimately some limits in this respect, so nearly any practical operating system, boot loader, etc., is likely to contain at least a little bit of code written in assembly language. Likewise, even a small embedded system is likely to include at least some sort of pre-written library routines to give access to devices on the host system. Although a precise boundary is difficult to define, the intent is that the dependency on such code should be kept to a minimum.
The undefined behavior in the language is largely driven by the intent for the language to support these capabilities. For example, the language allows you to convert an arbitrary integer to a pointer, and access whatever happens to be at that address. The standard makes no attempt at saying what will happen when you do (e.g., even reading from some addresses can have externally visible effects). At the same time, it makes no attempt at preventing you from doing such things, because you need to for some kinds of software you're supposed to be able to write in C.
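As a rough sketch of the kind of code that intent enables (the register address and bit position here are invented for illustration, not taken from any real datasheet):

```c
#include <stdint.h>

/* Hypothetical memory-mapped GPIO output register -- the address and
   bit are made up for illustration; real values come from a chip's
   datasheet. */
#define GPIO_ODR ((volatile uint32_t *)0x40021018u)

void led_on(void)
{
    /* Converting an arbitrary integer to a pointer and writing through
       it is exactly the kind of operation the standard leaves
       undefined, yet bare-metal code depends on the implementation
       making it work. */
    *GPIO_ODR |= 1u << 5;   /* set bit 5 of the output register */
}
```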
There is some undefined behavior driven by other design elements as well. For example, one other intent of C is to support separate compilation. This means (for example) that it's intended that you can "link" pieces together using a linker that follows roughly what most of us see as the usual model of a linker. In particular, it should be possible to combine separately compiled modules into a complete program without knowledge of the semantics of the language.
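A hedged, made-up illustration of how that interacts with undefined behavior: if two separately compiled files simply disagree about a function's type, a conventional linker will still combine them without complaint, and the resulting call is undefined (the name `frob` and both files are invented for this sketch):

```c
/* file1.c -- the caller's declaration says frob takes a long ...    */
void frob(long x);

void run(void)
{
    frob(42L);              /* undefined behaviour: types don't match */
}

/* file2.c -- ... but the definition actually takes an int.
   A typical linker joins the two object files happily, because it
   knows nothing about C's type rules. */
#include <stdio.h>

void frob(int x)
{
    printf("%d\n", x);
}
```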
There is another type of undefined behavior (much more common in C++ than in C) that's present simply because of the limits of compiler technology -- things that we basically know are errors, and would probably like the compiler to diagnose as errors, but that, given the current limits of compiler technology, probably can't be diagnosed under all circumstances. Many of these are driven by the other requirements, such as separate compilation, so it's largely a matter of balancing conflicting requirements; the committee has generally opted to support greater capabilities, even if that means failing to diagnose some possible problems, rather than limiting the capabilities to ensure that all possible problems are diagnosed.
These differences in intent drive most of the differences between C and something like Java or Microsoft's CLI-based systems. The latter are fairly explicitly limited to working with a much more limited set of hardware, or requiring software to emulate the more specific hardware they target. They also specifically intend to prevent any direct manipulation of hardware, instead requiring that you use something like JNI or P/Invoke (and code written in something like C) to even make such an attempt.
Going back to Godel's theorems for a moment, we can draw something of a parallel: Java and CLI have opted for the "internally consistent" alternative, while C has opted for the "complete" alternative. Of course, this is a very rough analogy -- I doubt anybody's attempting a formal proof of either internal consistency or completeness in either case. Nonetheless, the general notion does fit fairly closely with the choices they've taken.
Java is compile once, run anywhere. C++ is write once, compile anywhere.
Best Answer
Undefined behaviour is one of those things that were recognized as a very bad idea only in retrospect.
The first compilers were great achievements and jubilantly welcomed improvements over the alternative - machine language or assembly language programming. The problems with that were well-known, and high-level languages were invented specifically to solve those known problems. (The enthusiasm at the time was so great that HLLs were sometimes hailed as "the end of programming" - as if from now on we would only have to trivially write down what we wanted and the compiler would do all the real work.)
It wasn't until later that we realized the newer problems that came with the newer approach. Being remote from the actual machine that code runs on means there is more possibility of things silently not doing what we expected them to do. For instance, allocating a variable would typically leave the initial value undefined; this wasn't considered a problem, because you wouldn't allocate a variable if you didn't want to hold a value in it, right? Surely it wasn't too much to expect that professional programmers wouldn't forget to assign the initial value, was it?
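A minimal sketch (not from the original discussion) of what that oversight looks like in C, where reading an uninitialised automatic variable is undefined behaviour:

```c
#include <stdio.h>

int main(void)
{
    int total;                 /* allocated, but never initialised     */
    total += 5;                /* reads an indeterminate value -- UB   */
    printf("%d\n", total);     /* might print anything, or misbehave   */
    return 0;
}
```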
It turned out that with the larger code bases and more complicated structures that became possible with more powerful programming systems, yes, many programmers would indeed commit such oversights from time to time, and the resulting undefined behaviour became a major problem. Even today, the majority of security holes, from tiny to horrible, are the result of undefined behaviour in one form or another. (The reason is that, usually, undefined behaviour is in fact very much defined by things on the next lower level of computing, and attackers who understand that level can use that wiggle room to make a program do not only unintended things, but exactly the things they intend.)
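For instance (a deliberately simplified, made-up sketch), a classic overflow of a stack buffer is undefined at the level of the C standard, but entirely "defined" at the level of a concrete stack layout, which is exactly the layer an attacker works at:

```c
#include <string.h>

void greet(const char *name)
{
    char buf[16];
    /* If strlen(name) >= 16 this write runs past the end of buf.  The
       C standard says nothing about what happens next, but on a real
       machine the extra bytes land in whatever sits next to buf on the
       stack (saved registers, the return address, ...), which is
       precisely the wiggle room an attacker who knows the target's
       layout can aim at. */
    strcpy(buf, name);
}
```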
Since we recognised this, there has been a general drive to banish undefined behaviour from high-level languages, and Java was particularly thorough about this (which was comparatively easy since it was designed to run on its own specifically designed virtual machine anyway). Older languages like C can't easily be retrofitted like that without losing compatibility with the huge amount of existing code.
Edit: As pointed out, efficiency is another reason. Undefined behaviour means that compiler writers have a lot of leeway for exploiting the target architecture, so that each implementation gets away with the fastest possible implementation of each feature. This was more important on yesterday's underpowered machines than it is today, when programmer salaries are often the bottleneck for software development.
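A small, hedged example of the kind of leeway meant here (the function is invented for illustration): because signed integer overflow is undefined, a compiler is free to assume it never happens and simplify accordingly.

```c
/* Because signed overflow is undefined behaviour, the compiler may
   assume `i + 1 > i` is always true and compile this function down to
   a plain `return 1;`, with no overflow check.  With unsigned
   (wrap-around) arithmetic it could not make that assumption. */
int step_is_larger(int i)
{
    return i + 1 > i;
}
```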