Programming Languages – Why Many Languages Have Only Arrays and Hashes

data-structures, data-types, language-design, programming-languages

Many programming languages have only those two structures, and even some languages that have more structures still provide special syntax only for those two; usually, [] and {}. Why is this? Is there anything special about those datatypes that is necessary for the completeness of the language?

Best Answer

There's nothing that particularly forces a language to have arrays and hashes as fundamental datatypes. Indeed, many don't (especially older languages). However, there are a few fundamental concepts involved which indicate that these sorts of mappings make for good data structures.

Firstly, the ordered collection where you perform lookups by index number. This is a very common structure, useful whenever you've got a bunch of things and want to be able to walk through them one by one or look them up by some index. The key reason it is so popular is that the variant where the collection is kept compact in a contiguous region of memory (the array) is very efficient and fast on modern hardware. That efficiency is why arrays are very common (though not universal). The major alternative to the array is the linked list, which is also quite common; linked lists have linear-time lookup (whereas arrays have constant-time lookup) but super-cheap insertion and deletion in the middle of the sequence.
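The trade-off above can be sketched in a few lines. This is an illustrative toy (the `Node`, `nth`, and `insert_after` names are not from the answer): a Python list plays the role of the contiguous array with constant-time indexing, and a hand-rolled singly linked list shows linear-time lookup but constant-time splicing after a known node.

```python
# Toy singly linked list, to contrast with a contiguous array (Python list).
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def nth(head, i):
    """Walk i links from the head: linear-time lookup."""
    node = head
    for _ in range(i):
        node = node.next
    return node.value

def insert_after(node, value):
    """Splice a new node in after `node`: constant-time insertion."""
    node.next = Node(value, node.next)

# Build 10 -> 20 -> 30, then insert 15 after the first node.
head = Node(10, Node(20, Node(30)))
insert_after(head, 15)          # list is now 10 -> 15 -> 20 -> 30

array = [10, 20, 30]
assert array[1] == 20           # O(1): index into contiguous memory
assert nth(head, 1) == 15       # O(i): must walk the chain
```

Inserting 15 into the middle of `array` would instead shift every later element over, which is the cost the linked list avoids.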

The second major category of collection is a mapping from values of one type (that supports an equality test) to another type. This is a way of realizing a whole class of very simple functions in a memory-based data structure, and it is superb for implementing all sorts of other basic datatypes. The name of these things varies (e.g., “dictionary”, “associative array”), as does the implementation strategy; the three most common implementation strategies are the record/struct, the mapping tree, and the hash table. Structs are very common (and are in fact a partial hybrid between dictionaries and arrays, where the key is mapped to an offset into an array/memory block). Trees used to be very common, but have become less so as it turns out they tend to have surprisingly poor performance (their memory access pattern works poorly with the way CPU memory cache predictors work, which is unfortunate). Hash tables, which were relatively uncommon a few decades ago, work pretty well: they've got reasonable memory access patterns and are easy to implement well (which is definitely not true of trees!). Their major downside is that they don't guarantee the order of iteration (though that is fixable with some extra complexity in the data structure design).
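To make the "easy to implement well" point concrete, here is a minimal separate-chaining hash table sketch (the `HashTable` class and its method names are illustrative, not from the answer): hash the key, pick a bucket, and scan a short chain using the key type's equality test.

```python
# Minimal separate-chaining hash table: each bucket is a small list of
# (key, value) pairs, found by hashing the key to a bucket index.
class HashTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # keys only need an equality test
                bucket[i] = (key, value)  # replace existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

t = HashTable()
t.put("en", "hello")
t.put("fr", "bonjour")
assert t.get("fr") == "bonjour"
```

Note that iterating the buckets would visit entries in hash order, not insertion order, which is exactly the iteration-order caveat mentioned above; real implementations that do preserve order (such as CPython's dict) add an extra insertion-ordered entry array to the design.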

So, the real thing that languages are providing is ℤ⁺→α maps and α⁼→β maps (that is, from indices to values, and from equality-comparable keys to values). These are both generally very useful! One is normally done with arrays, because they are easy to implement and highly efficient for lookup (typically the most common operation), and the other is normally done with hash tables (or structures), again because they are easy to implement and usually efficient for lookup. Why these two particular maps? They turn out to be sufficient for creating a great many other structures with minimal extra code (which in turn means minimal extra mistakes).
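The "minimal extra code" claim is easy to demonstrate in a language whose [] and {} literals are exactly these two maps. A sketch of three other structures falling out almost for free:

```python
# A stack is just an array used from one end.
stack = []
stack.append("a")
stack.append("b")
assert stack.pop() == "b"

# A multiset (bag) is a hash from element to count.
counts = {}
for word in ["spam", "eggs", "spam"]:
    counts[word] = counts.get(word, 0) + 1
assert counts["spam"] == 2

# A directed graph is a hash from vertex to an array of neighbours.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
assert "c" in graph["a"]
```

Queues, sets, sparse matrices, and trees admit equally short encodings on top of the same two primitives, which is much of why languages single them out for literal syntax.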