Consider a basic js object:
var obj={x:1,y:'2'};
Is this stored internally as a hashtable or does js use a different mechanism for key value pairs? If they are hash tables does anyone know how they handle collisions?
data-structures hashing javascript
What you are asking for is possible given your constraints.
A hash table's strength is its fast lookup and insertion speed. To get that speed, one must forsake any semblance of order in the table: i.e. entries are all jumbled up. A list is acceptable to use as a table entry because while traversal is O(n), the lists tend to be short assuming the hash table is sufficiently large and the objects stored in the table are hashed using a good quality hashing algorithm.
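For illustration, here is a minimal fixed-size hash table with list (chain) buckets, sketched in Javascript since that is what the question used; every name here is made up for the example, and the hash function is just a simple deterministic one:

// Minimal fixed-size hash table using separate chaining.
// All names are illustrative; this is a sketch, not a library API.
function ChainedHashTable(size) {
  this.buckets = [];
  for (var i = 0; i < size; i++) this.buckets.push([]);
}
ChainedHashTable.prototype._hash = function (key) {
  // Simple deterministic string hash (djb2-style); hash quality is what
  // keeps the chains short. Assumes string keys.
  var h = 5381;
  for (var i = 0; i < key.length; i++) {
    h = ((h * 33) ^ key.charCodeAt(i)) >>> 0;
  }
  return h % this.buckets.length;
};
ChainedHashTable.prototype.set = function (key, value) {
  var bucket = this.buckets[this._hash(key)];
  for (var i = 0; i < bucket.length; i++) {
    if (bucket[i].key === key) { bucket[i].value = value; return; } // equivalent key: overwrite
  }
  bucket.push({ key: key, value: value }); // collision: append to the chain
};
ChainedHashTable.prototype.get = function (key) {
  var bucket = this.buckets[this._hash(key)];
  for (var i = 0; i < bucket.length; i++) { // O(n) walk of the chain
    if (bucket[i].key === key) return bucket[i].value;
  }
  return undefined;
};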
A binary search tree (BST) has fast insertion and lookup at O(log2 n). It also imposes a restriction on the elements it stores: there must be some way to order the elements. Given two elements A and B stored in the tree, it must be possible to determine if A comes before B or if they have equivalent order.
A hash table imposes no such restriction; instead, its elements must have two properties. First, there must be a way to determine if two elements are equivalent; second, there must be a way to calculate a deterministic hash code. Order is not a requirement.
If your hash table elements do have an order, then you can use a BST as a hash table entry (bucket) to hold objects with the same hash code (collisions). Because a BST has O(log2 n) lookup and insertion, the worst case for the whole structure (hash table plus BST) becomes O(log2 n), which is technically better than a list bucket's O(n). Depending on the BST implementation it will require more storage than a list, but likely not much more.
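A sketch of the bucket swap, again in Javascript with illustrative names. Note that an unbalanced BST like this one only gives O(log2 n) on average; a real implementation would want a self-balancing tree (red-black, AVL) to guarantee the bound:

// A BST bucket: keys must be orderable (here, strings compared with <).
function TreeBucket() { this.root = null; }
TreeBucket.prototype.set = function (key, value) {
  if (this.root === null) {
    this.root = { key: key, value: value, left: null, right: null };
    return;
  }
  var cur = this.root;
  for (;;) {
    if (key === cur.key) { cur.value = value; return; } // equivalent key: overwrite
    var side = key < cur.key ? 'left' : 'right';
    if (cur[side] === null) {
      cur[side] = { key: key, value: value, left: null, right: null };
      return;
    }
    cur = cur[side];
  }
};
TreeBucket.prototype.get = function (key) {
  var cur = this.root;
  while (cur !== null && cur.key !== key) {
    cur = key < cur.key ? cur.left : cur.right;
  }
  return cur === null ? undefined : cur.value;
};
// Swapping this in for the array bucket above turns a full bucket's
// worst-case lookup from O(n) into O(log2 n) (given a balanced tree).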
Please note that normally a BST brings nothing to the table (so to speak) as a hash table bucket in real-world situations, which is why the theoretically poor performance of a list is acceptable. In other words, the hash table compensates for the list's weakness by placing fewer items in each list (bucket). However, the problem specifically stated that the hash table cannot increase in size, and collisions are more frequent than is typical in a hash table.
Beyond sketches like the ones above, I am not going to put full code here because honestly it is not really necessary, and you did not give a language anyway.
What I would do is simply copy whatever standard hash table your language's standard library contains into a new class, then change the table bucket type from a list to a tree. Depending on the language and its standard library this may be a very trivial thing to do.
Normally I would not advocate copy and paste coding like this. However, it is an easy way to get a battle-tested data structure very quickly.
It is possible to put all the files in one directory, though sometimes directories get too big for that. Many file systems have a limit on directory size. Want to put a git repository on a FAT32-formatted USB stick? You can only store 65,535 files in a single directory. This means the directory structure needs to be subdivided so that filling a single directory is less likely.
This would become a problem even with other file systems and larger git repositories. A relatively small git repo that I've got hanging around (about 360 MiB) has 181,546 objects for 11k files. Pull the Linux repo and you've got 4,374,054 objects. Put these all in one directory and it would be impossible to check out; it would crash (for some meaning of 'crash') the file system.
So? You split it up by byte, letting the leading characters of the name pick the subdirectory. Similar approaches are used by applications such as Firefox:
~/Li/Ca/Fi/Pr/7a/Cache $ ls
0/ 4/ 8/ C/ _CACHE_001_
1/ 5/ 9/ D/ _CACHE_002_
2/ 6/ A/ E/ _CACHE_003_
3/ 7/ B/ F/ _CACHE_MAP_
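Git itself does exactly this with its loose objects: the first two hex characters of an object's SHA-1 become a directory name under .git/objects/, and the remaining 38 become the file name. A sketch of the mapping in Javascript (the function name is illustrative):

// Map a SHA-1 hex digest to git's loose-object path: the first two hex
// characters (one byte) name the fan-out directory, capping the layout
// at 256 subdirectories under .git/objects/.
function looseObjectPath(sha1Hex) {
  if (!/^[0-9a-f]{40}$/.test(sha1Hex)) {
    throw new Error('expected a 40-character lowercase SHA-1 hex string');
  }
  return '.git/objects/' + sha1Hex.slice(0, 2) + '/' + sha1Hex.slice(2);
}

With 256 buckets, even the Linux repo's 4,374,054 objects average only about 17,000 per directory.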
Beyond this, it also goes to a question of performance. Consider NTFS Performance with Numerous Long Filenames:
Windows NT takes a long time to perform directory operations on Windows NT file system (NTFS) formatted drives that contain a large number of files with long file names (names that do not conform to the 8.3 convention) in a single directory.
When NTFS enumerates files in a directory, it has to look up the 8.3 names associated with the long file names. Because an NTFS directory is maintained in a sorted state, corresponding long file names and 8.3 names are generally not next to one another in the directory listing. So, NTFS uses a linear search of the directory for every file present. As a result, the amount of time required to perform a directory listing increases with the square of the number of files in the directory. For small numbers of files (less than a few hundred) the time delay is negligible. But as the number of files in a directory increases to several thousand, the time required to perform a listing can increase to minutes, hours, or even days. The problem is aggravated if the long file names are very similar -- differing only in the last few characters.
With files named after SHA1 checksums, this could be a recipe for disaster and abysmal performance.
While the above is from a tech note from Windows NT 3.5 (and NTFS 1.2, commonly used from 1995 to the early 2000s), the same issue shows up in file systems such as ext3, whose directories were originally implemented as linked lists requiring O(n) lookups. And even after the hashed B-tree (HTree) index was added:
While the HTree algorithm significantly improved lookup times, it could cause some performance regressions for workloads that used readdir() to perform some operation of all of the files in a large directory.
...
One potential solution to mitigate this performance issue, which has been suggested by Daniel Phillips and Andreas Dilger, but not yet implemented, involves the kernel choosing free inodes whose inode numbers meet a property that groups the inodes by their filename hash. Daniel and Andreas suggest allocating the inode from a range of inodes based on the size of the directory, and then choosing a free inode from that range based on the filename hash. This should in theory reduce the amount of thrashing that results when accessing the inodes referenced in the directory in readdir order. It is not clear that this strategy will result in a speedup, however; in fact it could increase the total number of inode blocks that might have to be referenced, and thus make the performance of readdir() + stat() workloads worse. Clearly, some experimentation and further analysis is still needed.
Incidentally, this bit on how to improve performance was from 2005, the same year git was released.
As seen with Firefox and many other applications that keep lots of hashed cache files, the design of splitting up the cache by byte has negligible performance cost, and when used cross-platform with systems that may be a bit on the old side, it could very well be the difference between the program working or not.
Best Answer
"javascript internal" doesn't really mean anything, as far as I know, such thing isn't specified for Javascript.
For some Javascript engine or interpreter, the internal representation can be any number of things, whatever works for any given situation.
For long-lived objects of indeterminate lifetime, it's most likely a hash table, but a list (or whatever) of property names in natural ("alphabetical") sorted order probably also exists, at least temporarily.
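As an aside, the ordering that is observable from script is deterministic in modern engines (standardized since ES2015, well after this question's vintage): integer-like keys are enumerated first in ascending numeric order, then string keys in insertion order, so "alphabetical" is only an approximation:

// Own-property enumeration order in modern engines (ES2015+):
// integer-like keys first, ascending; then string keys in insertion order.
var obj = {};
obj.b = 1;
obj[2] = 2;
obj.a = 3;
obj[1] = 4;
console.log(Object.keys(obj)); // ["1", "2", "b", "a"]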
Arrays, as much as they exist in Javascript, probably also have a custom, optimized internal data structure, which may support faster indexed access than computing a hash table hash for every access.
And then a JIT-compiling JS engine might, for example, see that an object is never used for anything, in which case the object can be nothing at all internally: the instance is just optimized away and never actually created.