Jeff Dean’s Latency Numbers – Accuracy in Context of Varying Hardware Implementations

hardware

I'm referring this chart of latency numbers, attributed to Jeff Dean at Google.

The thing I don't understand is, do these numbers not vary from one set of hardware to the next? How can these be accurate for all different types of RAM, CPU, motherboard, hard drive, etc?

Best Answer

These numbers (also listed on Norvig's Teach yourself Programming in 10 years) are approximate, only useful as (order of) magnitude.

Actually, today's hardware (at least for desktop or laptops) does not vary that much even between a cheap 300€ laptop and a high-end 10k€ workstation. The speed varies by a factor of roughly 2 or 4 at most. Such a workstation can have a larger disk, more cores, cache, and RAM. However, this doesn't have much impact on the raw single-threaded performance.

Look at some figures on http://openbenchmarking.org/ or some CPU comparators.

The so called Moore's law is dying. My 3+ years old desktop at home (an i3770K) could be replaced (today, in march 2016) by some i6700 which is only 20% faster.

Related Solutions

Hardware – Understanding the Levels of Computing

The keyword for thinking about these things is abstraction.

Abstraction just means deliberately ignoring the details of a system so that you can think about it as a single, indivisible component when assembling a larger system out of many subsystems. It is unimaginably powerful - writing a modern application program while considering the details of memory allocation and register spilling and transistor runtimes would be possible in some idealized way, but it is incomparably easier not to think about them and just use high-level operations instead. The modern computing paradigm relies crucially on multiple levels of abstraction: solid-state electronics, microprogramming, machine instructions, high-level programming languages, OS and Web programming APIs, user-programmable frameworks and applications. Virtually no one could comprehend the entire system nowadays, and there isn't even a conceivable path via which we could ever go back to that state of affairs.

The flip side of abstraction is loss of power. By leaving decisions about details to lower levels, we often accept that they may be made with suboptimal efficiency, since the lower levels do not have the 'Big Picture' and can optimize their workings only by local knowledge, and are not as (potentially) intelligent as a human being. (Usually. For isntance, compiling HLL to machine code is nowadays often done better by machines than by even the most knowledgeable human, since processor architecture has become so complicated.)

The issue of security is an interesting one, because flaws and 'leaks' in the abstraction can often be exploited to violate the integrity of a system. When an API postulates that you may call methods A, B, and C, but only if condition X holds, it is easy to forget the condition and be unprepared for the fallout that happens when the condition is violated. For instance, the classical buffer overflow exploits the fact that writing to memory cells yields undefined behaviour unless you have allocated this particular block of memory yourself. The API only guarantees that something will happen as a result, but in practice the result is defined by the details of the system at the next lower level - which we have deliberately forgotten about! As long as we fulfill the condition, this is of no importance, but if not, an attacker who understands both levels intimately can usually direct the behaviour of the entire system as desired and cause bad things to happen.

The case of memory allocation bugs is particularly bad because it has turned out to be really, really hard to manage memory manually without a single error in a large system. This could be seen as a failed case of abstraction: although it is possible to do everything you need with the C malloc API, it is simply to easy to abuse. Parts of the programming community now think that this was the wrong place at which to introduce a level boundary into the system, and instead promote languages with automatic memory management and garbage collection, which loses some power, but provides protection against memory corruption and undefined behaviour. In fact, a major reason for still using C++ nowadays is precisely the fact that it allows you to control exactly what resources are acquired and released when. In this way, the major schism between managed and unmanaged languages today can be seen as a disagreement about where precisely to define a layer of abstraction.

The same can be said for many other major alternative paradigms in computing - the issue really crops up all the time where large systems have to be constructed, because we are simply unable to engineer solutions from scratch for the complex requirements common today. (A common viewpoint in AI these days is that the human brain actually does work like that - behaviour arising through feedback loops, massively interconnected networks etc. instead of separate modules and layers with simple, abstracted interfaces between them, and that this is why we have had so little success in simulating our own intelligence.)

Hardware Accelerated Text Processing – CPU vs GPU

(Disclaimer: I don't have information on the percentages of occurrence of various elementary string operations found in common software. The following answer is just my two cents of contribution to address some of your points.)

To give a quick example, Intel provides SIMD instructions for accelerating string operations in its SSE 4.2 instruction set (link to wikipedia article). Example of using these instructions to build useful language-level string functions can be found on this website.

What do these instructions do?

Given a 16-byte string fragment (either 16 counts of 8-bit characters or 8 counts of 16-bit characters),
In "Equal Each" mode, it performs an exact match with another string fragment of same length.
In "Equal Any" mode, it highlights the occurrence of characters which match a small set of characters given by another string. An example is "aeiouy", which detects the vowels in English words.
In "Range comparison" mode, it compares each character to one or more character ranges. An example of character range is azAZ, in which the first pair of characters specifies the range of lower-case English alphabets, and the second pair specifies the upper-case alphabets.
In "Equal ordered" mode, it performs a substring search.

(All examples above are taken from the above linked website, with some paraphrasing.)

Before beginning a discussion on this topic, it is necessary to gather the prerequisite knowledge needed for such discussion.

It is assumed that you already have college-level introductory knowledge of:

CPU architecture
Digital design and synthesis, which teaches introductory level of Hardware Description Language, such as Verilog or VHDL.
And finally, a practicum project where you use the above knowledge to build something (say, a simple 16-bit ALU, or a multiplier, or some hardware input string pattern detection logic based on state machine), and perform a cost counting (in logic gates and silicon area) and benchmarking of the things being built.

First of all, we must revisit the various schemes for the in-memory representation of strings.

This is because the practicum in hardware design should have informed you that a lot of things in hardware had to be "hard-wired". Hardware can implement complex logic but the wiring between those logic are hard-wired.

My knowledge in this aspect is very little. But just to give a few examples, rope, "cord", "twine", StringBuffer (StringBuilder) etc. are all legit contenders for the in-memory representation of strings.

Even in C++ alone, you still have two choices: implicit-length strings (also known as null-terminated strings), and explicit-length strings (in which the length is stored in a field of the string class, and is updated whenever the string is modified).

Finally, in some languages the designer has made the decision of making string objects immutable. That is, if one wish to modify a string, the only way to do so is to:

Copy the string and apply the modification on-the-fly, or
Refer to substrings (slices) of the original immutable string and declare the modifications that one wish to have applied. The modification isn't actually evaluated until the new result is consumed by some other code.

There is also a side question of how strings are allocated in memory.

Now you can see that, in software, a lot of wonderful (or crazy) design choices exist. There has been lots of research into how to implement these various choices in hardware. (In fact, this has been a favorite way of formulating a master's thesis for a degree in digital design.)

All this is fine, except that due to economic reasons, a hardware vendor cannot justify the cost of supporting a language/library designer's crazy ideas about what a string should be.

A hardware vendor typically has full access to every master's thesis written by every student (in digital design) in the world. Thus, a hardware vendor's decision of not including such features must be well-informed.

Now let's go back to the very basic, common-sense question: "What string operations are among the most-frequently performed in the typical software, and how can they benefit from hardware acceleration"?

I don't have hard figures, but my guess is that string copying verbatim is probably the #1 operation being performed.

Is string copying already accelerated by hardware? It depends on the expected lengths of strings. If the library code knows that it is copying a string of several thousand characters or more, without modification, it could have easily converted the operation into a memcpy, which internally uses CPU SIMD (vectorized instructions) to perform the memory movement.

Furthermore, on these new CPU architectures there is the choice of keeping the moved string in CPU cache (for subsequent operations) versus removing it from the CPU cache (to avoid cache pollution).

But how often does one need to copy such long strings?

It turns out that the standard C++ library had to optimize for the other case:

https://stackoverflow.com/questions/21694302/what-are-the-mechanics-of-short-string-optimization-in-libc

That is, strings with lengths in the low-ten's occur with such high frequency that special cases have to be made to minimize the overhead of memory management for these short strings. Go figure.