I am currently working on CUDA and DirectX. I want to know the data flow architecture of GPUs. What is presented for CUDA and for DirectX is a parallel programming model and a graphics pipeline, respectively. But I want to know the data flow architecture (like the data flow architecture of the 8086 processor).
GPU Data Flow Architecture
Related Solutions
However, with all of these new systems, it seems as if GPUs are better than CPUs in every way.
This is a fundamental misunderstanding. Present GPU cores are still limited compared to current top-of-the-line CPUs. I think NVIDIA's Fermi architecture is the most powerful GPU currently available. It has only 32-bit registers for integer arithmetic, and less capability for branch prediction and speculative execution than a current commodity Intel processor. Intel i7 chips provide three levels of caching, Fermi cores have only two, and each cache on the Fermi is smaller than the corresponding cache on the i7. Communication between the GPU cores is fairly limited, and your calculations have to be structured to accommodate that limitation (the cores are ganged into blocks, and communication between cores in a block is relatively fast, but communication between blocks is slow).
A significant limitation of current GPUs is that the cores all have to be running the same code. Unlike the cores in your CPU, you can't tell one GPU core to run your email client, and another core to run your web server. You give the GPU the function to invert a matrix, and all the cores run that function on different bits of data.
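As a minimal illustration of that point, here is a sketch of a CUDA kernel; the name scale and its parameters are made up for the example. Every thread executes exactly the same function and differs only in which element of the data it touches:

```
__global__ void scale(float *data, float factor, int n)
{
    // Each thread computes its own index; the code is identical for all of them.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // same function, different bit of data
}
```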
The processors on the GPU live in an isolated world. They can control the display, but they have no access to the disk, the network, or the keyboard.
Access to the GPU system has substantial overhead costs. The GPU has its own memory, so your calculations will be limited to the amount of memory on the GPU card. Transferring data between the GPU memory and main memory is relatively expensive. Pragmatically this means that there is no benefit in handing a handful of short calculations from the CPU to the GPU, because the setup and teardown costs will swamp the time required to do the calculation.
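As a rough sketch of that data flow (built around the hypothetical scale() kernel above, with error checking omitted), the host has to allocate device memory, copy the input across to the card, launch the kernel, and copy the result back; everything except the launch itself is overhead:

```
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host-side round trip for the scale() kernel sketched earlier.
void scale_on_gpu(std::vector<float> &host, float factor)
{
    int n = static_cast<int>(host.size());
    size_t bytes = n * sizeof(float);

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);                                      // allocate GPU memory
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);  // CPU -> GPU copy

    scale<<<(n + 255) / 256, 256>>>(d_data, factor, n);              // the actual work
    cudaDeviceSynchronize();

    cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU copy
    cudaFree(d_data);                                                // teardown
}
```

For a short calculation, the two cudaMemcpy calls and the allocation dominate the total time, which is exactly why small jobs are not worth shipping to the GPU.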
The bottom line is that GPUs are useful when you have many (as in hundreds or thousands of) copies of a long calculation that can be carried out in parallel. Typical tasks for which this is common are scientific computing, video encoding, and image rendering. For an application like a text editor, the only function where a GPU might be useful is rendering the type on the screen.
(Disclaimer: I don't have information on the percentages of occurrence of various elementary string operations found in common software. The following answer is just my two cents of contribution to address some of your points.)
To give a quick example, Intel provides SIMD instructions for accelerating string operations in its SSE 4.2 instruction set (link to wikipedia article). Examples of using these instructions to build useful language-level string functions can be found on this website.
What do these instructions do?
- Given a 16-byte string fragment (either 16 counts of 8-bit characters or 8 counts of 16-bit characters),
- In "Equal Each" mode, it performs an exact match with another string fragment of same length.
- In "Equal Any" mode, it highlights the occurrence of characters which match a small set of characters given by another string. An example is "aeiouy", which detects the vowels in English words.
- In "Range comparison" mode, it compares each character to one or more character ranges. An example of character range is
azAZ
, in which the first pair of characters specifies the range of lower-case English alphabets, and the second pair specifies the upper-case alphabets. - In "Equal ordered" mode, it performs a substring search.
(All examples above are taken from the above linked website, with some paraphrasing.)
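Below is a minimal sketch of the "Equal Any" mode using the corresponding compiler intrinsic (_mm_cmpistri, which maps to the PCMPISTRI instruction); the text fragment and character set are just illustrative:

```
#include <nmmintrin.h>   // SSE 4.2 intrinsics; compile with SSE 4.2 enabled (e.g. -msse4.2)
#include <cstdio>
#include <cstring>

int main()
{
    // "Equal Any" mode: find the first character of the text chunk that
    // matches any character of the set "aeiouy".
    const char set[16] = "aeiouy";              // the character set (implicit length)
    char text[16];
    std::memcpy(text, "strength through", 16);  // one 16-byte fragment of text

    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(set));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(text));

    int idx = _mm_cmpistri(a, b,
                           _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_LEAST_SIGNIFICANT);

    std::printf("first vowel at index %d\n", idx);  // 16 would mean "no match in this chunk"
    return 0;
}
```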
Before beginning a discussion on this topic, it is necessary to gather the prerequisite knowledge needed for such discussion.
It is assumed that you already have college-level introductory knowledge of:
- CPU architecture
- Digital design and synthesis, which teaches an introductory level of a hardware description language such as Verilog or VHDL.
- And finally, a practicum project where you use the above knowledge to build something (say, a simple 16-bit ALU, or a multiplier, or some state-machine-based hardware logic for detecting patterns in an input string), and perform a cost estimate (in logic gates and silicon area) and benchmarking of the things being built.
First of all, we must revisit the various schemes for the in-memory representation of strings.
This is because the practicum in hardware design should have informed you that a lot of things in hardware have to be "hard-wired". Hardware can implement complex logic, but the wiring between those logic blocks is fixed.
My knowledge in this area is limited, but just to give a few examples: rope, "cord", "twine", StringBuffer (StringBuilder), etc. are all legitimate contenders for the in-memory representation of strings.
Even in C++ alone, you still have two choices: implicit-length strings (also known as null-terminated strings), and explicit-length strings (in which the length is stored in a field of the string class, and is updated whenever the string is modified).
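A rough sketch of those two layouts (the struct and field names here are made up for illustration, not any particular library's):

```
#include <cstddef>

// Implicit length: the bytes of the string followed by a terminating '\0'.
// Finding the length requires walking the bytes.
const char *implicit_style = "hello";   // 'h' 'e' 'l' 'l' 'o' '\0'

// Explicit length: a small header records the length (and often the capacity)
// next to a pointer to the character data.
struct ExplicitString {
    char  *data;      // not necessarily null-terminated
    size_t length;    // kept up to date on every modification
    size_t capacity;  // allocated storage, so appends can avoid reallocating
};
```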
Finally, in some languages the designer has made the decision to make string objects immutable. That is, if one wishes to modify a string, the only way to do so is to:
- Copy the string and apply the modification on-the-fly, or
- Refer to substrings (slices) of the original immutable string and declare the modifications that one wishes to have applied; the modification isn't actually evaluated until the new result is consumed by some other code (sketched below).
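A minimal sketch of the second approach, again with made-up names; the original buffer is never touched, and a slice merely records which part of it a consumer should see:

```
#include <cstddef>

// A view into an immutable buffer: no copying, no modification.
struct Slice {
    const char *base;   // points into the original, immutable buffer
    size_t      offset; // where this slice starts
    size_t      length; // how many characters it covers
};

// A pending modification can then be described as a list of such slices plus
// replacement fragments, flattened into a new buffer only when the result is
// actually consumed.
```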
There is also a side question of how strings are allocated in memory.
Now you can see that, in software, a lot of wonderful (or crazy) design choices exist. There has been lots of research into how to implement these various choices in hardware. (In fact, this has been a favorite way of formulating a master's thesis for a degree in digital design.)
All this is fine, except that due to economic reasons, a hardware vendor cannot justify the cost of supporting a language/library designer's crazy ideas about what a string should be.
A hardware vendor typically has full access to every master's thesis written by every student (in digital design) in the world. Thus, a hardware vendor's decision not to include such features must be well-informed.
Now let's go back to the very basic, common-sense question: "What string operations are among the most frequently performed in typical software, and how can they benefit from hardware acceleration?"
I don't have hard figures, but my guess is that string copying verbatim is probably the #1 operation being performed.
Is string copying already accelerated by hardware? It depends on the expected lengths of strings. If the library code knows that it is copying a string of several thousand characters or more, without modification, it could have easily converted the operation into a memcpy, which internally uses CPU SIMD (vectorized instructions) to perform the memory movement.
Furthermore, on these newer CPU architectures there is the choice of keeping the copied data in the CPU cache (for subsequent operations) versus bypassing the cache with non-temporal stores (to avoid cache pollution).
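Those two options correspond roughly to regular stores versus non-temporal ("streaming") stores. A simplified sketch of the difference (alignment and tail handling omitted; this is not what any particular memcpy implementation actually does):

```
#include <emmintrin.h>   // SSE2: regular and non-temporal (streaming) stores
#include <cstddef>

// Illustrative copy loops; both assume src and dst are 16-byte aligned and
// n is a multiple of 16.
void copy_cached(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_store_si128(reinterpret_cast<__m128i *>(dst + i), v);   // goes through the cache
    }
}

void copy_streaming(char *dst, const char *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<const __m128i *>(src + i));
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i), v);  // bypasses the cache
    }
    _mm_sfence();  // make the streaming stores globally visible
}
```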
But how often does one need to copy such long strings?
It turns out that the standard C++ library had to optimize for the other case:
That is, strings with lengths in the low tens occur with such high frequency that special cases have to be made to minimize the overhead of memory management for these short strings. Go figure.
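Most implementations address this with some form of "small string optimization", keeping short strings inline in the object so the common case needs no heap allocation at all. A rough, hypothetical sketch of the idea (real layouts vary between implementations):

```
#include <cstddef>

// Hypothetical small-string layout: short strings live directly inside the
// object; only long strings pay for a heap allocation.
struct SsoString {
    static const size_t kInlineCapacity = 15;
    size_t length;
    union {
        char  inline_buf[kInlineCapacity + 1]; // used when length <= kInlineCapacity
        char *heap_buf;                        // used for longer strings
    };
};
```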
Best Answer
You're making the mistake of assuming that GPU architectures are as well-defined and consistent as CPU architectures; this isn't the case. GPU architectures can and will vary between different manufacturers, and can and do vary even between different hardware generations from the same manufacturer.
What the APIs model is a graphics pipeline, which describes at quite a high level how the various stages interact. A recent example (for D3D11) may be viewed here: http://msdn.microsoft.com/en-us/library/windows/desktop/ff476882%28v=vs.85%29.aspx
Microsoft also publishes register specs for D3D versions; for example, the specs for vs_3_0 are here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb172963%28v=vs.85%29.aspx
Both of these are still at a fairly high level and are API-specific rather than hardware-related. As explained above, there is not going to be any single set of hardware-specific info. Even if there were, you would not be able to make use of it, as modern operating systems disallow direct access to hardware; and even if you could, any code you write making use of it would run a high risk of not being valid on other manufacturers' hardware, or on other hardware from the same manufacturer.
That's one of the reasons why graphics APIs exist - because all hardware is so different it's just not possible to meaningfully and usefully access the hardware at a lower level. In other words, the kind of info you're asking for is something that you shouldn't have to worry about - that's the job of the API and driver; you worry about your own code.