C++ – Does the C++ standard mandate poor performance for iostreams, or am I just dealing with a poor implementation?

c++ iostream performance

Every time I mention slow performance of C++ standard library iostreams, I get met with a wave of disbelief. Yet I have profiler results showing large amounts of time spent in iostream library code (full compiler optimizations), and switching from iostreams to OS-specific I/O APIs and custom buffer management does give an order of magnitude improvement.

What extra work is the C++ standard library doing, is it required by the standard, and is it useful in practice? Or do some compilers provide implementations of iostreams that are competitive with manual buffer management?

Benchmarks

To get matters moving, I've written a couple of short programs to exercise the iostreams internal buffering:

putting binary data into an ostringstream http://ideone.com/2PPYw
putting binary data into a char[] buffer http://ideone.com/Ni5ct
putting binary data into a vector<char> using back_inserter http://ideone.com/Mj2Fi
NEW: vector<char> simple iterator http://ideone.com/9iitv
NEW: putting binary data directly into stringbuf http://ideone.com/qc9QA
NEW: vector<char> simple iterator plus bounds check http://ideone.com/YyrKy

Note that the ostringstream and stringbuf versions run fewer iterations because they are so much slower.

On ideone, the ostringstream is about 3 times slower than std::copy + back_inserter + std::vector, and about 15 times slower than memcpy into a raw buffer. This feels consistent with before-and-after profiling when I switched my real application to custom buffering.

These are all in-memory buffers, so the slowness of iostreams can't be blamed on slow disk I/O, too much flushing, synchronization with stdio, or any of the other things people use to excuse observed slowness of the C++ standard library iostream.
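The linked programs are not reproduced here, so the following is only a rough sketch of the kind of comparison described, not the original benchmark code. The names fill_ostringstream, fill_raw_buffer and time_ms are made up for illustration, and the data written (consecutive int values) is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Append `count` ints to an in-memory stream, one write() call per int.
std::string fill_ostringstream(std::size_t count) {
    std::ostringstream os;
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        os.write(reinterpret_cast<const char*>(&value), sizeof value);
    }
    return os.str();
}

// Produce the same bytes with memcpy into a buffer sized up front.
std::vector<char> fill_raw_buffer(std::size_t count) {
    std::vector<char> buffer(count * sizeof(int));
    char* out = buffer.data();
    for (std::size_t i = 0; i < count; ++i) {
        const int value = static_cast<int>(i);
        std::memcpy(out, &value, sizeof value);
        out += sizeof value;
    }
    assert(out == buffer.data() + buffer.size());  // wrote exactly the whole buffer
    return buffer;
}

// Wall-clock time of one call to f, in milliseconds (hypothetical helper).
template <typename F>
double time_ms(F f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Passing a lambda that calls each function to time_ms, for example time_ms([] { fill_ostringstream(1000000); }), gives a rough comparison. An optimizer may discard work whose result is unused, so a real benchmark should consume the returned data.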

It would be nice to see benchmarks on other systems and commentary on what common implementations do (such as GCC's libstdc++, Visual C++, Intel C++) and how much of the overhead is mandated by the standard.

Rationale for this test

A number of people have correctly pointed out that iostreams are more commonly used for formatted output. However, they are also the only modern API provided by the C++ standard for binary file access. But the real reason for doing performance tests on the internal buffering applies to the typical formatted I/O: if iostreams can't keep the disk controller supplied with raw data, how can they possibly keep up when they are responsible for formatting as well?

Benchmark Timing

All these are per iteration of the outer (k) loop.

On ideone (gcc-4.3.4, unknown OS and hardware):

ostringstream: 53 ms
stringbuf: 27 ms
vector<char> and back_inserter: 17.6 ms
vector<char> with ordinary iterator: 10.6 ms
vector<char> iterator and bounds check: 11.4 ms
char[]: 3.7 ms

On my laptop (Visual C++ 2010 x86, cl /Ox /EHsc, Windows 7 Ultimate 64-bit, Intel Core i7, 8 GB RAM):

ostringstream: 73.4 ms, 71.6 ms
stringbuf: 21.7 ms, 21.3 ms
vector<char> and back_inserter: 34.6 ms, 34.4 ms
vector<char> with ordinary iterator: 1.10 ms, 1.04 ms
vector<char> iterator and bounds check: 1.11 ms, 0.87 ms, 1.12 ms, 0.89 ms, 1.02 ms, 1.14 ms
char[]: 1.48 ms, 1.57 ms

Visual C++ 2010 x86, with Profile-Guided Optimization (cl /Ox /EHsc /GL /c, link /ltcg:pgi, run, link /ltcg:pgo, measure):

ostringstream: 61.2 ms, 60.5 ms
vector<char> with ordinary iterator: 1.04 ms, 1.03 ms

Same laptop, same OS, using cygwin gcc 4.3.4 g++ -O3:

ostringstream: 62.7 ms, 60.5 ms
stringbuf: 44.4 ms, 44.5 ms
vector<char> and back_inserter: 13.5 ms, 13.6 ms
vector<char> with ordinary iterator: 4.1 ms, 3.9 ms
vector<char> iterator and bounds check: 4.0 ms, 4.0 ms
char[]: 3.57 ms, 3.75 ms

Same laptop, Visual C++ 2008 SP1, cl /Ox /EHsc:

ostringstream: 88.7 ms, 87.6 ms
stringbuf: 23.3 ms, 23.4 ms
vector<char> and back_inserter: 26.1 ms, 24.5 ms
vector<char> with ordinary iterator: 3.13 ms, 2.48 ms
vector<char> iterator and bounds check: 2.97 ms, 2.53 ms
char[]: 1.52 ms, 1.25 ms

Same laptop, Visual C++ 2010 64-bit compiler:

ostringstream: 48.6 ms, 45.0 ms
stringbuf: 16.2 ms, 16.0 ms
vector<char> and back_inserter: 26.3 ms, 26.5 ms
vector<char> with ordinary iterator: 0.87 ms, 0.89 ms
vector<char> iterator and bounds check: 0.99 ms, 0.99 ms
char[]: 1.25 ms, 1.24 ms

EDIT: Ran all twice to see how consistent the results were. Pretty consistent IMO.

NOTE: On my laptop, since I can spare more CPU time than ideone allows, I set the number of iterations to 1000 for all methods. This means that ostringstream and vector reallocation, which takes place only on the first pass, should have little impact on the final results.

EDIT: Oops, found a bug in the vector-with-ordinary-iterator version: the iterator wasn't being advanced, so there were too many cache hits. I was wondering how vector<char> was outperforming char[]. It didn't make much difference, though; vector<char> is still faster than char[] under VC++ 2010.

Conclusions

Buffering of output streams requires three steps each time data is appended:

Check that the incoming block fits the available buffer space.
Copy the incoming block.
Update the end-of-data pointer.

The latest code snippet I posted, "vector<char> simple iterator plus bounds check", not only does this but also allocates additional space and moves the existing data when the incoming block doesn't fit. As Clifford pointed out, buffering in a file I/O class wouldn't have to do that; it would just flush the current buffer and reuse it. So this should be an upper bound on the cost of buffering output. And it's exactly what is needed to make a working in-memory buffer.
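As an illustration of the three steps plus growth, here is a minimal sketch of such an in-memory buffer. GrowableBuffer is a hypothetical type written for this example, not the code behind the ideone links, and the doubling growth policy is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical growable output buffer; `used` is the end-of-data offset.
struct GrowableBuffer {
    std::vector<char> storage;
    std::size_t used = 0;

    explicit GrowableBuffer(std::size_t initial_capacity)
        : storage(initial_capacity) {}

    void append(const void* data, std::size_t size) {
        if (size == 0) return;  // nothing to copy
        assert(used <= storage.size());

        // Step 1: check that the incoming block fits the available space.
        if (size > storage.size() - used) {
            // Grow (assumed policy: at least double); resize moves existing data.
            const std::size_t needed = used + size;
            const std::size_t doubled = storage.size() * 2;
            storage.resize(doubled > needed ? doubled : needed);
        }

        // Step 2: copy the incoming block.
        std::memcpy(storage.data() + used, data, size);

        // Step 3: update the end-of-data position.
        used += size;
    }
};
```

A file-backed variant would replace the growth branch with a flush of the current contents followed by resetting used to zero.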

So why is stringbuf 2.5x slower on ideone, and at least 10 times slower when I test it? It isn't being used polymorphically in this simple micro-benchmark, so that doesn't explain it.

Best Answer

Not answering the specifics of your question so much as the title: the 2006 Technical Report on C++ Performance has an interesting section on IOStreams (p.68). Most relevant to your question is in Section 6.1.2 ("Execution Speed"):

Since certain aspects of IOStreams processing are distributed over multiple facets, it appears that the Standard mandates an inefficient implementation. But this is not the case — by using some form of preprocessing, much of the work can be avoided. With a slightly smarter linker than is typically used, it is possible to remove some of these inefficiencies. This is discussed in §6.2.3 and §6.2.5.

Since the report was written in 2006 one would hope that many of the recommendations would have been incorporated into current compilers, but perhaps this is not the case.

As you mention, facets may not feature in write() (but I wouldn't assume that blindly). So what does feature? Running GProf on your ostringstream code compiled with GCC gives the following breakdown:

44.23% in std::basic_streambuf<char>::xsputn(char const*, int)
34.62% in std::ostream::write(char const*, int)
12.50% in main
6.73% in std::ostream::sentry::sentry(std::ostream&)
0.96% in std::string::_M_replace_safe(unsigned int, unsigned int, char const*, unsigned int)
0.96% in std::basic_ostringstream<char>::basic_ostringstream(std::_Ios_Openmode)
0.00% in std::fpos<int>::fpos(long long)

So the bulk of the time is spent in xsputn, which eventually calls std::copy() after lots of checking and updating of cursor positions and buffers (have a look in c++\bits\streambuf.tcc for the details).
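To make the layering in that profile concrete, here is a small sketch that writes the same bytes through std::ostream::write and directly through the stream buffer's sputn. The function names via_ostream and via_stringbuf are invented for this example; how much time each layer costs depends on the implementation.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Through the ostream layer: each write() constructs a sentry and then
// forwards the bytes to the underlying stream buffer.
std::string via_ostream(const int* values, std::size_t count) {
    std::ostringstream os;
    for (std::size_t i = 0; i < count; ++i) {
        os.write(reinterpret_cast<const char*>(&values[i]), sizeof(int));
    }
    return os.str();
}

// Directly on the stream buffer: sputn() forwards to the virtual xsputn().
std::string via_stringbuf(const int* values, std::size_t count) {
    std::stringbuf sb(std::ios_base::out);
    for (std::size_t i = 0; i < count; ++i) {
        sb.sputn(reinterpret_cast<const char*>(&values[i]), sizeof(int));
    }
    return sb.str();
}
```

Both functions produce identical bytes; only the amount of per-call bookkeeping differs.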

My take on this is that you've focused on the worst-case situation. All the checking that is performed would be a small fraction of the total work done if you were dealing with reasonably large chunks of data. But your code is shifting data in four bytes at a time, and incurring all the extra costs each time. Clearly one would avoid doing so in a real-life situation: consider how negligible the penalty would have been if write had been called once on an array of 1m ints instead of 1m times on one int each. And in a real-life situation one would really appreciate the important features of IOStreams, namely its memory-safe and type-safe design. Such benefits come at a price, and you've written a test which makes these costs dominate the execution time.
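The two calling patterns contrasted above can be sketched as follows. The function names are hypothetical, and no timing is claimed here; the point is only that both produce the same output while paying the per-call overhead a very different number of times.

```cpp
#include <cstddef>
#include <ios>
#include <sstream>
#include <string>
#include <vector>

// One write() call per int: per-call checks are paid values.size() times.
std::string write_one_at_a_time(const std::vector<int>& values) {
    std::ostringstream os;
    for (const int v : values) {
        os.write(reinterpret_cast<const char*>(&v), sizeof v);
    }
    return os.str();
}

// One write() call for the whole array: per-call checks are paid once.
std::string write_in_one_call(const std::vector<int>& values) {
    std::ostringstream os;
    if (!values.empty()) {
        os.write(reinterpret_cast<const char*>(values.data()),
                 static_cast<std::streamsize>(values.size() * sizeof(int)));
    }
    return os.str();
}
```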

Related Solutions

C++ – What does the C++ standard state the size of int, long type to be?

The C++ standard does not specify the size of integral types in bytes, but it specifies minimum ranges they must be able to hold. You can infer minimum size in bits from the required range. You can infer minimum size in bytes from that and the value of the CHAR_BIT macro that defines the number of bits in a byte. In all but the most obscure platforms it's 8, and it can't be less than 8.

One additional constraint for char is that its size is always 1 byte, or CHAR_BIT bits (hence the name). This is stated explicitly in the standard.

The C standard is a normative reference for the C++ standard, so even though it doesn't state these requirements explicitly, C++ requires the minimum ranges required by the C standard (page 22), which are the same as those from Data Type Ranges on MSDN:

signed char: -127 to 127 (note, not -128 to 127; this accommodates 1's-complement and sign-and-magnitude platforms)
unsigned char: 0 to 255
"plain" char: same range as signed char or unsigned char, implementation-defined
signed short: -32767 to 32767
unsigned short: 0 to 65535
signed int: -32767 to 32767
unsigned int: 0 to 65535
signed long: -2147483647 to 2147483647
unsigned long: 0 to 4294967295
signed long long: -9223372036854775807 to 9223372036854775807
unsigned long long: 0 to 18446744073709551615

A C++ (or C) implementation can define the size of a type in bytes sizeof(type) to any value, as long as

the expression sizeof(type) * CHAR_BIT evaluates to a number of bits high enough to contain required ranges, and
the ordering of type is still valid (e.g. sizeof(int) <= sizeof(long)).

Putting this all together, we are guaranteed that:

char, signed char, and unsigned char are at least 8 bits
signed short, unsigned short, signed int, and unsigned int are at least 16 bits
signed long and unsigned long are at least 32 bits
signed long long and unsigned long long are at least 64 bits
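These guarantees can be written down as compile-time checks. The sketch below assumes a C++11 compiler; bits_in is a helper invented for this example and reports storage bits (sizeof times CHAR_BIT), which may include padding bits on unusual platforms.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in the storage of T (hypothetical helper).
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}

static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");
static_assert(sizeof(char) == 1, "char is exactly one byte");
static_assert(bits_in<short>() >= 16, "short is at least 16 bits");
static_assert(bits_in<int>() >= 16, "int is at least 16 bits");
static_assert(bits_in<long>() >= 32, "long is at least 32 bits");
static_assert(bits_in<long long>() >= 64, "long long is at least 64 bits");
static_assert(sizeof(short) <= sizeof(int) &&
              sizeof(int) <= sizeof(long) &&
              sizeof(long) <= sizeof(long long),
              "the size ordering of the integer types holds");
```

On a conforming implementation every assertion passes, so the file compiles without producing any output.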

No guarantee is made about the size of float or double except that double provides at least as much precision as float.

The actual implementation-specific ranges can be found in the <limits.h> header in C, or <climits> in C++ (or even better, the templated std::numeric_limits in the <limits> header).

For example, this is how you will find the range for int:

C:

#include <limits.h>
const int min_int = INT_MIN;
const int max_int = INT_MAX;

C++:

#include <limits>
const int min_int = std::numeric_limits<int>::min();
const int max_int = std::numeric_limits<int>::max();

Java – Does use of final keyword in Java improve the performance?

Usually not. For virtual methods, HotSpot keeps track of whether the method has actually been overridden, and is able to perform optimizations such as inlining on the assumption that a method hasn't been overridden - until it loads a class which overrides the method, at which point it can undo (or partially undo) those optimizations.

(Of course, this is assuming you're using HotSpot - but it's by far the most common JVM, so...)

To my mind you should use final based on clear design and readability rather than for performance reasons. If you want to change anything for performance reasons, you should perform appropriate measurements before bending the clearest code out of shape - that way you can decide whether any extra performance achieved is worth the poorer readability/design. (In my experience it's almost never worth it; YMMV.)

EDIT: As final fields have been mentioned, it's worth bringing up that they are often a good idea anyway, in terms of clear design. They also change the guaranteed behaviour in terms of cross-thread visibility: after a constructor has completed, any final fields are guaranteed to be visible in other threads immediately. This is probably the most common use of final in my experience, although as a supporter of Josh Bloch's "design for inheritance or prohibit it" rule of thumb, I should probably use final more often for classes...