C++ Programming – Using size_t or int for Dimensions and Indexes


In C++, size_t (or, more precisely, T::size_type, which is "usually" size_t, i.e., an unsigned type) is used as the return type of size(), the parameter type of operator[], and so on (see std::vector et al.).
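
As a quick illustration, here is a minimal sketch of that convention, assuming a typical standard-library implementation (the standard only requires size_type to be some unsigned integer type, but in practice it is size_t):

#include <cstddef>
#include <type_traits>
#include <vector>

// Holds on typical implementations; the standard only guarantees that
// size_type is an unsigned integer type, not that it is size_t.
static_assert(std::is_same<std::vector<int>::size_type, std::size_t>::value,
              "size_type is usually size_t");

int main()
{
    std::vector<int> v{1, 2, 3};
    std::size_t n = v.size(); // size() returns an unsigned type
    v[n - 1] = 4;             // operator[] takes an unsigned type
}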

On the other hand, .NET languages use int (and, optionally, long) for the same purpose; in fact, CLS-compliant languages are not required to support unsigned types.

Given that .NET is newer than C++, something tells me that there may be problems using unsigned integers even for things that "can't possibly" be negative, like an array index or length. Is the C++ approach a "historical artifact" kept for backwards compatibility? Or are there real and significant design trade-offs between the two approaches?

Why does this matter? Well … what should I use for a new multi-dimensional class in C++: size_t or int?

#include <cstddef> // std::size_t
#include <cstdint> // std::int32_t / std::int64_t

struct Foo final // e.g., image, matrix, etc.
{
    typedef std::int32_t /* or std::int64_t */ dimension_type; // *OR* always "size_t"?
    typedef std::size_t size_type; // c.f., std::vector<>

    dimension_type bar_; // maybe rows, or x
    dimension_type baz_; // e.g., columns, or y

    // STL-like interface (one plausible definition: total element count)
    size_type size() const { return static_cast<size_type>(bar_) * static_cast<size_type>(baz_); }
};

Best Answer

Given that .NET is newer than C++, something tells me that there may be problems using unsigned integers even for things that "can't possibly" be negative, like an array index or length.

Yes. For certain types of applications such as image processing or array processing, it is often necessary to access elements relative to the current position:

sum = data[k - 2] + data[k - 1] + data[k] + data[k + 1] + ...

In these types of applications, you cannot perform a range check with unsigned integers without thinking carefully. Suppose k has type size_t:

if (k - 2 < 0) {
    throw std::out_of_range("will never be thrown"); 
}

if (k < 2) {
    throw std::out_of_range("will be thrown"); 
}

if (k < 2uL) {
    throw std::out_of_range("will be thrown, without signedness ambiguity"); 
}

Instead, you have to rearrange your range-check expression. That is the main difference. Programmers must also remember the integer conversion rules; when in doubt, re-read http://en.cppreference.com/w/cpp/language/operator_arithmetic#Conversions
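
For instance, here is a minimal sketch of the rearranged check for the windowed sum above (window_sum is a hypothetical helper; the window is truncated to data[k - 2] through data[k + 1]):

#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch only: with unsigned k, the lower bound must be written as k < 2
// rather than k - 2 < 0, and the upper bound must avoid k + 1 wrapping
// around (k + 1 >= data.size() is subtly wrong when k == SIZE_MAX).
double window_sum(const std::vector<double>& data, std::size_t k)
{
    if (data.size() < 2 || k < 2 || k > data.size() - 2) {
        throw std::out_of_range("window out of range");
    }
    return data[k - 2] + data[k - 1] + data[k] + data[k + 1];
}

Note how even the upper-bound test has to be phrased carefully to avoid the very wraparound it is guarding against.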

A lot of applications do not need to use very large array indices, but they do need to perform range checks. Furthermore, a lot of programmers are not trained to do this expression-rearrangement gymnastics. A single missed rearrangement opens the door to an exploit.

C# is indeed designed for applications that will not need more than 2^31 elements per array. For example, a spreadsheet application does not need to deal with that many rows, columns, or cells. C# handles the upper limit by offering optional checked arithmetic, which can be enabled for a block of code with the checked keyword rather than through compiler options. For this reason, C# favors signed integers. When these decisions are considered together, the design makes good sense.

C++ is simply different, and it is harder to write correct code in it.

Regarding the practical importance of signed arithmetic in avoiding a violation of the principle of least astonishment, a case in point is OpenCV, which uses signed 32-bit integers for matrix element indexes, array sizes, pixel channel counts, etc. Image processing is an example of a programming domain that uses relative array indexes heavily, and unsigned integer underflow (a negative result wrapping around to a huge positive value) would severely complicate algorithm implementations.
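
To make that concrete, here is a minimal sketch in OpenCV's style (plain C++, not actual OpenCV code; box_blur is a hypothetical name) showing why signed indexes help at image borders:

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: a 1-D box blur with clamp-to-edge borders. Because x and dx
// are signed, x + dx can legitimately go negative at the left border and is
// simply clamped; with unsigned indexes, x + dx would silently wrap around.
std::vector<int> box_blur(const std::vector<int>& src)
{
    const int n = static_cast<int>(src.size());
    std::vector<int> dst(src.size());
    for (int x = 0; x < n; ++x) {
        int sum = 0;
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = std::min(std::max(x + dx, 0), n - 1); // clamp to [0, n-1]
            sum += src[static_cast<std::size_t>(nx)];
        }
        dst[static_cast<std::size_t>(x)] = sum / 3;
    }
    return dst;
}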