C# Noncontiguous Arrays – Performance Analysis

In C#, when a user creates a List<byte> and adds bytes to it, there is a chance it runs out of space and needs to allocate more. It then allocates an array double (or some other multiple of) the size of the previous one, copies the bytes over and discards the reference to the old array. I know that the list grows exponentially because each allocation is expensive; this limits it to O(log n) allocations, whereas just adding, say, 10 extra slots each time would result in O(n) allocations.
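
For reference, this growth policy is easy to observe by watching List<T>.Capacity as items are added; the initial capacity and exact growth factor are implementation details of the runtime rather than guarantees:

```csharp
using System;
using System.Collections.Generic;

class ListGrowthDemo
{
    static void Main()
    {
        var list = new List<byte>();
        int lastCapacity = -1;

        for (int i = 0; i < 1_000_000; i++)
        {
            list.Add(0);
            if (list.Capacity != lastCapacity)
            {
                // Prints 4, 8, 16, 32, ... on current .NET runtimes (i.e. doubling),
                // but the growth factor is an implementation detail.
                Console.WriteLine($"Count = {list.Count,9}  Capacity = {list.Capacity,9}");
                lastCapacity = list.Capacity;
            }
        }
    }
}
```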

However, for large arrays there can be a lot of wasted space, perhaps nearly half the array. To reduce memory usage I wrote a similar class, NonContiguousArrayList, which uses List<byte> as a backing store while the list holds less than 4MB, and then allocates additional 4MB byte arrays as it grows.

Unlike List<byte>, these arrays are non-contiguous, so no data is copied around; growth is just an additional 4MB allocation. When an item is looked up, the index is divided by 4M to find the array containing the item, then taken modulo 4M to get the index within that array.
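
A minimal sketch of that chunked lookup (the class and member names here are illustrative, not the actual NonContiguousArrayList, and the small-list List<byte> mode is omitted):

```csharp
using System.Collections.Generic;

// Illustrative sketch only; ChunkedByteList, _chunks and ChunkSize are made-up names.
public class ChunkedByteList
{
    private const int ChunkSize = 4 * 1024 * 1024;             // 4MB per backing array
    private readonly List<byte[]> _chunks = new List<byte[]>();

    public long Count { get; private set; }

    public void Add(byte value)
    {
        // Grow by exactly one 4MB chunk when the last one is full; nothing is copied.
        if (Count == (long)_chunks.Count * ChunkSize)
            _chunks.Add(new byte[ChunkSize]);

        this[Count] = value;
        Count++;
    }

    public byte this[long index]
    {
        // Chunk index = index / 4M, offset within the chunk = index % 4M.
        get => _chunks[(int)(index / ChunkSize)][index % ChunkSize];
        set => _chunks[(int)(index / ChunkSize)][index % ChunkSize] = value;
    }
}
```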

Can you point out problems with this approach? Here is my list:

  • Noncontiguous arrays lack cache locality, which results in bad performance. However, at a 4MB block size it seems like there would be enough locality for good caching.
  • Accessing an item is not quite as simple; there's an extra level of indirection. Would this get optimized away? Would it cause cache problems?
  • Since growth is linear after the 4MB limit is hit, there will be many more allocations than there would ordinarily be (roughly 256 allocations for 1GB of memory; see the arithmetic sketch after this list). No memory is copied after 4MB, however I'm not sure whether the extra allocations are more expensive than copying large chunks of memory.
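
Back-of-the-envelope arithmetic for that last point (a sketch, not a benchmark):

```csharp
using System;

class AllocationCountSketch
{
    static void Main()
    {
        long totalBytes = 1L << 30;                // 1GB
        int chunkBytes  = 4 * 1024 * 1024;         // 4MB

        // Doubling growth: roughly log2(N) allocations (each one also copies the old data).
        int doublingAllocations = (int)Math.Ceiling(Math.Log2(totalBytes));

        // Fixed 4MB chunks: N / 4MB allocations, with no copying.
        long chunkAllocations = totalBytes / chunkBytes;

        Console.WriteLine($"Doubling growth: ~{doublingAllocations} allocations");  // ~30
        Console.WriteLine($"4MB chunks:      {chunkAllocations} allocations");      // 256
    }
}
```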

Best Answer

At the scales you mentioned, the concerns are quite different from the ones you listed.

Cache locality

  • There are two related concepts:
    1. Locality, the reuse of data on the same cache line (spatial locality) that was recently visited (temporal locality)
    2. Automatic cache prefetching (streaming).
  • At the scales you mentioned (hundred MBs to gigabytes, in 4MB chunks), the two factors have more to do with your data element access pattern than the memory layout.
  • My (clueless) prediction is that, statistically, there might not be much performance difference compared to one giant contiguous memory allocation. No gain, no loss.

Data element access pattern

  • This article visually illustrates how memory access patterns will affect performance.
  • In short, just keep in mind that if your algorithm is already bottlenecked by memory bandwidth, the only way to improve performance is to do more useful work with the data that is already loaded into the cache.
  • In other words, even if YourList[k] and YourList[k+1] have a high probability of being adjacent in memory (a one-in-four-million chance of not being so), that fact will not help performance if you access your list completely randomly, or in large unpredictable strides, e.g. while (...) { index += random.Next(1024); DoStuff(YourList[index]); } (see the sketch below).
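
For instance, here is the kind of strided access that defeats prefetching regardless of layout (illustrative only; the array size and stride are arbitrary):

```csharp
using System;

class AccessPatternSketch
{
    static void Main()
    {
        var data = new byte[256 * 1024 * 1024];    // 256MB, arbitrary
        var random = new Random(42);
        long sum = 0;

        // Sequential scan: the hardware prefetcher keeps the next cache lines coming,
        // so memory bandwidth is used well whether or not the data is chunked.
        for (int i = 0; i < data.Length; i++)
            sum += data[i];

        // Large unpredictable strides: most accesses miss the cache, and neither a
        // contiguous array nor 4MB chunks can hide that.
        int index = 0;
        while (index < data.Length)
        {
            sum += data[index];
            index += random.Next(1, 4096);
        }

        Console.WriteLine(sum);
    }
}
```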

Interaction with the GC system

  • In my opinion, this is where you should focus most.
  • At the minimum, understand how your design will interact with the garbage collector: its generations, compaction, and in particular the Large Object Heap (LOH), since each 4MB chunk is well above the 85,000-byte large-object threshold.
  • I am not knowledgeable in these topics so I will leave others to contribute.
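
One concrete interaction worth checking: because each 4MB chunk exceeds the large-object threshold, it is allocated on the LOH, which the GC reports as part of generation 2. A small check (the output shown reflects current .NET runtimes):

```csharp
using System;

class LargeObjectHeapCheck
{
    static void Main()
    {
        var small = new byte[1024];              // below the ~85,000-byte large-object threshold
        var chunk = new byte[4 * 1024 * 1024];   // well above it

        // Freshly allocated small objects start in generation 0; large objects go
        // straight to the LOH, which the GC reports as generation 2.
        Console.WriteLine(GC.GetGeneration(small));  // 0
        Console.WriteLine(GC.GetGeneration(chunk));  // 2
    }
}
```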

Overhead of address offset calculations

  • Typical C# code already does a lot of address offset calculation, so the additional overhead from your scheme wouldn't be any worse than that of typical C# code working on a single array.
    • Remember that C# code also performs array range checking; this does not prevent C# from reaching array-processing performance comparable to C++ code.
    • The reason is that performance is mostly bottlenecked by memory bandwidth.
    • The trick to getting the most out of memory bandwidth is to use SIMD instructions for memory read/write operations. Neither typical C# nor typical C++ does this; you have to resort to libraries or language add-ons (see the Vector<T> sketch at the end of this section).

To illustrate why:

  • Do address calculation
  • (In OP's case, load chunk base address (which is already in cache) and then do more address calculation)
  • Read from / write to the element address

The last step still takes the lion's share of time.
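
In C#, one readily available option for SIMD-width reads and writes is System.Numerics.Vector<T>. A sketch of a vectorized copy loop (illustrative only; whether it beats Buffer.BlockCopy or Span<T>.CopyTo depends on the workload and the JIT):

```csharp
using System;
using System.Numerics;

static class SimdCopySketch
{
    // Copies src into dst using Vector<byte>-wide loads and stores.
    public static void Copy(byte[] src, byte[] dst)
    {
        if (dst.Length < src.Length)
            throw new ArgumentException("Destination is too small.", nameof(dst));

        int width = Vector<byte>.Count;          // e.g. 16 or 32 bytes, depending on hardware
        int i = 0;

        for (; i <= src.Length - width; i += width)
        {
            var v = new Vector<byte>(src, i);    // SIMD-width load
            v.CopyTo(dst, i);                    // SIMD-width store
        }

        for (; i < src.Length; i++)              // scalar tail
            dst[i] = src[i];
    }
}
```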

Personal suggestion

  • You can provide a CopyRange function, which would behave like the Array.Copy function but operate between two instances of your NonContiguousByteArray, or between one instance and an ordinary byte[]. Such a function can use SIMD code (C++ or C#) to maximize memory bandwidth utilization, and then your C# code can operate on the copied range without the overhead of repeated dereferencing and address calculation (a possible shape follows).
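
A possible shape for such a function, assuming the 4MB chunk layout discussed above (names are illustrative, and plain Buffer.BlockCopy stands in here for the SIMD copy):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper assuming the 4MB chunk layout sketched earlier.
public static class ChunkedCopy
{
    private const int ChunkSize = 4 * 1024 * 1024;

    // Copies 'count' bytes starting at 'sourceIndex' in the chunk list into a plain byte[].
    // Each chunk contributes at most one bulk copy, so the caller can then work on a
    // contiguous buffer with no per-element indirection or address arithmetic.
    public static void CopyRange(IReadOnlyList<byte[]> chunks, long sourceIndex,
                                 byte[] destination, int destinationOffset, int count)
    {
        while (count > 0)
        {
            int chunkIndex  = (int)(sourceIndex / ChunkSize);
            int chunkOffset = (int)(sourceIndex % ChunkSize);
            int copyLength  = Math.Min(count, ChunkSize - chunkOffset);

            Buffer.BlockCopy(chunks[chunkIndex], chunkOffset,
                             destination, destinationOffset, copyLength);

            sourceIndex       += copyLength;
            destinationOffset += copyLength;
            count             -= copyLength;
        }
    }
}
```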

Usability and interoperability concerns

  • Apparently you cannot use this NonContiguousByteArray with any C#, C++ or foreign-language libraries that expect contiguous byte arrays, or byte arrays that can be pinned.
  • However, if you write your own C++ acceleration library (with P/Invoke or C++/CLI), you can pass in a list of base addresses of the 4MB blocks to the underlying code (see the pinning sketch after this list).
    • For example, if you need to give access to elements starting at (3 * 1024 * 1024) and ending at (5 * 1024 * 1024 - 1), the access spans chunk[0] and chunk[1]. You can then construct an array (size 2) of byte arrays (each 4MB), pin those chunks and pass their addresses to the underlying code.
  • Another usability concern is that you will not be able to implement the IList<byte> interface efficiently: Insert and Remove will just take too long to process because they will require O(N) time.
    • In fact, it looks like you can't implement anything other than IEnumerable<byte>, i.e. it can be scanned sequentially and that's it.
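
For the interop route, a sketch of pinning the chunks and handing their base addresses to native code (the DLL name and native signature are made up for illustration):

```csharp
using System;
using System.Runtime.InteropServices;

static class NativeChunkInterop
{
    // Hypothetical native entry point; the actual signature would depend on your C++ library.
    [DllImport("MyAcceleration.dll")]
    private static extern void ProcessChunks(IntPtr[] chunkPointers, int chunkCount, int chunkSize);

    public static void Process(byte[][] chunks)
    {
        var handles  = new GCHandle[chunks.Length];
        var pointers = new IntPtr[chunks.Length];
        try
        {
            // Pin each chunk so the GC cannot move it while native code holds its address.
            for (int i = 0; i < chunks.Length; i++)
            {
                handles[i]  = GCHandle.Alloc(chunks[i], GCHandleType.Pinned);
                pointers[i] = handles[i].AddrOfPinnedObject();
            }

            ProcessChunks(pointers, chunks.Length, chunks[0].Length);
        }
        finally
        {
            foreach (var handle in handles)
                if (handle.IsAllocated)
                    handle.Free();
        }
    }
}
```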