C++ – 2D linked list vs. multidimensional array/vector

arraycdata structureslinked list

I hope that programmers is the correct stack exchange for this, as it is quite a concept based question.

I'm working on a data structure in C++ that is a represents data in 3D space. The x-y plane is large (say 0 – 10 000) while the z-axis has a very small range (0 – 10). All the data is located on integer (x,y,z) points. However there can be more than one than one data point on an (x,y,z) coordinate.

I have thought of three ways of approaching this subject, and don't have the necessary experience to choose between them.

  1. Using a 2D linked list inside vectors (dynamic arrays) vector<vector<structureA*>> Data where the first list item for that (x,y) point can be accessed using Data[x][y]. structureA contains an its z value, its actual data, as well as a pointer to the next structure in the list (that may, or may not have a different z coordinate, but is located on the same line defined by x and y). Accessing elements at the middle and end of this list is slow.

  2. Using a three dimensional array/vector: vector<vector<vector<structureB>>> Data. In this case the first two vectors define the x and y coordinates (as before) and the third vector gives us an array of all of the elements on that line defined by x and y. We can quickly determine how many elements are on this line (by querying the third vector) and can iterate from the back and the front. structureB contains, the point's z value and its data. This seems like a fairly good and safe (regarding memory allocation) solution to me.

  3. Using a three dimensional array/vector and a linked list. In this case the data could be accessed as such: Data[x][y][z] and would point to the first element of a linked list of data that resides at this coordinate. The structure pointed to would not contain its z value here as this is already dealt with in its address in the array. However, many (x,y,z) points have no data, whereas most of those that do have data have 10+ data structures. (All lines defined by a specific (x,y) contain at least one structure). This seems the worst solution to me as in many cases I would have to iterate through empty z coordinates, before hitting a linked list to iterate through.

The emphasis in this program is speed, my data set is less than 100MB on a modern computer, so even an inefficient method of storing it in RAM is fine. Option 2 seems the best to me, although in the past I've normally seen people more inclined to go with option 1. I'm interested to hear your opinions on the best way to handle this along with your reasoning.
Many Thanks.

Update

Thanks for the responses, I stand to learn quite a lot from your comments as I am a hobby programmer, having taken only one short course on the fundamentals of C a few years back.

@MorphingDragon The array is not jagged, all z "slices" have the same range of x and y. The values also don't seem sparse enough for a hash table (which IIRC slow down significantly on hash collision – which is guaranteed as multiple elements have the same (x,y,z)). The dimensions of the set are known on file load (they are included in the file header along with an md5 hash of the file for error checking and a few bit relevant to the analysis).

@J Trana / Florian F The data set is pre-built and is infrequently modified by the program – a piece of data may need to be moved or deleted. For adjusting data particularly just adjusting its z-value (or removing it), which are the most frequent operations. The linked list will be faster as these operations are O(1) as opposed to O(n) in a vector. However, I believe these changes are infrequent enough that having O(n) for changes may be worthwhile if the lookup speed is even marginally faster (as this is the more frequent operation).

The majority of the time the data will be accessed iteratively, taking an n*p slice from the x-y plane. Pseudocode example:

for each y between startY and (startY + p)
  for each x between startX and (startX + n)
    read all nodes in Data[x][y] //gets nodes for each z on an (x,y) line
    analyse(&nodes) //possibly modify the nodes, possibly just compute something based on their values

Data access is always related to the (x,y) line an element is on, but we don't always need all the nodes on that line (dependant on their z). So we either have a structure that doesn't require us to retrieve all of them (just the ones we want) or we perform a secondary sort. The data on file represents a sorted list in ascending z for each (x,y).

I will also look into those other data structures. I've never met them, as with very little training, my method is to use STL vectors or lists for everything. I'd like to spend the time to do this properly this time though.

I spent a fair amount of time thinking about this last night after posting the question and seeing the first responses. I realised that "if one piece of data from a point (x,y,z) is required, then all elements from that point are required at the same time." This has led me to think that option 3, my least preferred when I started this yesterday afternoon, is actually the best of the options I've enumerated. If I use a three-dimensional dynamic array of pointers to linked list, I don't exactly lose much speed or heap space setting a pointer to NULL rather than attaching a struct. This also allows full access to any 3d, which while not currently required may reduce some operations from O(n) to O(1).

Having had a brief look over octaves/r-trees, I feel safe saying that my data set is not complex enough to require such an engineering heavy solution. When it comes down to it, i am fairly sure that any of the three options I outlined would work, even if not the fastest, but I would like to learn a good and relatively efficient method, rather than just relying on the awesome power of modern computers to brute force a problem like I normally do.
Thanks for the replies and suggestions so far. 🙂

Best Answer

I hope it's ok to answer my own question. I believe I have found the optimal (without overcomplicating the problem) data structure for my problem. There was at least minor idiocy on my part for not recognising this earlier. The data doesn't need to be accessed by (x,y,z) but instead by (x, y, range of z (say 0 - 3)). This give a C++ struct as follows:

struct node {
  struct node *next;
  int zGroup;
  int z;
  50 bytes of misc data };

I can then address this through a 3D dynamic array (vectors):

vector< vector < vector < node* > > > Data;

Any given Data[x][y][zGroup] points to the first element of a linked list, the entirety of which is needed every time one element of it is needed. No value of this array is NULL, every one contains a linked-list of at least one element.

The third dimension of the array - the zGroup has jagged dimensions, but with dynamic arrays this isn't an issue. Given the data and computations being performed on it, I know that the max x and y values are set when the file is read and do not change, neither does the number of z groups on any given (x,y) line, the actual z-values of nodes may change, but they will remain inside the same z-groups, giving a constant-sized, fully populated array.

With the way that the file is structured it is also easy enough to page it in and out of memory if I am brought to do this with much larger data sets.