<insert premature-optimization-is-the-root-of-all-evil lecture>
That said, here are some habits I've gotten into to avoid unnecessary inefficiency and, in some cases, make my code simpler and more correct as well. This isn't a discussion of general principles, just a few things to be aware of so you don't introduce needless inefficiencies into your code.
It's pretty much common sense that a loop whose condition repeats an expensive calculation, such as the hidden inner loop inside strlen() here, is going to be slow. For example:
for (i = 0; i < strlen(str); i++) {
    ...
}
This will take a horrendous amount of time if the string is really long, because the length is recalculated on every iteration of the loop. (Note that GCC can actually optimize this particular case, because strlen() is declared as a pure function, so the call can be hoisted out of the loop when the compiler can see that the string isn't modified.)
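The straightforward fix is to hoist the call out of the loop yourself rather than relying on the optimizer; a minimal sketch, assuming str isn't modified inside the loop:
size_t len = strlen(str); // computed once, not once per iteration
for (i = 0; i < len; i++) {
    ...
}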
When sorting a million 32-bit integers, bubble sort would be the wrong way to go. In general, sorting can be done in O(n * log n) time (or better, in the case of radix sort), so unless you know your data is going to be small, look for an algorithm that's at least O(n * log n).
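For instance, the C standard library already gives you an O(n log n) sort with no extra work. A minimal sketch using qsort():
#include <stdio.h>
#include <stdlib.h>

// Comparison callback for qsort(); (x > y) - (x < y) avoids the
// signed overflow that the tempting "return x - y;" would risk.
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int data[] = {42, 7, 19, 3, 25};
    size_t n = sizeof data / sizeof data[0];

    qsort(data, n, sizeof data[0], cmp_int);

    for (size_t i = 0; i < n; i++)
        printf("%d\n", data[i]);
    return 0;
}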
Likewise, when dealing with databases, be aware of indexes. If you SELECT * FROM people WHERE age = 20 and there is no index on people(age), the query will require an O(n) sequential scan rather than a much faster O(log n) index scan. A one-line CREATE INDEX statement on people(age) can fix that.
Integer arithmetic hierarchy
When programming in C, bear in mind that some arithmetic operations are more expensive than others. For integers, the hierarchy goes something like this (least expensive first):
Bitwise operations and shifts (&, |, ^, ~, <<, >>)
Addition and subtraction
Multiplication
Division and modulus
Granted, the compiler will usually optimize things like n / 2 to n >> 1 automatically if you're targeting a mainstream computer (at least when it knows n is non-negative), but if you're targeting an embedded device, you might not get that luxury.
Also, % 2 and & 1 have different semantics. Division and modulus round toward zero (this was implementation-defined in C89, but C99 requires it), while good ol' >> and & round toward negative infinity, which (in my opinion) makes a lot more sense. (Strictly speaking, right-shifting a negative value is implementation-defined as well, but on the usual two's-complement machines it's an arithmetic shift.) For instance, on my computer:
printf("%d\n", -1 % 2); // -1 (maybe)
printf("%d\n", -1 & 1); // 1
Hence, use what makes sense. Don't think you're being a good boy by using % 2 when you were originally going to write & 1.
Expensive floating point operations
Avoid heavy floating-point operations like pow() and log() in code that doesn't really need them, especially when dealing with integers. Take, for example, reading a number:
#include <ctype.h>
#include <math.h>

int parseInt(const char *str)
{
    const char *p;
    int digits;
    int number;
    int position;

    // Count the number of digits
    for (p = str; isdigit(*p); p++)
        {}
    digits = p - str;

    // Sum the digits, multiplying them by their respective power of 10.
    number = 0;
    position = digits - 1;
    for (p = str; isdigit(*p); p++, position--)
        number += (*p - '0') * pow(10, position);

    return number;
}
Not only is this use of pow() (and the int <-> double conversions needed to use it) rather expensive, but it creates an opportunity for precision loss (incidentally, the code above happens to get away with it, since small powers of ten are exactly representable as doubles on a typical implementation). That's why I wince when I see this type of function used in a non-mathematical context.
Also, notice how the "clever" algorithm below, which multiplies by 10 on each iteration, is actually more concise than the code above:
#include <ctype.h>

int parseInt(const char *str)
{
    const char *p;
    int number;

    number = 0;
    for (p = str; isdigit(*p); p++) {
        number *= 10;       // shift the accumulated digits one decimal place left
        number += *p - '0'; // append the new low-order digit
    }
    return number;
}
There are all kinds of techniques for high-performance transaction processing and the one in Fowler's article is just one of many at the bleeding edge. Rather than listing a bunch of techniques which may or may not be applicable to anyone's situation, I think it's better to discuss the basic principles and how LMAX addresses a large number of them.
For a high-scale transaction processing system you want to do all of the following as much as possible:
Minimize time spent in the slowest storage tiers. From fastest to slowest on a modern server you have: CPU/L1 -> L2 -> L3 -> RAM -> Disk/LAN -> WAN. The jump from even the fastest modern magnetic disk to the slowest RAM is over 1000x for sequential access; random access is even worse.
Minimize or eliminate time spent waiting. This means sharing as little state as possible, and, if state must be shared, avoiding explicit locks whenever possible.
Spread the workload. Individual CPU cores haven't gotten much faster in the past several years, but they have gotten smaller and more numerous, and 8 cores is pretty common on a server. Beyond that, you can even spread the work over multiple machines, which is Google's approach; the great thing about this is that it scales everything, including I/O.
According to Fowler, LMAX takes the following approach to each of these:
Keep all state in memory at all times. Most database engines will actually do this anyway, if the entire database can fit in memory, but they don't want to leave anything up to chance, which is understandable on a real-time trading platform. In order to pull this off without adding a ton of risk, they had to build a bunch of lightweight backup and failover infrastructure.
Use a lock-free queue ("disruptor") for the stream of input events. Contrast this with traditional durable message queues, which are definitively not lock-free and in fact usually involve painfully slow distributed transactions. (A minimal sketch of the lock-free idea appears after this list.)
Not much, as far as spreading the workload goes. LMAX throws this one under the bus on the basis that workloads are interdependent: the outcome of one changes the parameters for the others. This is a critical caveat, and one which Fowler explicitly calls out. They do make some use of concurrency in order to provide failover capabilities, but all of the business logic is processed on a single thread.
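To make the lock-free point concrete, here is a minimal single-producer/single-consumer ring buffer built on C11 atomics. To be clear, this is not how the LMAX Disruptor is actually implemented; it's just a sketch of the underlying idea of handing events between two threads through preallocated memory, without ever taking a lock:
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024 // must be a power of two

struct ring {
    _Atomic size_t head; // next slot the producer will write
    _Atomic size_t tail; // next slot the consumer will read
    int slots[RING_SIZE];
};

// Producer side: returns 0 if the ring is full.
int ring_push(struct ring *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return 0; // full
    r->slots[head & (RING_SIZE - 1)] = value;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

// Consumer side: returns 0 if the ring is empty.
int ring_pop(struct ring *r, int *value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return 0; // empty
    *value = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}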
LMAX is not the only approach to high-scale OLTP. And although it's quite brilliant in its own right, you do not need to use bleeding-edge techniques in order to pull off that level of performance.
Of all of the principles above, #3 is probably the most important and the most effective, because, frankly, hardware is cheap. If you can properly partition the workload across half a dozen cores and several dozen machines, then the sky's the limit for conventional parallel-computing techniques. You'd be surprised how much throughput you can pull off with nothing but a bunch of message queues and a round-robin distributor (sketched below). It's obviously not as efficient as LMAX, actually not even close, but throughput, latency, and cost-effectiveness are separate concerns, and here we're talking specifically about throughput.
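As an illustration of how little machinery that takes, here's a toy round-robin dispatcher. The enqueue_for_worker() function is a stand-in of my own invention; in a real system it would push onto a per-worker message queue or hand the event to another machine:
#include <stdio.h>

#define NUM_WORKERS 4

// Stand-in for the real handoff: a per-worker queue, socket, etc.
static void enqueue_for_worker(int worker, int event)
{
    printf("event %d -> worker %d\n", event, worker);
}

int main(void)
{
    int next = 0;
    for (int event = 0; event < 10; event++) {
        enqueue_for_worker(next, event); // round-robin dispatch
        next = (next + 1) % NUM_WORKERS;
    }
    return 0;
}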
If you have the same sort of special needs that LMAX does - in particular, a shared state which corresponds to a business reality as opposed to a hasty design choice - then I'd suggest trying out their component, because I haven't seen much else that's suited to those requirements. But if we're simply talking about high scalability then I'd urge you to do more research into distributed systems, because they are the canonical approach used by most organizations today (Hadoop and related projects, ESB and related architectures, CQRS which Fowler also mentions, and so on).
SSDs are also going to be a game-changer; arguably, they already are. You can now have permanent storage with access times much closer to RAM's, and although server-grade SSDs are still horribly expensive, they will eventually come down in price as adoption grows. SSD-backed storage has been researched extensively, and the results are pretty mind-boggling and will only get better over time, so the whole "keep everything in memory" concept is a lot less important than it used to be. So once again, I'd try to focus on concurrency whenever possible.
Best Answer
I don't know details of MATLAB's memory handling, so this is just an educated guess: Cell arrays are implemented so that the array itself contains only references (pointers) to the cells, which actually live in the heap. It definitely can't allocate memory for the actual cells in advance because, as you wrote, their size is unknown. However, it can pre-allocate the pointer array itself, since the size of the pointers is known.
When you think about it, it would be quite difficult to implement an array whose element size wouldn't be constant: how would you know where in the memory X[1234] lives, if the size of each element can be different? Therefore a layer of indirection (store constant-sized pointers pointing to the actual data) is quite useful. An alternative would be some sort of linked list, a different kind of trade-off.
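A rough C analogue of that guessed layout: the pointer array has fixed-size elements and can be pre-allocated up front, while the variable-sized cells are allocated individually on the heap as their contents become known:
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = 1000;

    // Pre-allocate the pointer array: every element is a fixed-size pointer,
    // so this can be done before the cell contents are known.
    char **cells = malloc(n * sizeof *cells);

    for (size_t i = 0; i < n; i++) {
        size_t len = (i % 7) + 1;   // cell sizes vary...
        cells[i] = malloc(len + 1); // ...so each cell is allocated separately on the heap
        memset(cells[i], 'x', len);
        cells[i][len] = '\0';
    }

    // cells[234] is found with one constant-time pointer lookup, even though
    // the payloads differ in size; that's the layer of indirection at work.

    for (size_t i = 0; i < n; i++)
        free(cells[i]);
    free(cells);
    return 0;
}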