I both agree and disagree with your father. Performance should be thought about early, but micro-optimization should be thought about early only if you actually know that a large fraction of time will be spent in small CPU-bound sections of code.
The problem with micro-optimization is that it is usually done without any real understanding of how programs actually come to spend more time than necessary.
This knowledge comes from experience doing performance tuning, as in this example, in which a seemingly straightforward program, with no obvious inefficiencies, is taken through a series of diagnosis and speedup steps, until it is 43 times faster than at the beginning.
What it shows is that you cannot really guess or intuit where the problems will be. If you perform diagnosis, which in my case is random-pausing, lines of code responsible for a significant fraction of time are preferentially exposed. If you look at those lines, you may find code you can replace with something faster, and thereby reduce overall time by roughly that fraction.
Other things you didn't fix still take as much time as they did before, but since the overall time has been reduced, those things now take a larger fraction, so if you do it all again, that fraction can also be eliminated. If you keep doing this over multiple iterations, that's how you can get massive speedups, without ever necessarily having done any micro-optimization.
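To make the compounding concrete, here is a minimal Python sketch. The per-pass fractions are invented for illustration; only the rough magnitude of the 43x figure comes from the example above.

    # Hypothetical diagnostic results: each tuning pass removes a fraction
    # f of the *remaining* run time, multiplying total speedup by 1/(1 - f).
    fractions = [0.5, 0.5, 0.5, 0.5, 0.4, 0.4]  # invented for illustration

    speedup = 1.0
    for i, f in enumerate(fractions, start=1):
        speedup *= 1.0 / (1.0 - f)
        print(f"pass {i}: removed {f:.0%} of remaining time, {speedup:.1f}x total")

Six unheroic passes like these already compound to roughly 44x, the same order as the 43x above.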
After that kind of experience, when you approach new programming problems, you come to recognize the design approaches that initially lead to such inefficiencies. In my experience, they come from over-designed data structures, non-normalized data, massive reliance on notifications, that sort of thing.
I don't know details of MATLAB's memory handling, so this is just an educated guess: Cell arrays are implemented so that the array itself contains only references (pointers) to the cells, which actually live in the heap. It definitely can't allocate memory for the actual cells in advance because, as you wrote, their size is unknown. However, it can pre-allocate the pointer array itself, since the size of the pointers is known.
When you think about it, it would be quite difficult to implement an array whose element size wasn't constant: how would you know where in memory X[1234] lives, if the size of each element can be different? A layer of indirection (storing constant-sized pointers that point to the actual data) is therefore quite useful. An alternative would be some sort of linked list, which comes with a different set of trade-offs.
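I can't show MATLAB's internals, but the layout I'm guessing at is easy to sketch in another language. Here is a Python analogy (a Python list is itself an array of references, so the parallel is close):

    # A cell-array-like structure: the array itself stores only fixed-size
    # references, so it can be preallocated even though cell sizes are unknown.
    n = 5
    cells = [None] * n  # preallocate the reference array up front

    # Fill in cells of wildly different sizes later; only the heap objects
    # they point to vary, the reference array itself never has to move.
    cells[0] = 3.14
    cells[1] = "a string of arbitrary length"
    cells[2] = list(range(1000))

    # Indexing stays O(1): cells[i] reads one fixed-size slot and follows
    # the reference, with no need to know the sizes of the other elements.
    print(len(cells[2]))  # -> 1000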
Best Answer
What is the "ninja performance gap?"
A naive approach to finishing many tasks might be to start one task, wait for it to finish, and only then start the next. Suppose one wanted to retrieve many networked resources. The naive approach would be to send a request for the first resource, wait for the request to complete, then send the next.
A programmer skilled at concurrency, by contrast, could write a program that executes many actions at a single time, using all the resources of the computer on which it runs (processors, system memory, etc.). Using the network resource example above, the optimal approach would be to request many resources at the same time.
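As a rough illustration of the two approaches, here is a Python sketch; the URL is a placeholder and error handling is omitted.

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    URLS = ["https://example.com/"] * 8  # placeholder resources

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    # Naive approach: request a resource, wait for it, then request the next.
    start = time.time()
    for url in URLS:
        fetch(url)
    print(f"sequential: {time.time() - start:.2f}s")

    # Concurrent approach: issue all requests at once; total time is roughly
    # that of the slowest single request rather than the sum of all of them.
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
        list(pool.map(fetch, URLS))
    print(f"concurrent: {time.time() - start:.2f}s")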
The gap between the naive approach and the optimal approach is what these authors (and apparently others) are calling the "ninja performance gap."
Were these words chosen well?
They were likely chosen for promotional and novelty value. Since the phenomenon appears to be a very real problem, I would have preferred a term focused on the problem rather than on the actors on one side of the gap.
Perhaps "optimized concurrency gap" would have been a better choice of words.
Why is it so large?
The naive approach to solving many problems with computers can be quite inefficient relative to known improvements.
My own experience with Project Euler demonstrated this to me: some of my solutions would not complete in any reasonable amount of time (minutes, hours), while other solutions to the same problem finished blazingly fast (within a few seconds).
Bubble sort, for example, is known to be quite inefficient relative to other sorting algorithms: it does O(n^2) comparisons, where a good general-purpose sort does O(n log n).
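You can feel that difference with a few lines of Python (the input size is arbitrary):

    import random
    import time

    def bubble_sort(a):
        # O(n^2): repeatedly swap adjacent out-of-order pairs.
        a = list(a)
        for i in range(len(a)):
            for j in range(len(a) - 1 - i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]
        return a

    data = [random.random() for _ in range(10_000)]

    start = time.time()
    bubble_sort(data)
    print(f"bubble sort:   {time.time() - start:.2f}s")

    start = time.time()
    sorted(data)  # the built-in O(n log n) sort
    print(f"built-in sort: {time.time() - start:.2f}s")

On a typical machine the bubble sort takes seconds while the built-in sort finishes in milliseconds, and the ratio keeps growing with the input size.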
Do we question the size of it?
Based on the above knowledge, I do not question the size as stated.
How can it be overcome?
The obvious solution would be for the programmer to study threading and all of the other relevant skills needed to close this gap. Closing the gap in this manner would take a lot of time and effort, both to gain the skills and to write the programs.
How do we learn to overcome it? Should we learn skillful concurrency or learn algorithms?
The paper's abstract states that using better algorithms combined with better compilers can substantially narrow the gap with far less effort. I am inclined, without direct knowledge of the details, to believe them.