Performance Optimization – When to Optimize for Memory vs Speed

functionsmemorymemory usageperformancespeed

I recently interviewed at Amazon. During a coding session, the interviewer asked why I declared a variable in a method. I explained my process and he challenged me to solve the same problem with fewer variables. For example (this wasn't from the interview), I started with Method A then improved it to Method B, by removing int s. He was pleased and said this would reduce memory usage by this method.

I understand the logic behind it, but my question is:

When is it appropriate to use Method A vs. Method B, and vice versa?

You can see that Method A is going to have higher memory usage, since int s is declared, but it only has to perform one calculation, i.e. a + b. On the other hand, Method B has lower memory usage, but has to perform two calculations, i.e. a + b twice. When do I use one technique over the other? Or, is one of the techniques always preferred over the other? What are things to consider when evaluating the two methods?

Method A:

private bool IsSumInRange(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

Method B:

private bool IsSumInRange(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}

Best Answer

Instead of speculating about what may or may not happen, let's just look, shall we? I'll have to use C++ since I don't have a C# compiler handy (though see the C# example from VisualMelon), but I'm sure the same principles apply regardless.

We'll include the two alternatives you encountered in the interview. We'll also include a version that uses abs as suggested by some of the answers.

#include <cstdlib>

bool IsSumInRangeWithVar(int a, int b)
{
    int s = a + b;

    if (s > 1000 || s < -1000) return false;
    else return true;
}

bool IsSumInRangeWithoutVar(int a, int b)
{
    if (a + b > 1000 || a + b < -1000) return false;
    else return true;
}

bool IsSumInRangeSuperOptimized(int a, int b) {
    return (abs(a + b) < 1000);
}

Now compile it with no optimization whatsoever: g++ -c -o test.o test.cpp

Now we can see precisely what this generates: objdump -d test.o

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   55                      push   %rbp              # begin a call frame
   1:   48 89 e5                mov    %rsp,%rbp
   4:   89 7d ec                mov    %edi,-0x14(%rbp)  # save first argument (a) on stack
   7:   89 75 e8                mov    %esi,-0x18(%rbp)  # save b on stack
   a:   8b 55 ec                mov    -0x14(%rbp),%edx  # load a and b into edx
   d:   8b 45 e8                mov    -0x18(%rbp),%eax  # load b into eax
  10:   01 d0                   add    %edx,%eax         # add a and b
  12:   89 45 fc                mov    %eax,-0x4(%rbp)   # save result as s on stack
  15:   81 7d fc e8 03 00 00    cmpl   $0x3e8,-0x4(%rbp) # compare s to 1000
  1c:   7f 09                   jg     27                # jump to 27 if it's greater
  1e:   81 7d fc 18 fc ff ff    cmpl   $0xfffffc18,-0x4(%rbp) # compare s to -1000
  25:   7d 07                   jge    2e                # jump to 2e if it's greater or equal
  27:   b8 00 00 00 00          mov    $0x0,%eax         # put 0 (false) in eax, which will be the return value
  2c:   eb 05                   jmp    33 <_Z19IsSumInRangeWithVarii+0x33>
  2e:   b8 01 00 00 00          mov    $0x1,%eax         # put 1 (true) in eax
  33:   5d                      pop    %rbp
  34:   c3                      retq

0000000000000035 <_Z22IsSumInRangeWithoutVarii>:
  35:   55                      push   %rbp
  36:   48 89 e5                mov    %rsp,%rbp
  39:   89 7d fc                mov    %edi,-0x4(%rbp)
  3c:   89 75 f8                mov    %esi,-0x8(%rbp)
  3f:   8b 55 fc                mov    -0x4(%rbp),%edx
  42:   8b 45 f8                mov    -0x8(%rbp),%eax  # same as before
  45:   01 d0                   add    %edx,%eax
  # note: unlike other implementation, result is not saved
  47:   3d e8 03 00 00          cmp    $0x3e8,%eax      # compare to 1000
  4c:   7f 0f                   jg     5d <_Z22IsSumInRangeWithoutVarii+0x28>
  4e:   8b 55 fc                mov    -0x4(%rbp),%edx  # since s wasn't saved, load a and b from the stack again
  51:   8b 45 f8                mov    -0x8(%rbp),%eax
  54:   01 d0                   add    %edx,%eax
  56:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax # compare to -1000
  5b:   7d 07                   jge    64 <_Z22IsSumInRangeWithoutVarii+0x2f>
  5d:   b8 00 00 00 00          mov    $0x0,%eax
  62:   eb 05                   jmp    69 <_Z22IsSumInRangeWithoutVarii+0x34>
  64:   b8 01 00 00 00          mov    $0x1,%eax
  69:   5d                      pop    %rbp
  6a:   c3                      retq

000000000000006b <_Z26IsSumInRangeSuperOptimizedii>:
  6b:   55                      push   %rbp
  6c:   48 89 e5                mov    %rsp,%rbp
  6f:   89 7d fc                mov    %edi,-0x4(%rbp)
  72:   89 75 f8                mov    %esi,-0x8(%rbp)
  75:   8b 55 fc                mov    -0x4(%rbp),%edx
  78:   8b 45 f8                mov    -0x8(%rbp),%eax
  7b:   01 d0                   add    %edx,%eax
  7d:   3d 18 fc ff ff          cmp    $0xfffffc18,%eax
  82:   7c 16                   jl     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  84:   8b 55 fc                mov    -0x4(%rbp),%edx
  87:   8b 45 f8                mov    -0x8(%rbp),%eax
  8a:   01 d0                   add    %edx,%eax
  8c:   3d e8 03 00 00          cmp    $0x3e8,%eax
  91:   7f 07                   jg     9a <_Z26IsSumInRangeSuperOptimizedii+0x2f>
  93:   b8 01 00 00 00          mov    $0x1,%eax
  98:   eb 05                   jmp    9f <_Z26IsSumInRangeSuperOptimizedii+0x34>
  9a:   b8 00 00 00 00          mov    $0x0,%eax
  9f:   5d                      pop    %rbp
  a0:   c3                      retq

We can see from the stack addresses (for example, the -0x4 in mov %edi,-0x4(%rbp) versus the -0x14 in mov %edi,-0x14(%rbp)) that IsSumInRangeWithVar() uses 16 extra bytes on the stack.

Because IsSumInRangeWithoutVar() allocates no space on the stack to store the intermediate value s it has to recalculate it, resulting in this implementation being 2 instructions longer.

Funny, IsSumInRangeSuperOptimized() looks a lot like IsSumInRangeWithoutVar(), except it compares to -1000 first, and 1000 second.

Now let's compile with only the most basic optimizations: g++ -O1 -c -o test.o test.cpp. The result:

0000000000000000 <_Z19IsSumInRangeWithVarii>:
   0:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
   7:   3d d0 07 00 00          cmp    $0x7d0,%eax
   c:   0f 96 c0                setbe  %al
   f:   c3                      retq

0000000000000010 <_Z22IsSumInRangeWithoutVarii>:
  10:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  17:   3d d0 07 00 00          cmp    $0x7d0,%eax
  1c:   0f 96 c0                setbe  %al
  1f:   c3                      retq

0000000000000020 <_Z26IsSumInRangeSuperOptimizedii>:
  20:   8d 84 37 e8 03 00 00    lea    0x3e8(%rdi,%rsi,1),%eax
  27:   3d d0 07 00 00          cmp    $0x7d0,%eax
  2c:   0f 96 c0                setbe  %al
  2f:   c3                      retq

Would you look at that: each variant is identical. The compiler is able to do something quite clever: abs(a + b) <= 1000 is equivalent to a + b + 1000 <= 2000 considering setbe does an unsigned comparison, so a negative number becomes a very large positive number. The lea instruction can actually perform all these additions in one instruction, and eliminate all the conditional branches.

To answer your question, almost always the thing to optimize for is not memory or speed, but readability. Reading code is a lot harder than writing it, and reading code that's been mangled to "optimize" it is a lot harder than reading code that's been written to be clear. More often than not, these "optimizations" have negligible, or as in this case exactly zero actual impact on performance.


Follow up question, what changes when this code is in an interpreted language instead of compiled? Then, does the optimization matter or does it have the same result?

Let's measure! I've transcribed the examples to Python:

def IsSumInRangeWithVar(a, b):
    s = a + b
    if s > 1000 or s < -1000:
        return False
    else:
        return True

def IsSumInRangeWithoutVar(a, b):
    if a + b > 1000 or a + b < -1000:
        return False
    else:
        return True

def IsSumInRangeSuperOptimized(a, b):
    return abs(a + b) <= 1000

from dis import dis
print('IsSumInRangeWithVar')
dis(IsSumInRangeWithVar)

print('\nIsSumInRangeWithoutVar')
dis(IsSumInRangeWithoutVar)

print('\nIsSumInRangeSuperOptimized')
dis(IsSumInRangeSuperOptimized)

print('\nBenchmarking')
import timeit
print('IsSumInRangeWithVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeWithoutVar: %fs' % (min(timeit.repeat(lambda: IsSumInRangeWithoutVar(42, 42), repeat=50, number=100000)),))
print('IsSumInRangeSuperOptimized: %fs' % (min(timeit.repeat(lambda: IsSumInRangeSuperOptimized(42, 42), repeat=50, number=100000)),))

Run with Python 3.5.2, this produces the output:

IsSumInRangeWithVar
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 STORE_FAST               2 (s)

  3          10 LOAD_FAST                2 (s)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               4 (>)
             19 POP_JUMP_IF_TRUE        34
             22 LOAD_FAST                2 (s)
             25 LOAD_CONST               4 (-1000)
             28 COMPARE_OP               0 (<)
             31 POP_JUMP_IF_FALSE       38

  4     >>   34 LOAD_CONST               2 (False)
             37 RETURN_VALUE

  6     >>   38 LOAD_CONST               3 (True)
             41 RETURN_VALUE
             42 LOAD_CONST               0 (None)
             45 RETURN_VALUE

IsSumInRangeWithoutVar
  9           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 LOAD_CONST               1 (1000)
             10 COMPARE_OP               4 (>)
             13 POP_JUMP_IF_TRUE        32
             16 LOAD_FAST                0 (a)
             19 LOAD_FAST                1 (b)
             22 BINARY_ADD
             23 LOAD_CONST               4 (-1000)
             26 COMPARE_OP               0 (<)
             29 POP_JUMP_IF_FALSE       36

 10     >>   32 LOAD_CONST               2 (False)
             35 RETURN_VALUE

 12     >>   36 LOAD_CONST               3 (True)
             39 RETURN_VALUE
             40 LOAD_CONST               0 (None)
             43 RETURN_VALUE

IsSumInRangeSuperOptimized
 15           0 LOAD_GLOBAL              0 (abs)
              3 LOAD_FAST                0 (a)
              6 LOAD_FAST                1 (b)
              9 BINARY_ADD
             10 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             13 LOAD_CONST               1 (1000)
             16 COMPARE_OP               1 (<=)
             19 RETURN_VALUE

Benchmarking
IsSumInRangeWithVar: 0.019361s
IsSumInRangeWithoutVar: 0.020917s
IsSumInRangeSuperOptimized: 0.020171s

Disassembly in Python isn't terribly interesting, since the bytecode "compiler" doesn't do much in the way of optimization.

The performance of the three functions is nearly identical. We might be tempted to go with IsSumInRangeWithVar() due to it's marginal speed gain. Though I'll add as I was trying different parameters to timeit, sometimes IsSumInRangeSuperOptimized() came out fastest, so I suspect it may be external factors responsible for the difference, rather than any intrinsic advantage of any implementation.

If this is really performance critical code, an interpreted language is simply a very poor choice. Running the same program with pypy, I get:

IsSumInRangeWithVar: 0.000180s
IsSumInRangeWithoutVar: 0.001175s
IsSumInRangeSuperOptimized: 0.001306s

Just using pypy, which uses JIT compilation to eliminate a lot of the interpreter overhead, has yielded a performance improvement of 1 or 2 orders of magnitude. I was quite shocked to see IsSumInRangeWithVar() is an order of magnitude faster than the others. So I changed the order of the benchmarks and ran again:

IsSumInRangeSuperOptimized: 0.000191s
IsSumInRangeWithoutVar: 0.001174s
IsSumInRangeWithVar: 0.001265s

So it seems it's not actually anything about the implementation that makes it fast, but rather the order in which I do the benchmarking!

I'd love to dig in to this more deeply, because honestly I don't know why this happens. But I believe the point has been made: micro-optimizations like whether to declare an intermediate value as a variable or not are rarely relevant. With an interpreted language or highly optimized compiler, the first objective is still to write clear code.

If further optimization might be required, benchmark. Remember that the best optimizations come not from the little details but the bigger algorithmic picture: pypy is going to be an order of magnitude faster for repeated evaluation of the same function than cpython because it uses faster algorithms (JIT compiler vs interpretation) to evaluate the program. And there's the coded algorithm to consider as well: a search through a B-tree will be faster than a linked list.

After ensuring you're using the right tools and algorithms for the job, be prepared to dive deep into the details of the system. The results can be very surprising, even for experienced developers, and this is why you must have a benchmark to quantify the changes.

Related Topic