Do to increase the performance of a Lua program

luaoptimizationperformance

I asked a question about Lua perfromance, and on of the responses asked:

Have you studied general tips for keeping Lua performance high? i.e. know table creation and rather reuse a table than create a new one, use of 'local print=print' and such to avoid global accesses.

This is a slightly different question from Lua Patterns,Tips and Tricks because I'd like answers that specifically impact performance and (if possible) an explanation of why performance is impacted.

One tip per answer would be ideal.

Best Answer

In response to some of the other answers and comments:

It is true that as a programmer you should generally avoid premature optimization. But. This is not so true for scripting languages where the compiler does not optimize much -- or at all.

So, whenever you write something in Lua, and that is executed very often, is run in a time-critical environment or could run for a while, it is a good thing to know things to avoid (and avoid them).

This is a collection of what I found out over time. Some of it I found out over the net, but being of a suspicious nature when the interwebs are concerned I tested all of it myself. Also, I have read the Lua performance paper at Lua.org.

Some reference:

Avoid globals

This is one of the most common hints, but stating it once more can't hurt.

Globals are stored in a hashtable by their name. Accessing them means you have to access a table index. While Lua has a pretty good hashtable implementation, it's still a lot slower than accessing a local variable. If you have to use globals, assign their value to a local variable, this is faster at the 2nd variable access.

do
  x = gFoo + gFoo;
end
do -- this actually performs better.
  local lFoo = gFoo;
  x = lFoo + lFoo;
end

(Not that simple testing may yield different results. eg. local x; for i=1, 1000 do x=i; end here the for loop header takes actually more time than the loop body, thus profiling results could be distorted.)

Avoid string creation

Lua hashes all strings on creation, this makes comparison and using them in tables very fast and reduces memory use since all strings are stored internally only once. But it makes string creation more expensive.

A popular option to avoid excessive string creation is using tables. For example, if you have to assemble a long string, create a table, put the individual strings in there and then use table.concat to join it once

-- do NOT do something like this
local ret = "";
for i=1, C do
  ret = ret..foo();
end

If foo() would return only the character A, this loop would create a series of strings like "", "A", "AA", "AAA", etc. Each string would be hashed and reside in memory until the application finishes -- see the problem here?

-- this is a lot faster
local ret = {};
for i=1, C do
  ret[#ret+1] = foo();
end
ret = table.concat(ret);

This method does not create strings at all during the loop, the string is created in the function foo and only references are copied into the table. Afterwards, concat creates a second string "AAAAAA..." (depending on how large C is). Note that you could use i instead of #ret+1 but often you don't have such a useful loop and you won't have an iterator variable you can use.

Another trick I found somewhere on lua-users.org is to use gsub if you have to parse a string

some_string:gsub(".", function(m)
  return "A";
end);

This looks odd at first, the benefit is that gsub creates a string "at once" in C which is only hashed after it is passed back to lua when gsub returns. This avoids table creation, but possibly has more function overhead (not if you call foo() anyway, but if foo() is actually an expression)

Avoid function overhead

Use language constructs instead of functions where possible

function `ipairs`

When iterating a table, the function overhead from ipairs does not justify it's use. To iterate a table, instead use

for k=1, #tbl do local v = tbl[k];

It does exactly the same without the function call overhead (pairs actually returns another function which is then called for every element in the table while #tbl is only evaluated once). It's a lot faster, even if you need the value. And if you don't...

Note for Lua 5.2: In 5.2 you can actually define a __ipairs field in the metatable, which does make ipairs useful in some cases. However, Lua 5.2 also makes the __len field work for tables, so you might still prefer the above code to ipairs as then the __len metamethod is only called once, while for ipairs you would get an additional function call per iteration.

functions `table.insert`, `table.remove`

Simple uses of table.insert and table.remove can be replaced by using the # operator instead. Basically this is for simple push and pop operations. Here are some examples:

table.insert(foo, bar);
-- does the same as
foo[#foo+1] = bar;

local x = table.remove(foo);
-- does the same as
local x = foo[#foo];
foo[#foo] = nil;

For shifts (eg. table.remove(foo, 1)), and if ending up with a sparse table is not desirable, it is of course still better to use the table functions.

Use tables for SQL-IN alike compares

You might - or might not - have decisions in your code like the following

if a == "C" or a == "D" or a == "E" or a == "F" then
   ...
end

Now this is a perfectly valid case, however (from my own testing) starting with 4 comparisons and excluding table generation, this is actually faster:

local compares = { C = true, D = true, E = true, F = true };
if compares[a] then
   ...
end

And since hash tables have constant look up time, the performance gain increases with every additional comparison. On the other hand if "most of the time" one or two comparisons match, you might be better off with the Boolean way or a combination.

Avoid frequent table creation

This is discussed thoroughly in Lua Performance Tips. Basically the problem is that Lua allocates your table on demand and doing it this way will actually take more time than cleaning it's content and filling it again.

However, this is a bit of a problem, since Lua itself does not provide a method for removing all elements from a table, and pairs() is not the performance beast itself. I have not done any performance testing on this problem myself yet.

If you can, define a C function that clears a table, this should be a good solution for table reuse.

Avoid doing the same over and over

This is the biggest problem, I think. While a compiler in a non-interpreted language can easily optimize away a lot of redundancies, Lua will not.

Memoize

Using tables this can be done quite easily in Lua. For single-argument functions you can even replace them with a table and __index metamethod. Even though this destroys transparancy, performance is better on cached values due to one less function call.

Here is an implementation of memoization for a single argument using a metatable. (Important: This variant does not support a nil value argument, but is pretty damn fast for existing values.)

function tmemoize(func)
    return setmetatable({}, {
        __index = function(self, k)
            local v = func(k);
            self[k] = v
            return v;
        end
    });
end
-- usage (does not support nil values!)
local mf = tmemoize(myfunc);
local v  = mf[x];

You could actually modify this pattern for multiple input values

Partial application

The idea is similar to memoization, which is to "cache" results. But here instead of caching the results of the function, you would cache intermediate values by putting their calculation in a constructor function that defines the calculation function in it's block. In reality I would just call it clever use of closures.

-- Normal function
function foo(a, b, x)
    return cheaper_expression(expensive_expression(a,b), x);
end
-- foo(a,b,x1);
-- foo(a,b,x2);
-- ...

-- Partial application
function foo(a, b)
    local C = expensive_expression(a,b);
    return function(x)
        return cheaper_expression(C, x);
    end
end
-- local f = foo(a,b);
-- f(x1);
-- f(x2);
-- ...

This way it is possible to easily create flexible functions that cache some of their work without too much impact on program flow.

An extreme variant of this would be Currying, but that is actually more a way to mimic functional programming than anything else.

Here is a more extensive ("real world") example with some code omissions, otherwise it would easily take up the whole page here (namely get_color_values actually does a lot of value checking and recognizes accepts mixed values)

function LinearColorBlender(col_from, col_to)
    local cfr, cfg, cfb, cfa = get_color_values(col_from);
    local ctr, ctg, ctb, cta = get_color_values(col_to);
    local cdr, cdg, cdb, cda = ctr-cfr, ctg-cfg, ctb-cfb, cta-cfa;
    if not cfr or not ctr then
        error("One of given arguments is not a color.");
    end

    return function(pos)
        if type(pos) ~= "number" then
            error("arg1 (pos) must be in range 0..1");
        end
        if pos < 0 then pos = 0; end;
        if pos > 1 then pos = 1; end;
        return cfr + cdr*pos, cfg + cdg*pos, cfb + cdb*pos, cfa + cda*pos;
    end
end
-- Call 
local blender = LinearColorBlender({1,1,1,1},{0,0,0,1});
object:SetColor(blender(0.1));
object:SetColor(blender(0.3));
object:SetColor(blender(0.7));

You can see that once the blender was created, the function only has to sanity-check a single value instead of up to eight. I even extracted the difference calculation, though it probably does not improve a lot, I hope it shows what this pattern tries to achieve.

Related Solutions

C++ – the most effective way for float and double comparison

Be extremely careful using any of the other suggestions. It all depends on context.

I have spent a long time tracing a bugs in a system that presumed a==b if |a-b|<epsilon. The underlying problems were:

The implicit presumption in an algorithm that if a==b and b==c then a==c.
Using the same epsilon for lines measured in inches and lines measured in mils (.001 inch). That is a==b but 1000a!=1000b. (This is why AlmostEqual2sComplement asks for the epsilon or max ULPS).
The use of the same epsilon for both the cosine of angles and the length of lines!
Using such a compare function to sort items in a collection. (In this case using the builtin C++ operator == for doubles produced correct results.)

Like I said: it all depends on context and the expected size of a and b.

BTW, std::numeric_limits<double>::epsilon() is the "machine epsilon". It is the difference between 1.0 and the next value representable by a double. I guess that it could be used in the compare function but only if the expected values are less than 1. (This is in response to @cdv's answer...)

Also, if you basically have int arithmetic in doubles (here we use doubles to hold int values in certain cases) your arithmetic will be correct. For example 4.0/2.0 will be the same as 1.0+1.0. This is as long as you do not do things that result in fractions (4.0/3.0) or do not go outside of the size of an int.

Java – What are the effects of exceptions on performance in Java

It depends how exceptions are implemented. The simplest way is using setjmp and longjmp. That means all registers of the CPU are written to the stack (which already takes some time) and possibly some other data needs to be created... all this already happens in the try statement. The throw statement needs to unwind the stack and restore the values of all registers (and possible other values in the VM). So try and throw are equally slow, and that is pretty slow, however if no exception is thrown, exiting the try block takes no time whatsoever in most cases (as everything is put on the stack which cleans up automatically if the method exists).

Sun and others recognized, that this is possibly suboptimal and of course VMs get faster and faster over the time. There is another way to implement exceptions, which makes try itself lightning fast (actually nothing happens for try at all in general - everything that needs to happen is already done when the class is loaded by the VM) and it makes throw not quite as slow. I don't know which JVM uses this new, better technique...

...but are you writing in Java so your code later on only runs on one JVM on one specific system? Since if it may ever run on any other platform or any other JVM version (possibly of any other vendor), who says they also use the fast implementation? The fast one is more complicated than the slow one and not easily possible on all systems. You want to stay portable? Then don't rely on exceptions being fast.

It also makes a big difference what you do within a try block. If you open a try block and never call any method from within this try block, the try block will be ultra fast, as the JIT can then actually treat a throw like a simple goto. It neither needs to save stack-state nor does it need to unwind the stack if an exception is thrown (it only needs to jump to the catch handlers). However, this is not what you usually do. Usually you open a try block and then call a method that might throw an exception, right? And even if you just use the try block within your method, what kind of method will this be, that does not call any other method? Will it just calculate a number? Then what for do you need exceptions? There are much more elegant ways to regulate program flow. For pretty much anything else but simple math, you will have to call an external method and this already destroys the advantage of a local try block.

See the following test code:

public class Test {
    int value;


    public int getValue() {
        return value;
    }

    public void reset() {
        value = 0;
    }

    // Calculates without exception
    public void method1(int i) {
        value = ((value + i) / i) << 1;
        // Will never be true
        if ((i & 0xFFFFFFF) == 1000000000) {
            System.out.println("You'll never see this!");
        }
    }

    // Could in theory throw one, but never will
    public void method2(int i) throws Exception {
        value = ((value + i) / i) << 1;
        // Will never be true
        if ((i & 0xFFFFFFF) == 1000000000) {
            throw new Exception();
        }
    }

    // This one will regularly throw one
    public void method3(int i) throws Exception {
        value = ((value + i) / i) << 1;
        // i & 1 is equally fast to calculate as i & 0xFFFFFFF; it is both
        // an AND operation between two integers. The size of the number plays
        // no role. AND on 32 BIT always ANDs all 32 bits
        if ((i & 0x1) == 1) {
            throw new Exception();
        }
    }

    public static void main(String[] args) {
        int i;
        long l;
        Test t = new Test();

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            t.method1(i);
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method1 took " + l + " ms, result was " + t.getValue()
        );

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            try {
                t.method2(i);
            } catch (Exception e) {
                System.out.println("You'll never see this!");
            }
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method2 took " + l + " ms, result was " + t.getValue()
        );

        l = System.currentTimeMillis();
        t.reset();
        for (i = 1; i < 100000000; i++) {
            try {
                t.method3(i);
            } catch (Exception e) {
                // Do nothing here, as we will get here
            }
        }
        l = System.currentTimeMillis() - l;
        System.out.println(
            "method3 took " + l + " ms, result was " + t.getValue()
        );
    }
}

Result:

method1 took 972 ms, result was 2
method2 took 1003 ms, result was 2
method3 took 66716 ms, result was 2

The slowdown from the try block is too small to rule out confounding factors such as background processes. But the catch block killed everything and made it 66 times slower!

As I said, the result will not be that bad if you put try/catch and throw all within the same method (method3), but this is a special JIT optimization I would not rely upon. And even when using this optimization, the throw is still pretty slow. So I don't know what you are trying to do here, but there is definitely a better way of doing it than using try/catch/throw.