C++ – Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)

algorithmbignumcx86

Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).

I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine but may be generalized to any usage case) and I find the performance of the current multiply and divide operations to be quite abysmal (yes, I have timed them, see below), and I'd like to improve on (or change) the algorithms that do the low-level number crunching.

When it comes to the multiply and divide operators, they are unbearably slow compared to just about everything else in the class.

These are the approximate measurements relative to my own computer:

Raw times as defined by QueryPerformanceFrequency:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~546u
Division:           ~4760u (with maximum bit count)

As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.

I'd like to improve the speed of these two operators because there may be a very high amount of computations being done per frame (dot products, various collision detection methods, etc).

The structure (methods omitted) looks somewhat like:

class uint128_t
{
    public:
        unsigned long int dw3, dw2, dw1, dw0;
  //...
}

Multiplication is currently done using the typical long-multiplication method (in assembly so that I can catch the EDX output) while ignoring the words that go out of range (that is, I'm only doing 10 mull's compared to 16).

Division uses the shift-subtract algorithm (speed depends on bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to muster and decided to let the compiler optimize it.

I have Google'd around for several days looking at pages describing algorithms such as Karatsuba Multiplication, high-radix division, and Newton-Rapson Division but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.

For those that may deem my efforts "premature optimization"; I consider this code to be a bottleneck because the very elementary-math operations themselves become slow. I can ignore such types of optimization on higher-level code, but this code will be called/used enough for it to matter.

I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible), and a basic (hopefully easy to understand) explanation on how the suggested algorithm works would be highly appreciated.

EDIT: Multiply Improvements

I was able to improve the multiply operation by inlining code into operator*= and it seems as fast as possible.

Updated raw times:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~100u (lowest ~86u, highest around ~256u)
Division:           ~4760u (with maximum bit count)

Here's some bare-bones code for you to examine (note that my type names are actually different, this was edited for simplicity):

//File: "int128_t.h"
class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

    // Various constrctors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            \n\
        movl       %%eax,   %%ebp                   \n\
        movl       $0x00,   %%ebx                   \n\
        movl       $0x00,   %%ecx                   \n\
        movl       $0x00,   %%esi                   \n\
        movl       $0x00,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ebx                   \n\
        adcl       %%edx,   %%ecx                   \n\
        adcl       $0x00,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  \n\
        mull               (%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        pop        %%ebp                            \n"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};

As for division, examining the code is rather pointless, as I will need to change the mathematical algorithm to see any substantial benefits. The only feasible choice seems to be high-radix division, but I have yet to iron out (in my mind) just how it will work.

Best Answer

I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on the Karatsuba Multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.

One suggestion I do have is to try to use the smallest blocks of inline assembly, rather than coding your logic in assembly. You could write a function:

struct div_result { u_int x[2]; };
static inline void mul_add(int a, int b, struct div_result *res);

The function would be implemented in inline assembly, and you'll call it from C++ code. It should be as efficient as pure assembly, and much easier to code.

About division, I don't know. Most algorithms I saw talk about asymptotic efficiency, which may mean they're efficient only for very high numbers of bits.

Related Solutions

Dumping the values of the registers in GCC

There is no need to use assembler just to get the registers.

You can just use setjmp. That will write all registers into a stucture of type jmp_buf. It even kind of Works cross platform except for the fact that jmp_buf itself is different for each architecture.

However, calling setjmp (and calling your assembler code as well) will change some of the registers, so you can't really trust them.

There is a way to get a real snapshot, but that's a bit more difficult and highly OS dependent:

install an exception handler for the illegal illegal opcode extension. The handler can be either a real interrupt, a signal handler or a OS exception handler (the try/except blocks form C++ will not work).
Emit an illegal opcode in your code.

The trick here is, that the illegal opcode has no register side-effects. The exception handler can copy the registers either from the stack or from a exception info structure.

The same trick may work with breakpoint interrupts forced overflows, traps or so. There is usually more than one way to raise an interrupt from a piece of code.

Regarding the EFLAGS: You can get them via a stack operation:

  PUSHFD
  POP EAX  
  , eax now contains the EFLAG data

C++ – Why does the program consume 100% CPU under nVidia NView

This is interesting because nView is a 3rd party DLL provided by NVidia. Postings on the internet about nview!NVLoadDatabase suggest that there is an unpatched defect in nview. This is supported by the fact that explorer uses 100% CPU, as confirmed by these reports. See: http://forums.nvidia.com/lofiversion/index.php?t36879.html

A detailed investigation of this problem is available on this site: http://blogs.technet.com/marcelofartura/archive/2007/02/28/real-case-random-apps-running-100-cpu.aspx

As per this article, the hang is due to an infinite loop in nview.dll. Although the assembly instructions and register values described online do not exactly match those in our log, they were close enough for me to conclude that it is the same issue.

To work around the problem, I disabled nView Desktop Manager (Right click on the desktop, select nView Properties, and click disable in the nView Desktop Manager groupbox). Before doing this I was able to consistently reproduce the hang. However, after disabling nView I could not reproduce the hang. Thus, this appears to be a viable workaround.

Anyway, I posted this up here in case it will be useful to anyone. It caused me a LOT of grief chasing this one down.

EDIT: Multiply Improvements

Best Answer

Related Solutions

Dumping the values of the registers in GCC

C++ – Why does the program consume 100% CPU under nVidia NView

Related Topic