How to clear a buffer in Assembly?

assembly x86

I'm programming in assembler and I've created an auxiliary buffer to temporarily store some characters and then write them to a file. The thing is that I have to use the auxiliary buffer several times, so I need to clean/restore it and I don't know how to accomplish this.

I'm using i386 architecture with Linux.

PS: I forgot, this is how I have the buffer declared:

section .bss           
      auxBuffer resb 1000

Best Answer

You clear a buffer by writing to it.
Obviously a location in memory is never truly empty, so clearing a buffer effectively means filling it with zeros.
There is nothing stopping you from filling it with 0xCAFEBABE hex values, but zeros are the standard convention.

mov dword ptr [buffer],0
mov dword ptr [buffer+4],0
mov ...              +8],0
....

Obviously this is a silly way of clearing a buffer. If the buffer were 40,000 bytes you'd need 10,000 instructions; possible but wasteful.

Instead you write a loop and use a counter to keep track of what you've written so far.

mov ecx,10000        //10000 x 4 = 40,000 bytes
lea edx,[buffer]     //point edx to the start of the block to be cleared.  
xor eax,eax          //xor reg,reg is the same as mov reg,0, but faster.
@loop:
mov [edx+ecx*4-4],eax  //fill the buffer starting at the end; the -4 is needed because ecx runs from 10000 down to 1.
dec ecx              //decrease the counter; also moving the destination.   
jnz @loop            //if ecx <> 0 (Non-Zero) then repeat.
ret                  //we are done, return.

ecx does double duty as a counter and as an index into the buffer.
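For readers who find C easier to follow, the counted loop corresponds roughly to the sketch below. It is only an illustration: clear_backwards and its parameters are names made up here, and the index is written as count - 1 so that exactly the dwords 0 to count-1 are touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the answer): zero `count` dwords,
   walking from the last dword down to the first, like the asm loop. */
static void clear_backwards(uint32_t *buffer, size_t count)
{
    assert(buffer != NULL || count == 0);
    while (count != 0) {        /* repeat until the counter reaches zero */
        buffer[count - 1] = 0;  /* the counter doubles as the index */
        count--;                /* plays the role of dec ecx */
    }
}
```

Calling clear_backwards(buf, 10000) on a 40,000-byte buffer writes every dword once and nothing outside the buffer.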
Note that xor eax,eax is the standard way of setting a register to zero. It is both shorter than mov eax,0 and faster because the CPU is hard-wired to give the former instruction preferential treatment.
There is however an even shorter way of doing this.
x86 has so called string instructions that accept a rep (repeat) prefix.
If thus prefixed the instruction will run ecx times.

push edi           //edi is non-volatile (callee-save) on both Windows and i386 Linux, so save it.
lea edi,[buffer]   //edi is the start of the buffer.
xor eax,eax        //fill the buffer with zeros.
mov ecx,10000      
rep stosd          //store string with 10000 dwords = 40000 bytes.
pop edi            //restore edi to its previous value.
ret                //return.

stosd (s̲t̲o̲re s̲tring per d̲word) uses edi as its D̲estination and ecx as the C̲ounter and eax as the source.
Note that stosd is a complex instruction. Modern x86 CPUs work faster with simple instructions and often an optimized (!) version of the second code snippet will work faster than a simple use of rep stosd.
rep stosd moves forward by default on Windows/Linux. You can set the direction flag (std) to make it move backwards, but then you'd have to restore the direction setting afterwards (cld).
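If the buffer is cleared from C rather than from a standalone assembly file, the same instruction can be reached through GNU-style inline assembly. The sketch below assumes GCC or Clang on x86 and falls back to memset elsewhere; clear_dwords is a made-up name, and rep stosl is the AT&T spelling of rep stosd. It also assumes the direction flag is clear on entry, as the calling conventions require.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero `count` dwords starting at dst. */
static void clear_dwords(uint32_t *dst, size_t count)
{
    assert(dst != NULL || count == 0);
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    /* edi/rdi = destination, ecx/rcx = count, eax = value to store.
       Both pointer and counter are modified, hence the "+" constraints. */
    __asm__ volatile ("rep stosl"
                      : "+D"(dst), "+c"(count)
                      : "a"(0)
                      : "memory");
#else
    memset(dst, 0, count * sizeof *dst);  /* portable fallback */
#endif
}
```

The compiler takes care of saving edi when it has to, so no explicit push/pop is needed in this form.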

Under Windows (and under the usual i386 Linux calling convention) eax, ecx and edx are volatile and can be changed at will. All other registers need to be preserved between calls.

Related Solutions

When is assembly faster than C?

Here is a real world example: Fixed point multiplies on old compilers.

These don't only come in handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.
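To make the "middle 32 bits" concrete, here is a small sketch with worked values. It assumes a 16.16 format (matching the shift by 16 above) and uses fixed-width types; fixed16, fixed_from_int and fixed_mul are names invented for the illustration. Right-shifting a negative 64-bit value is implementation-defined in C, so the sketch relies on the arithmetic shift that mainstream compilers provide.

```c
#include <assert.h>
#include <stdint.h>

/* 16.16 fixed point: the real value x is stored as x * 65536. */
typedef int32_t fixed16;

/* Hypothetical helper: convert a small integer to 16.16. */
static fixed16 fixed_from_int(int n)
{
    assert(n >= -32768 && n <= 32767);  /* must fit in the upper 16 bits */
    return (fixed16)(n * 65536);
}

static fixed16 fixed_mul(fixed16 a, fixed16 b)
{
    int64_t product = (int64_t)a * b;   /* widen first: full 64-bit product */
    return (fixed16)(product >> 16);    /* keep the middle 32 bits */
}
```

For example, 1.5 is 0x18000 and 2.0 is 0x20000; their 64-bit product is 0x300000000, and shifting right by 16 gives 0x30000, which is 3.0.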

x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 was also often done by a library routine (even though x86 can do such shifts itself).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register-allocated, and lets common subexpression elimination and constant propagation happen as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference for fixed-point divides is even bigger. I had improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of asm lines.
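The answer does not show its divide routine, so the following is only a plain C sketch of what a 16.16 fixed-point divide computes, not the author's asm. The names fixed16 and fixed_div are assumptions. The dividend is scaled by 2^16 in 64 bits before dividing, which is the 64-by-32 division that old compilers tended to hand off to a library call.

```c
#include <assert.h>
#include <stdint.h>

/* 16.16 fixed point, i.e. the same shift by 16 as FixedPointMul. */
typedef int32_t fixed16;

static fixed16 fixed_div(fixed16 a, fixed16 b)
{
    assert(b != 0);
    /* Multiply instead of shifting left so negative values stay well-defined. */
    int64_t scaled = (int64_t)a * 65536;
    return (fixed16)(scaled / b);
}
```

For example, 3.0 / 2.0 is 0x30000 / 0x20000: the scaled dividend is 0x300000000 and the quotient is 0x18000, which is 1.5.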

Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
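As a rough illustration of the two approaches, the sketch below pairs a portable bit-counting loop with the GNU C builtin. The function names are made up, and the preprocessor check is an assumption about how one might select the builtin; whether either version becomes a single popcnt instruction depends on the compiler and target flags.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static unsigned popcount_loop(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;   /* drops the lowest set bit */
        n++;
    }
    return n;
}

static unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);  /* GNU C builtin */
#else
    return popcount_loop(x);
#endif
}
```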

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
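For comparison, a byte swap written with plain shifts and masks looks like the sketch below (byte_swap32 is a name chosen here). Some compilers recognize this pattern and emit bswap for it, but that is an optimization you would have to confirm in the generated asm rather than something the language guarantees.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value. */
static uint32_t byte_swap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```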

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.
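The "simple loop" case looks like the sketch below. The function name and the restrict qualifiers are additions for the illustration; restrict tells the compiler the arrays don't overlap, which typically makes auto-vectorization easier, though whether it actually happens depends on the compiler and optimization level.

```c
#include <stddef.h>

/* Scalar loop of the kind compilers usually auto-vectorize well. */
static void scale_add(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * 10.0f;
}
```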

How to write self-modifying code in x86 assembly

Wow, this turned out to be a lot more painful than I expected. 100% of the pain was Linux protecting the program from being overwritten and/or executing data.

Two solutions are shown below. A lot of googling was involved: the fairly simple part, putting some instruction bytes in memory and executing them, was mine; the mprotect call and aligning on the page size were culled from Google searches, stuff I had to learn for this example.

The self-modifying code is straightforward. If you take the program, or at least just the two simple functions, compile and then disassemble, you will get the opcodes for those instructions. Or use nasm to assemble blocks of assembler, etc. From this I determined the opcode to load an immediate into eax and then return.

Ideally you simply put those bytes in some RAM and execute that RAM. To get Linux to do that you have to change the protection, which means you have to pass mprotect a pointer that is aligned on a page boundary. So allocate more than you need, find the address within that allocation that is on a page boundary, call mprotect from that address, then use that memory to put your opcodes in and execute them.

The second example takes an existing function compiled into the program. Again, because of the protection mechanism you cannot simply point at it and change bytes; you have to unprotect it from writes. So you have to back up to the prior page boundary and call mprotect with that address and enough bytes to cover the code to be modified. Then you can change the bytes/opcodes for that function in any way you want (so long as you don't spill over into any function you want to continue to use) and execute it. In this case you can see that fun() works, then I change it to simply return a value, call it again, and now it has been modified.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>   //for getpagesize()

unsigned char *testfun;

unsigned int fun ( unsigned int a )
{
    return(a+13);
}

unsigned int fun2 ( void )
{
    return(13);
}

int main ( void )
{
    unsigned int ra;
    unsigned int pagesize;
    unsigned char *ptr;
    unsigned int offset;

    pagesize=getpagesize();
    testfun=malloc(1023+pagesize+1);
    if(testfun==NULL) return(1);
    //need to align the address on a page boundary
    printf("%p\n",testfun);
    testfun = (unsigned char *)(((long)testfun + pagesize-1) & ~(pagesize-1));
    printf("%p\n",testfun);

    if(mprotect(testfun, 1024, PROT_READ|PROT_EXEC|PROT_WRITE))
    {
        printf("mprotect failed\n");
        return(1);
    }

    //400687: b8 0d 00 00 00          mov    $0xd,%eax
    //40068d: c3                      retq

    testfun[ 0]=0xb8;
    testfun[ 1]=0x0d;
    testfun[ 2]=0x00;
    testfun[ 3]=0x00;
    testfun[ 4]=0x00;
    testfun[ 5]=0xc3;

    ra=((unsigned int (*)())testfun)();
    printf("0x%02X\n",ra);


    testfun[ 0]=0xb8;
    testfun[ 1]=0x20;
    testfun[ 2]=0x00;
    testfun[ 3]=0x00;
    testfun[ 4]=0x00;
    testfun[ 5]=0xc3;

    ra=((unsigned int (*)())testfun)();
    printf("0x%02X\n",ra);


    printf("%p\n",fun);
    offset=(unsigned int)(((long)fun)&(pagesize-1));
    ptr=(unsigned char *)((long)fun&(~(pagesize-1)));


    printf("%p 0x%X\n",ptr,offset);

    if(mprotect(ptr, pagesize, PROT_READ|PROT_EXEC|PROT_WRITE))
    {
        printf("mprotect failed\n");
        return(1);
    }

    //for(ra=0;ra<20;ra++) printf("0x%02X,",ptr[offset+ra]); printf("\n");

    ra=4;
    ra=fun(ra);
    printf("0x%02X\n",ra);

    ptr[offset+0]=0xb8;
    ptr[offset+1]=0x22;
    ptr[offset+2]=0x00;
    ptr[offset+3]=0x00;
    ptr[offset+4]=0x00;
    ptr[offset+5]=0xc3;

    ra=4;
    ra=fun(ra);
    printf("0x%02X\n",ra);

    return(0);
}
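The page-alignment arithmetic used in both halves of the program can be isolated as in the sketch below. The helper names are made up, and uintptr_t is used here instead of the casts through long in the program above; the masks only work when the page size is a power of two, which the assert checks.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to the start of the page that contains it. */
static uintptr_t page_align_down(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

/* Round addr up to the next page boundary (unchanged if already aligned). */
static uintptr_t page_align_up(uintptr_t addr, uintptr_t page_size)
{
    return page_align_down(addr + page_size - 1, page_size);
}
```

Rounding up is what the malloc case needs (the aligned address must stay inside the allocation), while rounding down is what the existing-function case needs (the page must contain the start of fun).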
