Processing byte pixels with SSE/SSE2 intrinsics in C

c image-processing optimization webcam

I am writing a cross-platform C library that does various things to webcam images. All operations are per-pixel and highly parallelizable, for example applying bit masks or multiplying color values by constants, so I think I can gain performance by using SSE/SSE2 intrinsics.

However, I am having a data format problem. My webcam library gives me webcam frames as a pointer (void*) to a buffer containing 24- or 32-bit byte pixels in ABGR or BGR format. I have been casting these to char* so that ptr++ etc. behaves correctly. However, all the SSE/SSE2 operations expect either four integers or four floats, in the __m128 or __m64 data types. If I do this (assuming I have read the color values from the buffer into chars r, g, and b):

float pixel[] = {(float)r, (float)g, (float)b, 0.0f};

then load another float array full of constants

float constants[] = {0.299f, 0.587f, 0.114f, 0.0f};

cast both float pointers to __m128, and use the _mm_mul_ps intrinsic to do r * 0.299, g * 0.587, etc., there is no overall performance gain, because shuffling all the data around takes up so much time!

Does anyone have any suggestions for how I can load these byte pixel values quickly and efficiently into the SSE registers so that I actually get a performance gain from operating on them as such?

Best Answer

If you are willing to use MMX...

MMX gives you eight 64-bit registers, each of which can be treated as eight 8-bit values, just like the 8-bit values you're working with.

There's a good primer here.
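The same packed-byte idea also carries over to SSE2, which the question asks about: its integer intrinsics operate on the 128-bit __m128i type and can treat a register as sixteen 8-bit values, so the bytes never need converting to float. The following is only a minimal sketch of that approach, not code from the answer. The helper names mask_bytes and brighten_bytes are hypothetical, and it assumes the operation is the same for every byte of the buffer (so pixel layout does not matter). A scalar loop handles the tail and serves as the fallback on targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 integer intrinsics (__m128i) */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: AND every byte of buf with mask, in place. */
static void mask_bytes(uint8_t *buf, size_t n, uint8_t mask)
{
    size_t i = 0;
#if HAVE_SSE2
    /* Broadcast the mask into all 16 byte lanes. */
    const __m128i vmask = _mm_set1_epi8((char)mask);
    for (; i + 16 <= n; i += 16) {
        /* Unaligned load/store: the frame buffer's alignment is unknown. */
        __m128i px = _mm_loadu_si128((const __m128i *)(buf + i));
        px = _mm_and_si128(px, vmask);
        _mm_storeu_si128((__m128i *)(buf + i), px);
    }
#endif
    /* Scalar tail (and full fallback when SSE2 is unavailable). */
    for (; i < n; ++i)
        buf[i] &= mask;
}

/* Hypothetical helper: add delta to every byte, saturating at 255. */
static void brighten_bytes(uint8_t *buf, size_t n, uint8_t delta)
{
    size_t i = 0;
#if HAVE_SSE2
    const __m128i vdelta = _mm_set1_epi8((char)delta);
    for (; i + 16 <= n; i += 16) {
        __m128i px = _mm_loadu_si128((const __m128i *)(buf + i));
        /* Unsigned saturating add on sixteen 8-bit lanes at once. */
        px = _mm_adds_epu8(px, vdelta);
        _mm_storeu_si128((__m128i *)(buf + i), px);
    }
#endif
    for (; i < n; ++i) {
        unsigned sum = (unsigned)buf[i] + delta;
        buf[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

With a frame pointer cast to uint8_t*, a call such as mask_bytes(frame, width * height * 4, 0xF0) would process a 32-bit frame sixteen bytes per iteration. Per-channel work such as the weighted multiply in the question needs extra steps (widening bytes to 16-bit lanes and handling the channel interleaving), which this sketch does not attempt.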
