Comparing images with AVX

This is a submission for the 2025 Hacktoberfest Writing Challenge

This is the first time I heard about hacktoberfest - a month-long initiative in October that encourages developers around the world to contribute to open source projects. This looked interesting, so I thought I'd give it a go.

Some time before that I started learning more about SIMD (AVX512 specifically) and was looking for a project where I can apply this in practice. One of the projects I knew was odiff, which I learned about at the FunOCaml conference I went to last year.

This blog post summarises my contribution to odiff, by rewriting odiff core algorithm into AVX512 assembly.

odiff

odiff is a very fast pixel-by-pixel image comparison tool. It is comparing images in the YIQ colour space rather than the RGB space to better represent visual differences. It's a really cool project. Go check it out.

It was originally written in OCaml and recently it was rewritten in zig for better SIMD support.

SIMD

Normally, CPU can only operate on a single memory location at a time (e.g. a byte, word, etc). This means that if we want to apply the same operation to large number of different memory locations, we need to do it in a loop, one element at a time. This is quite tedious and easy to get wrong. It also doesn't use all the power the CPU has.

There's a lot of workflows that could be optimised to do this is the hardware. Image processing, like in odiff, is one example. Other examples include audio and video processing. All of them essentially apply the same processing to a very large number of data samples.

Originally introduced as MMX extension for x86, but over time evolved to a large number of other extensions, with growing capabilities.
Each CPU has its own SIMD extensions: x86 has MMX, SSE and AVX, ARM has NEON, and RISC-V has RVV.

AVX512

AVX is a variant of x86 SIMD that operates on 256 bits of data at a time.
AVX512 is a evolution of AVX that operates on 512 bit of data. It's really awesome.

512 bits is 64 bytes, which for RGBA image data means we can operate on 16 pixels at a time in one instruction.

If you ever used array programming languages, AVX512 kind of feels similar (though of course it's definitely lower level). You always operate on whole arrays, not single elements. This requires a shift in the way you think about doing things, but it's very rewarding and can give you some nice performance boost.

vxdiff

This lead to the creation of vxdiff, a pure assembly AVX512 based reimplementation of the core odiff algorithm. It only implements the default set of odiff options (and odiff has some interesting options to tweak its behaviour), but it's still a capable tool. It's also easier to start simple rather than trying to implement all the possible options right away.

It was a really fun thing to do, and I learnt a lot about AVX

The algorithm

I will now explain the odiff core algorithm after translation to AVX. This might be a bit chaotic, because I was writing this in a bit of a hurry, so bear with me.

The full source is here, you can follow along.

First the input images are loaded into memory using RGBA8 format, which means that there's 3 colour channels (red, green, blue) and an alpha channel, each having 8 bits.
Once we've got the image data, the actual processing loop starts at .y_loop label, which iterates over image rows. Each row is handled by .x_loop label, which is a loop that loads 512bit of data at a time. rdi and rsi point to image data of the base and other images. After executing xmm1 and xmm2 will have 64 bytes of loaded data, which we can now operate on.

vmovdqu8 xmm1, [rdi]
vmovdqu8 xmm2, [rsi]

Now, we want to replace pixels having alpha equal to 0 with white. First we clear k6 and k7 mask registers with kxor. Then we check alpha channel to see which values are equal to 0 and store the result into k6 mask using vpcmpequb. At the indices where alpha was 0, mask will equal 1 and for all other indices it will be 0. Then we propagate that information to other channels (red, green and blue) using kshiftlb and kor. Then we store 255 at places where resulting mask is 1. We do this twice - once for the base image and second time for the image we compare against.

kxor k6, k6, k6              
kxor k7, k7, k7              
vpcmpequb k6 {k4}, xmm1, xmm0
kshiftlb k6, k6, 1           
kor k7, k7, k6               
kshiftlb k6, k6, 1           
kor k7, k7, k6               
kshiftlb k6, k6, 1           
kor k7, k7, k6               
vmovdqu8 xmm1 {k7}, xmm31

Next, we want to do some multiplications, so we'll convert bytes to floats. First we convert from bytes to dwords using vpmovzxbd and then from dwords to floats using vcvtudq2ps.

vpmovzxbd zmm1, xmm1  
vcvtudq2ps zmm1, zmm1 
vpmovzxbd zmm2, xmm2  
vcvtudq2ps zmm2, zmm2

Normalise alpha from initial 0-255 range to 0.0 - 1.0 range. k4 mask apply the vmulps operation (which does multiplication) only to alpha channel.

vmulps zmm1 {k4}, zmm1, zmm3
vmulps zmm2 {k4}, zmm2, zmm3

Do alpha-blending, so blend red, green and blue channels with white pixel using alpha channel. This essentially compute (RGB - 255.0) * A + 255.0. I'm not actually sure why it's computed this way, but this is what odiff used to do in the previous version. k5 apply the operations only to red, green and blue channels, leaving alpha channel unmodified.

vsubps zmm1 {k5}, zmm1, zmm30  
vshufps zmm10, zmm1, zmm1, 0xff
vmulps zmm1 {k5}, zmm1, zmm10  
vaddps zmm1 {k5}, zmm1, zmm30

Now we start RGB to YIQ conversion. We use precomputed matrix in zmm7, zmm8 and zmm9 to multiply to with RGB values to get YIQ values with vmulps. That will give us 3 separate vectors with Y, I and Q channels. Then we shuffle them around and add them up to get YIQ vector.
We will use this to compute the difference between base image and the image we are comparing against.

vmulps zmm10, zmm1, zmm7 ; y
vmulps zmm11, zmm1, zmm8 ; i
vmulps zmm12, zmm1, zmm9 ; q

; yiq(R)                                    
vxorps zmm13, zmm13, zmm13                  
vshufps zmm13 {k1}, zmm10, zmm10, 0b00000000
vshufps zmm13 {k2}, zmm11, zmm11, 0b00000000
vshufps zmm13 {k3}, zmm12, zmm12, 0b00000000
; yiq(G)                                    
vxorps zmm14, zmm14, zmm14                  
vshufps zmm14 {k1}, zmm10, zmm10, 0b00000001
vshufps zmm14 {k2}, zmm11, zmm11, 0b00000100
vshufps zmm14 {k3}, zmm12, zmm12, 0b00010000
; yiq(B)                                    
vxorps zmm15, zmm15, zmm15                  
vshufps zmm15 {k1}, zmm10, zmm10, 0b00000010
vshufps zmm15 {k2}, zmm11, zmm11, 0b00001000
vshufps zmm15 {k3}, zmm12, zmm12, 0b00100000
; yiq                     
vaddps zmm16, zmm13, zmm14
vaddps zmm16, zmm16, zmm15

The result is that we have YIQ of base image in zmm16 and YIQ of the other image in zmm26. We take the difference of the two using vsubps, square the result and multiply by delta coefficient.

; YIQ diff                
vsubps zmm16, zmm16, zmm26
                          
; YIQ*YIQ                 
vmulps zmm16, zmm16, zmm16
; YIQ*YIQ * delta coef    
vmulps zmm16, zmm16, zmm29

Now we have squares of differences of channels in zmm16 vector, so we need to sum those channels into a single one. This way even there's difference in both channel Y and I, it will be counted as a single difference.

vxorps zmm17, zmm17, zmm17                  
vxorps zmm18, zmm18, zmm18                  
vxorps zmm19, zmm19, zmm19                  
vshufps zmm17 {k1}, zmm16, zmm16, 0b10101010
vshufps zmm18 {k1}, zmm16, zmm16, 0b01010101
vshufps zmm19 {k1}, zmm16, zmm16, 0b00000000
                                            
; delta                                     
vaddps zmm16, zmm19, zmm18                  
vaddps zmm16, zmm16, zmm17

Finally, we count the number of different pixels. We compare the resulting values against maximum allowed delta in xmm28, which gives boolean mask in k6, which we then treat as an array with 1 or 0 values and sum it up. The sum is the number of pixels that differ when comparing base image against the other image.

vcompressps zmm16 {k1}, zmm16
vcmpgtps k6, xmm16, xmm28    
kmov eax, k6                 
popcnt eax, eax

Back to odiff

vxdiff is definitely an interesting project, but its usefulness is limited when compared to odiff, given all its options. Fortunately, vxdiff is now part of odiff. To activate it, you need to specify --enable-asm when invoking odiff.

That's all folks

Shout out to Dmitriy Kovalenko and the organisers of FunOCaml, because as strange as it may seem, this project wouldn't have happened if FunOCaml hadn't been organised.