This is a submission for the 2025 Hacktoberfest Writing Challenge
This is the first time I heard about hacktoberfest - a month-long initiative in October that encourages developers around the world to contribute to open source projects. This looked interesting, so I thought I'd give it a go.
Some time before that I started learning more about SIMD (AVX512 specifically) and was looking for a project where I can apply this in practice. One of the projects I knew was odiff, which I learned about at the FunOCaml conference I went to last year.
This blog post summarises my contribution to odiff, by rewriting odiff core algorithm into AVX512 assembly.
odiff is a very fast pixel-by-pixel image comparison tool. It is comparing images in the YIQ colour space rather than the RGB space to better represent visual differences. It's a really cool project. Go check it out.
It was originally written in OCaml and recently it was rewritten in zig for better SIMD support.
Normally, CPU can only operate on a single memory location at a time (e.g. a byte, word, etc). This means that if we want to apply the same operation to large number of different memory locations, we need to do it in a loop, one element at a time. This is quite tedious and easy to get wrong. It also doesn't use all the power the CPU has.
There's a lot of workflows that could be optimised to do this is the hardware. Image processing, like in odiff, is one example. Other examples include audio and video processing. All of them essentially apply the same processing to a very large number of data samples.
Originally introduced as MMX extension for x86, but over time evolved to a large number of other extensions, with growing capabilities.
Each CPU has its own SIMD extensions: x86 has MMX, SSE and AVX, ARM has NEON, and RISC-V has RVV.
AVX is a variant of x86 SIMD that operates on 256 bits of data at a time.
AVX512 is a evolution of AVX that operates on 512 bit of data. It's really awesome.
512 bits is 64 bytes, which for RGBA image data means we can operate on 16 pixels at a time in one instruction.
If you ever used array programming languages, AVX512 kind of feels similar (though of course it's definitely lower level). You always operate on whole arrays, not single elements. This requires a shift in the way you think about doing things, but it's very rewarding and can give you some nice performance boost.
This lead to the creation of vxdiff, a pure assembly AVX512 based reimplementation of the core odiff algorithm. It only implements the default set of odiff options (and odiff has some interesting options to tweak its behaviour), but it's still a capable tool. It's also easier to start simple rather than trying to implement all the possible options right away.
It was a really fun thing to do, and I learnt a lot about AVX
I will now explain the odiff core algorithm after translation to AVX. This might be a bit chaotic, because I was writing this in a bit of a hurry, so bear with me.
The full source is here, you can follow along.
rdi and rsi point to image data of the base and other images. After executing xmm1 and xmm2 will have 64 bytes of loaded data, which we can now operate on.vmovdqu8 xmm1, [rdi]
vmovdqu8 xmm2, [rsi]
0 with white. First we clear k6 and k7 mask registers with kxor. Then we check alpha channel to see which values are equal to 0 and store the result into k6 mask using vpcmpequb. At the indices where alpha was 0, mask will equal 1 and for all other indices it will be 0. Then we propagate that information to other channels (red, green and blue) using kshiftlb and kor. Then we store 255 at places where resulting mask is 1. We do this twice - once for the base image and second time for the image we compare against.kxor k6, k6, k6
kxor k7, k7, k7
vpcmpequb k6 {k4}, xmm1, xmm0
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
vmovdqu8 xmm1 {k7}, xmm31
vpmovzxbd and then from dwords to floats using vcvtudq2ps.vpmovzxbd zmm1, xmm1
vcvtudq2ps zmm1, zmm1
vpmovzxbd zmm2, xmm2
vcvtudq2ps zmm2, zmm2
0-255 range to 0.0 - 1.0 range. k4 mask apply the vmulps operation (which does multiplication) only to alpha channel.vmulps zmm1 {k4}, zmm1, zmm3
vmulps zmm2 {k4}, zmm2, zmm3
(RGB - 255.0) * A + 255.0. I'm not actually sure why it's computed this way, but this is what odiff used to do in the previous version. k5 apply the operations only to red, green and blue channels, leaving alpha channel unmodified.vsubps zmm1 {k5}, zmm1, zmm30
vshufps zmm10, zmm1, zmm1, 0xff
vmulps zmm1 {k5}, zmm1, zmm10
vaddps zmm1 {k5}, zmm1, zmm30
zmm7, zmm8 and zmm9 to multiply to with RGB values to get YIQ values with vmulps. That will give us 3 separate vectors with Y, I and Q channels. Then we shuffle them around and add them up to get YIQ vector.vmulps zmm10, zmm1, zmm7 ; y
vmulps zmm11, zmm1, zmm8 ; i
vmulps zmm12, zmm1, zmm9 ; q
; yiq(R)
vxorps zmm13, zmm13, zmm13
vshufps zmm13 {k1}, zmm10, zmm10, 0b00000000
vshufps zmm13 {k2}, zmm11, zmm11, 0b00000000
vshufps zmm13 {k3}, zmm12, zmm12, 0b00000000
; yiq(G)
vxorps zmm14, zmm14, zmm14
vshufps zmm14 {k1}, zmm10, zmm10, 0b00000001
vshufps zmm14 {k2}, zmm11, zmm11, 0b00000100
vshufps zmm14 {k3}, zmm12, zmm12, 0b00010000
; yiq(B)
vxorps zmm15, zmm15, zmm15
vshufps zmm15 {k1}, zmm10, zmm10, 0b00000010
vshufps zmm15 {k2}, zmm11, zmm11, 0b00001000
vshufps zmm15 {k3}, zmm12, zmm12, 0b00100000
; yiq
vaddps zmm16, zmm13, zmm14
vaddps zmm16, zmm16, zmm15
YIQ of base image in zmm16 and YIQ of the other image in zmm26. We take the difference of the two using vsubps, square the result and multiply by delta coefficient.; YIQ diff
vsubps zmm16, zmm16, zmm26
; YIQ*YIQ
vmulps zmm16, zmm16, zmm16
; YIQ*YIQ * delta coef
vmulps zmm16, zmm16, zmm29
zmm16 vector, so we need to sum those channels into a single one. This way even there's difference in both channel Y and I, it will be counted as a single difference.vxorps zmm17, zmm17, zmm17
vxorps zmm18, zmm18, zmm18
vxorps zmm19, zmm19, zmm19
vshufps zmm17 {k1}, zmm16, zmm16, 0b10101010
vshufps zmm18 {k1}, zmm16, zmm16, 0b01010101
vshufps zmm19 {k1}, zmm16, zmm16, 0b00000000
; delta
vaddps zmm16, zmm19, zmm18
vaddps zmm16, zmm16, zmm17
xmm28, which gives boolean mask in k6, which we then treat as an array with 1 or 0 values and sum it up. The sum is the number of pixels that differ when comparing base image against the other image.vcompressps zmm16 {k1}, zmm16
vcmpgtps k6, xmm16, xmm28
kmov eax, k6
popcnt eax, eax
vxdiff is definitely an interesting project, but its usefulness is limited when compared to odiff, given all its options. Fortunately, vxdiff is now part of odiff. To activate it, you need to specify --enable-asm when invoking odiff.
Shout out to Dmitriy Kovalenko and the organisers of FunOCaml, because as strange as it may seem, this project wouldn't have happened if FunOCaml hadn't been organised.