SIMD Optimizations

compressionz uses SIMD (Single Instruction, Multiple Data) optimizations to achieve high performance in its pure Zig implementations.

How SIMD Works

SIMD allows processing multiple bytes in a single CPU instruction:

Scalar (1 byte at a time):
  compare byte 0, compare byte 1, compare byte 2, ...

SIMD (16 bytes at a time):
  compare 16 bytes simultaneously

This is critical for compression where we constantly:

Compare sequences for matches
Copy data blocks
Search for patterns

Zig’s SIMD Support

Zig provides portable SIMD via @Vector:

// 16-byte vector
const Vec16 = @Vector(16, u8);

// Load 16 bytes
const a: Vec16 = data[0..16].*;

// Compare 16 bytes at once
const b: Vec16 = other[0..16].*;
const eq = a == b;  // 16 comparisons in 1 instruction

// Convert to bitmask
const mask: u16 = @bitCast(eq);

This compiles to native SIMD on supported platforms:

x86/x64: SSE2, AVX2
ARM: NEON
Falls back to scalar on unsupported platforms

LZ4 SIMD Optimizations

Match Extension

When we find a match, we need to extend it as far as possible:

// Scalar: Check one byte at a time
fn extendMatchScalar(src: []const u8, pos: usize, match_pos: usize) usize {
    var len: usize = 0;
    while (pos + len < src.len and
           match_pos + len < pos and
           src[pos + len] == src[match_pos + len])
    {
        len += 1;
    }
    return len;
}

// SIMD: Check 16 bytes at a time
fn extendMatchSimd(src: []const u8, pos: usize, match_pos: usize) usize {
    var len: usize = 0;

    // Process 16 bytes at a time
    while (pos + len + 16 <= src.len and match_pos + len + 16 <= pos) {
        const v1: @Vector(16, u8) = src[pos + len ..][0..16].*;
        const v2: @Vector(16, u8) = src[match_pos + len ..][0..16].*;

        const eq = v1 == v2;
        const mask: u16 = @bitCast(eq);

        if (mask == 0xFFFF) {
            // All 16 bytes match
            len += 16;
        } else {
            // Some bytes don't match - find first mismatch
            len += @ctz(~mask);
            return len;
        }
    }

    // Handle remaining bytes scalar
    while (pos + len < src.len and
           match_pos + len < pos and
           src[pos + len] == src[match_pos + len])
    {
        len += 1;
    }

    return len;
}

Speedup: ~4-8× for long matches.

Fast Copy

Copying matched data with overlapping regions:

// Scalar copy
fn copyScalar(dest: []u8, src: []const u8, len: usize) void {
    for (0..len) |i| {
        dest[i] = src[i];
    }
}

// SIMD copy (when offset >= 8)
fn copySimd(dest: []u8, src: []const u8, len: usize) void {
    var i: usize = 0;

    // Copy 8 bytes at a time
    while (i + 8 <= len) {
        const chunk: @Vector(8, u8) = src[i..][0..8].*;
        dest[i..][0..8].* = @as([8]u8, chunk);
        i += 8;
    }

    // Remaining bytes
    while (i < len) {
        dest[i] = src[i];
        i += 1;
    }
}

Note: For overlapping copies (offset < 8), we must use scalar copy to preserve semantics.

Snappy SIMD Optimizations

Hash Calculation

Snappy uses a hash table for finding matches:

// Hash 4 bytes for match finding
fn hash4(data: []const u8) u32 {
    const v = std.mem.readInt(u32, data[0..4], .little);
    return (v *% 0x1e35a7bd) >> (32 - HASH_BITS);
}

Match Finding

Similar to LZ4, using 16-byte comparisons:

fn findMatch(src: []const u8, pos: usize, candidate: usize) ?Match {
    // Quick 4-byte check first
    if (std.mem.readInt(u32, src[pos..][0..4], .little) !=
        std.mem.readInt(u32, src[candidate..][0..4], .little))
    {
        return null;
    }

    // Extend match using SIMD
    const len = extendMatchSimd(src, pos + 4, candidate + 4) + 4;

    return Match{ .offset = pos - candidate, .length = len };
}

Compiler Optimizations

Alignment

Aligned loads are faster:

// Tell compiler about alignment
const aligned_ptr: [*]align(16) const u8 = @alignCast(data.ptr);
const vec: @Vector(16, u8) = aligned_ptr[0..16].*;

Loop Unrolling

The compiler often unrolls SIMD loops automatically, but we can hint:

comptime var i: usize = 0;
inline while (i < 64) : (i += 16) {
    const v: @Vector(16, u8) = src[i..][0..16].*;
    // Process...
}

Prefetching

For streaming access patterns:

// Prefetch data we'll need soon
@prefetch(src.ptr + 256, .{ .rw = .read, .locality = 3 });

Benchmark: SIMD vs Scalar

LZ4 compression on 1 MB data:

Implementation	Throughput	Speedup
Scalar	8 GB/s	1×
SIMD (match extend)	24 GB/s	3×
SIMD (match + copy)	36 GB/s	4.5×

The difference is substantial for compressible data with many matches.

Platform Considerations

x86/x64

SSE2 is baseline (always available)
AVX2 available on modern CPUs (wider vectors)
Zig @Vector uses SSE2 by default

ARM

NEON is widely available (ARMv7+, all ARM64)
Zig @Vector maps to NEON on ARM

WebAssembly

SIMD128 is a WebAssembly extension
Zig supports WASM SIMD
Fallback to scalar on browsers without SIMD

Fallback

On platforms without SIMD, @Vector operations compile to scalar loops:

// This works everywhere, just slower without SIMD
const v1: @Vector(16, u8) = data1[0..16].*;
const v2: @Vector(16, u8) = data2[0..16].*;
const eq = v1 == v2;

Code Organization

const SIMD_WIDTH = 16;
const SimdVec = @Vector(SIMD_WIDTH, u8);

/// Extend match using SIMD when possible
fn extendMatch(src: []const u8, pos: usize, match_pos: usize) usize {
    var len: usize = 0;

    // SIMD path for bulk comparison
    while (pos + len + SIMD_WIDTH <= src.len) {
        const v1: SimdVec = src[pos + len ..][0..SIMD_WIDTH].*;
        const v2: SimdVec = src[match_pos + len ..][0..SIMD_WIDTH].*;

        if (@reduce(.And, v1 == v2)) {
            len += SIMD_WIDTH;
        } else {
            const mask: u16 = @bitCast(v1 == v2);
            return len + @ctz(~mask);
        }
    }

    // Scalar fallback for remainder
    while (pos + len < src.len and src[pos + len] == src[match_pos + len]) {
        len += 1;
    }

    return len;
}

Future Optimizations

Potential improvements:

AVX-512 support — 64-byte vectors on supported CPUs
Adaptive selection — Choose SIMD width based on data
ARM SVE — Scalable vectors on modern ARM
Parallel hash — SIMD hash table operations

Summary

Zig’s @Vector provides portable SIMD
Match extension benefits most from SIMD (4-8×)
Fast copy with SIMD improves decompression
Falls back gracefully on unsupported platforms
~4× overall speedup on compressible data