Skip to content

SIMD Optimizations

compressionz uses SIMD (Single Instruction, Multiple Data) optimizations to achieve high performance in its pure Zig implementations.

SIMD allows processing multiple bytes in a single CPU instruction:

Scalar (1 byte at a time):
compare byte 0, compare byte 1, compare byte 2, ...
SIMD (16 bytes at a time):
compare 16 bytes simultaneously

This is critical for compression where we constantly:

  • Compare sequences for matches
  • Copy data blocks
  • Search for patterns

Zig provides portable SIMD via @Vector:

// 16-byte vector
const Vec16 = @Vector(16, u8);
// Load 16 bytes
const a: Vec16 = data[0..16].*;
// Compare 16 bytes at once
const b: Vec16 = other[0..16].*;
const eq = a == b; // 16 comparisons in 1 instruction
// Convert to bitmask
const mask: u16 = @bitCast(eq);

This compiles to native SIMD on supported platforms:

  • x86/x64: SSE2, AVX2
  • ARM: NEON
  • Falls back to scalar on unsupported platforms

When we find a match, we need to extend it as far as possible:

// Scalar: Check one byte at a time
fn extendMatchScalar(src: []const u8, pos: usize, match_pos: usize) usize {
var len: usize = 0;
while (pos + len < src.len and
match_pos + len < pos and
src[pos + len] == src[match_pos + len])
{
len += 1;
}
return len;
}
// SIMD: Check 16 bytes at a time
fn extendMatchSimd(src: []const u8, pos: usize, match_pos: usize) usize {
var len: usize = 0;
// Process 16 bytes at a time
while (pos + len + 16 <= src.len and match_pos + len + 16 <= pos) {
const v1: @Vector(16, u8) = src[pos + len ..][0..16].*;
const v2: @Vector(16, u8) = src[match_pos + len ..][0..16].*;
const eq = v1 == v2;
const mask: u16 = @bitCast(eq);
if (mask == 0xFFFF) {
// All 16 bytes match
len += 16;
} else {
// Some bytes don't match - find first mismatch
len += @ctz(~mask);
return len;
}
}
// Handle remaining bytes scalar
while (pos + len < src.len and
match_pos + len < pos and
src[pos + len] == src[match_pos + len])
{
len += 1;
}
return len;
}

Speedup: ~4-8× for long matches.

Copying matched data with overlapping regions:

// Scalar copy
fn copyScalar(dest: []u8, src: []const u8, len: usize) void {
for (0..len) |i| {
dest[i] = src[i];
}
}
// SIMD copy (when offset >= 8)
fn copySimd(dest: []u8, src: []const u8, len: usize) void {
var i: usize = 0;
// Copy 8 bytes at a time
while (i + 8 <= len) {
const chunk: @Vector(8, u8) = src[i..][0..8].*;
dest[i..][0..8].* = @as([8]u8, chunk);
i += 8;
}
// Remaining bytes
while (i < len) {
dest[i] = src[i];
i += 1;
}
}

Note: For overlapping copies (offset < 8), we must use scalar copy to preserve semantics.

Snappy uses a hash table for finding matches:

// Hash 4 bytes for match finding
fn hash4(data: []const u8) u32 {
const v = std.mem.readInt(u32, data[0..4], .little);
return (v *% 0x1e35a7bd) >> (32 - HASH_BITS);
}

Similar to LZ4, using 16-byte comparisons:

fn findMatch(src: []const u8, pos: usize, candidate: usize) ?Match {
// Quick 4-byte check first
if (std.mem.readInt(u32, src[pos..][0..4], .little) !=
std.mem.readInt(u32, src[candidate..][0..4], .little))
{
return null;
}
// Extend match using SIMD
const len = extendMatchSimd(src, pos + 4, candidate + 4) + 4;
return Match{ .offset = pos - candidate, .length = len };
}

Aligned loads are faster:

// Tell compiler about alignment
const aligned_ptr: [*]align(16) const u8 = @alignCast(data.ptr);
const vec: @Vector(16, u8) = aligned_ptr[0..16].*;

The compiler often unrolls SIMD loops automatically, but we can hint:

comptime var i: usize = 0;
inline while (i < 64) : (i += 16) {
const v: @Vector(16, u8) = src[i..][0..16].*;
// Process...
}

For streaming access patterns:

// Prefetch data we'll need soon
@prefetch(src.ptr + 256, .{ .rw = .read, .locality = 3 });

LZ4 compression on 1 MB data:

ImplementationThroughputSpeedup
Scalar8 GB/s
SIMD (match extend)24 GB/s
SIMD (match + copy)36 GB/s4.5×

The difference is substantial for compressible data with many matches.

  • SSE2 is baseline (always available)
  • AVX2 available on modern CPUs (wider vectors)
  • Zig @Vector uses SSE2 by default
  • NEON is widely available (ARMv7+, all ARM64)
  • Zig @Vector maps to NEON on ARM
  • SIMD128 is a WebAssembly extension
  • Zig supports WASM SIMD
  • Fallback to scalar on browsers without SIMD

On platforms without SIMD, @Vector operations compile to scalar loops:

// This works everywhere, just slower without SIMD
const v1: @Vector(16, u8) = data1[0..16].*;
const v2: @Vector(16, u8) = data2[0..16].*;
const eq = v1 == v2;
lz4/block.zig
const SIMD_WIDTH = 16;
const SimdVec = @Vector(SIMD_WIDTH, u8);
/// Extend match using SIMD when possible
fn extendMatch(src: []const u8, pos: usize, match_pos: usize) usize {
var len: usize = 0;
// SIMD path for bulk comparison
while (pos + len + SIMD_WIDTH <= src.len) {
const v1: SimdVec = src[pos + len ..][0..SIMD_WIDTH].*;
const v2: SimdVec = src[match_pos + len ..][0..SIMD_WIDTH].*;
if (@reduce(.And, v1 == v2)) {
len += SIMD_WIDTH;
} else {
const mask: u16 = @bitCast(v1 == v2);
return len + @ctz(~mask);
}
}
// Scalar fallback for remainder
while (pos + len < src.len and src[pos + len] == src[match_pos + len]) {
len += 1;
}
return len;
}

Potential improvements:

  1. AVX-512 support — 64-byte vectors on supported CPUs
  2. Adaptive selection — Choose SIMD width based on data
  3. ARM SVE — Scalable vectors on modern ARM
  4. Parallel hash — SIMD hash table operations
  • Zig’s @Vector provides portable SIMD
  • Match extension benefits most from SIMD (4-8×)
  • Fast copy with SIMD improves decompression
  • Falls back gracefully on unsupported platforms
  • ~4× overall speedup on compressible data