Dictionary Compression

Dictionary compression dramatically improves compression ratios for small data with known patterns. This is especially useful for APIs, message formats, and structured data.

Without a dictionary, compressors build patterns from scratch for each input. With a dictionary, they start with pre-computed patterns.

Codec     Dictionary Support
Zstd      ✅ Full support
Zlib      ✅ Full support
Deflate   ✅ Full support
LZ4       ❌ Not supported
Snappy    ❌ Not supported
Gzip      ❌ Not supported
Brotli    ❌ Not supported (uses its own built-in static dictionary)
Basic usage:

const cz = @import("compressionz");

// Dictionary containing common patterns
const dictionary = @embedFile("my_dictionary.bin");

// Compress with the dictionary
const compressed = try cz.compressWithOptions(.zstd, data, allocator, .{
    .dictionary = dictionary,
});
defer allocator.free(compressed);

// Decompress with the SAME dictionary
const decompressed = try cz.decompressWithOptions(.zstd, compressed, allocator, .{
    .dictionary = dictionary, // Must match!
});
defer allocator.free(decompressed);

Small data compresses poorly because:

  • Too few bytes to find repeated patterns
  • Entropy coders (Huffman tables) need enough input to pay for themselves
  • No prior context to exploit

A dictionary provides:

  • Pre-computed common patterns
  • Ready-to-use Huffman/entropy codes
  • Instant context for compression
Data Size   Without Dict      With Dict   Improvement
100 B       105 B (larger!)   45 B        57% smaller
500 B       420 B             180 B       57% smaller
1 KB        780 B             380 B       51% smaller
5 KB        3.2 KB            1.9 KB      41% smaller
50 KB       28 KB             24 KB       14% smaller

Key insight: Dictionary compression is most effective for small data (< 10 KB).
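
To check the effect on your own payloads, compare output sizes with and without the dictionary. A minimal sketch using the compressWithOptions API from above (reportSavings is a hypothetical helper, and using an empty options struct as the no-dictionary baseline is an assumption):

const std = @import("std");
const cz = @import("compressionz");

// Hypothetical helper: print compressed sizes with and without a dictionary.
fn reportSavings(allocator: std.mem.Allocator, dictionary: []const u8, sample: []const u8) !void {
    // Baseline: same codec and input, no dictionary (assumed default options)
    const plain = try cz.compressWithOptions(.zstd, sample, allocator, .{});
    defer allocator.free(plain);

    const with_dict = try cz.compressWithOptions(.zstd, sample, allocator, .{
        .dictionary = dictionary,
    });
    defer allocator.free(with_dict);

    std.debug.print("without dict: {d} B, with dict: {d} B\n", .{ plain.len, with_dict.len });
}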

For simple cases, create a dictionary with common patterns:

const json_dictionary =
    \\{"id":,"name":,"email":,"status":,"created_at":
    \\,"updated_at":,"type":"user","type":"admin"
    \\,"active":true,"active":false,"error":null
    \\,"message":"success","message":"error"
;

const compressed = try cz.compressWithOptions(.zstd, json_data, allocator, .{
    .dictionary = json_dictionary,
});

For best results, train a dictionary on representative samples:

# Collect 1000+ representative samples
ls samples/*.json > files.txt
# Train dictionary (32 KB is a good size)
zstd --train --maxdict=32768 -o my_dictionary.bin $(cat files.txt)

Then use in Zig:

const dictionary = @embedFile("my_dictionary.bin");

const compressed = try cz.compressWithOptions(.zstd, data, allocator, .{
    .dictionary = dictionary,
});
Use Case             Recommended Size
JSON API responses   16-32 KB
Log messages         32-64 KB
Protocol buffers     8-16 KB
HTML templates       64-128 KB

Larger dictionaries provide diminishing returns and increase memory usage.
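
If the dictionary is embedded at build time, a comptime guard can catch an oversized file before it ships. A sketch, with the 32 KB budget chosen to match the table above:

const std = @import("std");

const api_dict = @embedFile("api_dictionary.bin");

comptime {
    // Fail the build if the dictionary outgrows its size budget
    std.debug.assert(api_dict.len <= 32 * 1024);
}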

Compressing API responses with a shared, pre-loaded dictionary:

const std = @import("std");
const cz = @import("compressionz");

// Pre-loaded dictionary for API responses
const api_dict = @embedFile("api_dictionary.bin");

pub fn compressResponse(data: []const u8, allocator: std.mem.Allocator) ![]u8 {
    return cz.compressWithOptions(.zstd, data, allocator, .{
        .dictionary = api_dict,
    });
}

pub fn decompressRequest(data: []const u8, allocator: std.mem.Allocator) ![]u8 {
    return cz.decompressWithOptions(.zstd, data, allocator, .{
        .dictionary = api_dict,
    });
}
For message streams, bundle the dictionary and allocator into a reusable compressor:

const std = @import("std");
const cz = @import("compressionz");

const MessageCompressor = struct {
    dictionary: []const u8,
    allocator: std.mem.Allocator,

    pub fn compress(self: *MessageCompressor, message: []const u8) ![]u8 {
        return cz.compressWithOptions(.zstd, message, self.allocator, .{
            .dictionary = self.dictionary,
        });
    }

    pub fn decompress(self: *MessageCompressor, compressed: []const u8) ![]u8 {
        return cz.decompressWithOptions(.zstd, compressed, self.allocator, .{
            .dictionary = self.dictionary,
        });
    }
};
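
Usage is then one struct initialization, reused for every message (the dictionary file name here is hypothetical):

var compressor = MessageCompressor{
    .dictionary = @embedFile("msg_dictionary.bin"),
    .allocator = allocator,
};

const payload = try compressor.compress(message);
defer allocator.free(payload);
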
For database records, compress values with the database's dictionary before storing:

const std = @import("std");
const cz = @import("compressionz");

pub fn storeRecord(db: *Database, key: []const u8, value: []const u8, allocator: std.mem.Allocator) !void {
    const compressed = try cz.compressWithOptions(.zstd, value, allocator, .{
        .dictionary = db.compression_dict,
    });
    defer allocator.free(compressed);
    try db.put(key, compressed);
}

pub fn loadRecord(db: *Database, key: []const u8, allocator: std.mem.Allocator) !?[]u8 {
    const compressed = db.get(key) orelse return null;
    return cz.decompressWithOptions(.zstd, compressed, allocator, .{
        .dictionary = db.compression_dict,
    });
}

Critical: Decompression requires the exact same dictionary used for compression.

Pin each dictionary to an explicit version and never change a published one:

const std = @import("std");
const cz = @import("compressionz");

// Never change this dictionary (breaks existing data)
const DICTIONARY_V1 = @embedFile("dict_v1.bin");

// A namespace of versioned dictionaries
const dictionaries = struct {
    const v1: []const u8 = @embedFile("dict_v1.bin");
    const v2: []const u8 = @embedFile("dict_v2.bin");
    const v3: []const u8 = @embedFile("dict_v3.bin");
};

pub fn decompress(data: []const u8, version: u8, allocator: std.mem.Allocator) ![]u8 {
    const dict = switch (version) {
        1 => dictionaries.v1,
        2 => dictionaries.v2,
        3 => dictionaries.v3,
        else => return error.UnknownDictionaryVersion,
    };
    return cz.decompressWithOptions(.zstd, data, allocator, .{
        .dictionary = dict,
    });
}
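
To keep that version with the payload, one option is a single version byte in front of the compressed bytes. A sketch of such framing (this byte layout is an assumption, not part of the library):

const std = @import("std");
const cz = @import("compressionz");

// Hypothetical framing: [version byte][compressed payload]
pub fn compressVersioned(data: []const u8, version: u8, dict: []const u8, allocator: std.mem.Allocator) ![]u8 {
    const compressed = try cz.compressWithOptions(.zstd, data, allocator, .{
        .dictionary = dict,
    });
    defer allocator.free(compressed);

    const framed = try allocator.alloc(u8, compressed.len + 1);
    framed[0] = version;
    @memcpy(framed[1..], compressed);
    return framed;
}

pub fn decompressVersioned(framed: []const u8, allocator: std.mem.Allocator) ![]u8 {
    if (framed.len < 1) return error.InvalidData;
    // Reuses the version-switching decompress shown above
    return decompress(framed[1..], framed[0], allocator);
}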

Zstd dictionaries include a 32-bit ID:

const std = @import("std");

fn getDictionaryId(dict: []const u8) u32 {
    // Trained Zstd dictionaries start with a 4-byte magic (0xEC30A437),
    // followed by the 32-bit dictionary ID at offset 4
    if (dict.len < 8) return 0;
    return std.mem.readInt(u32, dict[4..8], .little);
}

pub fn selectDictionary(compressed: []const u8) ?[]const u8 {
    // Parse the Zstd frame header to get the dictionary ID
    // (see the sketch below), then match it against known dictionaries
    // ...
    _ = compressed;
    return null;
}
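
The frame-header parse can look like the following sketch, based on the frame layout in RFC 8878 (frameDictionaryId is a hypothetical helper and assumes a standard, non-skippable frame):

const std = @import("std");

// Extract the dictionary ID from a Zstd frame header (RFC 8878)
fn frameDictionaryId(frame: []const u8) ?u32 {
    if (frame.len < 5) return null;
    // Frame magic number: 0xFD2FB528, little-endian
    if (std.mem.readInt(u32, frame[0..4], .little) != 0xFD2FB528) return null;

    const descriptor = frame[4];
    const did_flag: u2 = @truncate(descriptor); // bits 0-1: Dictionary_ID_flag
    if (did_flag == 0) return null; // frame carries no dictionary ID

    // A window-descriptor byte is present unless Single_Segment_flag (bit 5) is set
    var offset: usize = 5;
    if (descriptor & 0x20 == 0) offset += 1;

    const did_len: usize = switch (did_flag) {
        1 => 1,
        2 => 2,
        3 => 4,
        else => unreachable,
    };
    if (frame.len < offset + did_len) return null;

    return switch (did_flag) {
        1 => @as(u32, frame[offset]),
        2 => @as(u32, std.mem.readInt(u16, frame[offset..][0..2], .little)),
        3 => std.mem.readInt(u32, frame[offset..][0..4], .little),
        else => unreachable,
    };
}
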
Best practices:

  • ✅ Train on representative samples
  • ✅ Keep dictionaries versioned
  • ✅ Store the dictionary ID with compressed data
  • ✅ Test decompression with the dictionary (see the test sketch after this list)
  • ✅ Use for small, structured data
  • ❌ Change dictionaries after deployment
  • ❌ Use random data as dictionary
  • ❌ Over-size dictionaries (diminishing returns)
  • ❌ Use for large data (> 50 KB)
  • ❌ Forget to include dictionary in deployment
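
The "test decompression" item can be an ordinary Zig test block; a sketch against the API above (file names are placeholders):

const std = @import("std");
const cz = @import("compressionz");

test "dictionary round-trip" {
    const allocator = std.testing.allocator;
    const dict = @embedFile("my_dictionary.bin");
    const original = "{\"id\":1,\"status\":\"active\"}";

    const compressed = try cz.compressWithOptions(.zstd, original, allocator, .{
        .dictionary = dict,
    });
    defer allocator.free(compressed);

    const decompressed = try cz.decompressWithOptions(.zstd, compressed, allocator, .{
        .dictionary = dict,
    });
    defer allocator.free(decompressed);

    try std.testing.expectEqualStrings(original, decompressed);
}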

Zlib dictionaries work slightly differently:

// A zlib dictionary is raw bytes, not a trained file
const zlib_dict = "common patterns here...";

const compressed = try cz.compressWithOptions(.zlib, data, allocator, .{
    .dictionary = zlib_dict,
});
const decompressed = try cz.decompressWithOptions(.zlib, compressed, allocator, .{
    .dictionary = zlib_dict,
});

Zlib dictionaries:

  • Don’t need training (any string of common patterns works)
  • Limited to 32 KB (the Deflate window size)
  • Identified by the dictionary’s Adler-32 checksum (the DICTID header field)
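
Because the DICTID field in the zlib header is defined as the Adler-32 of the dictionary bytes (RFC 1950), the expected ID can be computed up front. A sketch using Zig's standard library:

const std = @import("std");

// The zlib DICTID for a dictionary: Adler-32 of its raw bytes (RFC 1950)
fn zlibDictId(dict: []const u8) u32 {
    return std.hash.Adler32.hash(dict);
}
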
Handle a mismatched dictionary explicitly:

const result = cz.decompressWithOptions(.zstd, data, allocator, .{
    .dictionary = possibly_wrong_dict,
}) catch |err| switch (err) {
    error.DictionaryMismatch => {
        // Dictionary doesn't match the compressed data
        std.debug.print("Wrong dictionary for this data\n", .{});
        return error.InvalidData;
    },
    error.InvalidData => {
        // Data corrupted or wrong format
        return error.InvalidData;
    },
    else => return err,
};
Operation    Without Dict   With Dict    Overhead
Compress     Baseline       +5-15% CPU   Dictionary lookup
Decompress   Baseline       +2-5% CPU    Dictionary copy
Memory       Baseline       +dict size   Dictionary storage

The CPU overhead is typically worth the 30-60% size reduction for small data.