Dictionary Compression
Dictionary compression dramatically improves compression ratios for small data with known patterns. This is especially useful for APIs, message formats, and structured data.
How Dictionaries Work
Without a dictionary, compressors build patterns from scratch for each input. With a dictionary, they start with pre-computed patterns.
Supported Codecs
| Codec | Dictionary Support |
|---|---|
| Zstd | ✅ Full support |
| Zlib | ✅ Full support |
| Deflate | ✅ Full support |
| LZ4 | ❌ Not supported |
| Snappy | ❌ Not supported |
| Gzip | ❌ Not supported |
| Brotli | ❌ (has built-in static dict) |
Basic Usage
```zig
const cz = @import("compressionz");

// Dictionary containing common patterns
const dictionary = @embedFile("my_dictionary.bin");

// Compress with the dictionary
const compressed = try cz.compressWithOptions(.zstd, data, allocator, .{
    .dictionary = dictionary,
});
defer allocator.free(compressed);

// Decompress with the SAME dictionary
const decompressed = try cz.decompressWithOptions(.zstd, compressed, allocator, .{
    .dictionary = dictionary, // Must match!
});
defer allocator.free(decompressed);
```

Why Dictionary Compression?
The Problem
Small data compresses poorly because:
- Too few bytes in which to find repeated patterns
- Huffman/entropy tables need data before they pay off
- No prior context to exploit
The Solution
A dictionary provides:
- Pre-computed common patterns
- Ready-to-use Huffman/entropy codes
- Instant context for compression
Real-World Impact
| Data Size | Without Dict | With Dict | Improvement |
|---|---|---|---|
| 100 B | 105 B (larger!) | 45 B | 57% smaller |
| 500 B | 420 B | 180 B | 57% smaller |
| 1 KB | 780 B | 380 B | 51% smaller |
| 5 KB | 3.2 KB | 1.9 KB | 41% smaller |
| 50 KB | 28 KB | 24 KB | 14% smaller |
Key insight: Dictionary compression is most effective for small data (< 10 KB).
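To measure the effect on your own payloads, a minimal sketch along these lines compares the two paths. It assumes `compressWithOptions` accepts an empty options struct when no dictionary is supplied, and that `payload` and `dictionary` come from your application:

```zig
const std = @import("std");
const cz = @import("compressionz");

// Compress one payload with and without the dictionary and report both sizes.
pub fn reportDictionaryGain(allocator: std.mem.Allocator, payload: []const u8, dictionary: []const u8) !void {
    const plain = try cz.compressWithOptions(.zstd, payload, allocator, .{});
    defer allocator.free(plain);

    const with_dict = try cz.compressWithOptions(.zstd, payload, allocator, .{
        .dictionary = dictionary,
    });
    defer allocator.free(with_dict);

    std.debug.print("without dict: {d} B, with dict: {d} B\n", .{ plain.len, with_dict.len });
}
```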
Creating Dictionaries
Manual Dictionary
For simple cases, create a dictionary with common patterns:
```zig
const json_dictionary =
    \\{"id":,"name":,"email":,"status":,"created_at":
    \\,"updated_at":,"type":"user","type":"admin"
    \\,"active":true,"active":false,"error":null
    \\,"message":"success","message":"error"
;

const compressed = try cz.compressWithOptions(.zstd, json_data, allocator, .{
    .dictionary = json_dictionary,
});
```

Trained Dictionary (Zstd)
For best results, train a dictionary on representative samples:
```sh
# Collect 1000+ representative samples
ls samples/*.json > files.txt

# Train a dictionary (32 KB is a good size)
zstd --train --maxdict=32768 -o my_dictionary.bin $(cat files.txt)
```

Then use it in Zig:
```zig
const dictionary = @embedFile("my_dictionary.bin");

const compressed = try cz.compressWithOptions(.zstd, data, allocator, .{
    .dictionary = dictionary,
});
```

Dictionary Size Guidelines
| Use Case | Recommended Size |
|---|---|
| JSON API responses | 16-32 KB |
| Log messages | 32-64 KB |
| Protocol buffers | 8-16 KB |
| HTML templates | 64-128 KB |
Larger dictionaries provide diminishing returns and increase memory usage.
Use Cases
API Responses
```zig
const std = @import("std");
const cz = @import("compressionz");

// Pre-loaded dictionary for API responses
const api_dict = @embedFile("api_dictionary.bin");

pub fn compressResponse(data: []const u8, allocator: std.mem.Allocator) ![]u8 {
    return cz.compressWithOptions(.zstd, data, allocator, .{
        .dictionary = api_dict,
    });
}

pub fn decompressRequest(data: []const u8, allocator: std.mem.Allocator) ![]u8 {
    return cz.decompressWithOptions(.zstd, data, allocator, .{
        .dictionary = api_dict,
    });
}
```

Message Queues
```zig
const std = @import("std");
const cz = @import("compressionz");

const MessageCompressor = struct {
    dictionary: []const u8,
    allocator: std.mem.Allocator,

    pub fn compress(self: *MessageCompressor, message: []const u8) ![]u8 {
        return cz.compressWithOptions(.zstd, message, self.allocator, .{
            .dictionary = self.dictionary,
        });
    }

    pub fn decompress(self: *MessageCompressor, compressed: []const u8) ![]u8 {
        return cz.decompressWithOptions(.zstd, compressed, self.allocator, .{
            .dictionary = self.dictionary,
        });
    }
};
```
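A minimal usage sketch, assuming a trained `queue_dictionary.bin` is embedded in the binary and that `allocator` and `message_bytes` come from the surrounding code:

```zig
var queue_compressor = MessageCompressor{
    .dictionary = @embedFile("queue_dictionary.bin"),
    .allocator = allocator,
};

// Round-trip one message through the shared-dictionary compressor.
const packed_msg = try queue_compressor.compress(message_bytes);
defer allocator.free(packed_msg);

const original = try queue_compressor.decompress(packed_msg);
defer allocator.free(original);
```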
Database Records

```zig
const std = @import("std");
const cz = @import("compressionz");

pub fn storeRecord(db: *Database, key: []const u8, value: []const u8, allocator: std.mem.Allocator) !void {
    const compressed = try cz.compressWithOptions(.zstd, value, allocator, .{
        .dictionary = db.compression_dict,
    });
    defer allocator.free(compressed);

    try db.put(key, compressed);
}

pub fn loadRecord(db: *Database, key: []const u8, allocator: std.mem.Allocator) !?[]u8 {
    const compressed = db.get(key) orelse return null;

    return cz.decompressWithOptions(.zstd, compressed, allocator, .{
        .dictionary = db.compression_dict,
    });
}
```

Dictionary Versioning
Critical: Decompression requires the exact same dictionary used for compression.
Strategy 1: Single Global Dictionary
```zig
// Never change this dictionary (breaks existing data)
const DICTIONARY_V1 = @embedFile("dict_v1.bin");
```

Strategy 2: Versioned Dictionaries
```zig
const std = @import("std");
const cz = @import("compressionz");

const dictionaries = struct {
    const v1: []const u8 = @embedFile("dict_v1.bin");
    const v2: []const u8 = @embedFile("dict_v2.bin");
    const v3: []const u8 = @embedFile("dict_v3.bin");
};

pub fn decompress(data: []const u8, version: u8, allocator: std.mem.Allocator) ![]u8 {
    const dict = switch (version) {
        1 => dictionaries.v1,
        2 => dictionaries.v2,
        3 => dictionaries.v3,
        else => return error.UnknownDictionaryVersion,
    };

    return cz.decompressWithOptions(.zstd, data, allocator, .{
        .dictionary = dict,
    });
}
```

Strategy 3: Dictionary ID
Zstd dictionaries include a 32-bit ID:
```zig
const std = @import("std");

fn getDictionaryId(dict: []const u8) u32 {
    // Zstd dictionary magic number, then the 32-bit dictionary ID at offset 4
    if (dict.len < 8) return 0;
    return std.mem.readInt(u32, dict[4..8], .little);
}
```
```zig
pub fn selectDictionary(compressed: []const u8) ?[]const u8 {
    // Parse Zstd frame header to get dictionary ID
    // Match against known dictionaries
    // ...
}
```
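For the first step, here is a sketch of reading the dictionary ID straight out of a Zstd frame header (magic number, frame header descriptor byte, optional window descriptor, then a 0/1/2/4-byte dictionary ID field, per the Zstd frame format). `readFrameDictionaryId` is a hypothetical helper, not a compressionz API:

```zig
const std = @import("std");

// Returns the dictionary ID recorded in a Zstd frame header, or null if the
// frame carries none (or the data is too short / not a Zstd frame).
fn readFrameDictionaryId(compressed: []const u8) ?u32 {
    if (compressed.len < 6) return null;
    if (std.mem.readInt(u32, compressed[0..4], .little) != 0xFD2FB528) return null;

    const descriptor = compressed[4];
    // Bits 0-1: Dictionary_ID_flag (field size 0, 1, 2, or 4 bytes).
    const did_size: usize = switch (descriptor & 0x03) {
        0 => return null,
        1 => 1,
        2 => 2,
        else => 4,
    };
    // Bit 5: Single_Segment_flag; the window descriptor byte is absent when set.
    const did_offset: usize = if ((descriptor & 0x20) != 0) 5 else 6;
    if (compressed.len < did_offset + did_size) return null;

    return switch (did_size) {
        1 => compressed[did_offset],
        2 => std.mem.readInt(u16, compressed[did_offset..][0..2], .little),
        else => std.mem.readInt(u32, compressed[did_offset..][0..4], .little),
    };
}
```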
Best Practices

- ✅ Train on representative samples
- ✅ Keep dictionaries versioned
- ✅ Store dictionary ID with compressed data (see the sketch after this list)
- ✅ Test decompression with dictionary
- ✅ Use for small, structured data
- ❌ Change dictionaries after deployment
- ❌ Use random data as dictionary
- ❌ Over-size dictionaries (diminishing returns)
- ❌ Use for large data (> 50 KB)
- ❌ Forget to include dictionary in deployment
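One way to store the dictionary ID with the data is to frame each compressed record with the dictionary version it was written with. A minimal sketch; the framing format and `packWithDictVersion` are hypothetical, not part of compressionz:

```zig
const std = @import("std");

// Prepend a one-byte dictionary version so the reader can pick the matching
// dictionary later (see Strategy 2 above). Caller owns the returned buffer.
pub fn packWithDictVersion(allocator: std.mem.Allocator, version: u8, compressed: []const u8) ![]u8 {
    const framed = try allocator.alloc(u8, compressed.len + 1);
    framed[0] = version;
    @memcpy(framed[1..], compressed);
    return framed;
}
```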
Zlib Dictionary Notes
Zlib dictionaries work slightly differently:
```zig
// Zlib dictionary is raw bytes, not trained
const zlib_dict = "common patterns here...";

const compressed = try cz.compressWithOptions(.zlib, data, allocator, .{
    .dictionary = zlib_dict,
});

const decompressed = try cz.decompressWithOptions(.zlib, compressed, allocator, .{
    .dictionary = zlib_dict,
});
```

Zlib dictionaries:
- Don’t need training (just common strings)
- Limited to 32 KB
- Identified by the dictionary's Adler-32 checksum, stored in the zlib header
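That last point means the zlib header itself identifies the expected dictionary: its DICTID field holds the Adler-32 checksum of the preset dictionary. A small sketch using the standard library's Adler-32; the helper name is illustrative:

```zig
const std = @import("std");

// The zlib header's DICTID field is the Adler-32 checksum of the preset
// dictionary, so this checksum can be used to select the right dictionary.
fn zlibDictionaryId(dict: []const u8) u32 {
    return std.hash.Adler32.hash(dict);
}
```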
Error Handling
```zig
const result = cz.decompressWithOptions(.zstd, data, allocator, .{
    .dictionary = possibly_wrong_dict,
}) catch |err| switch (err) {
    error.DictionaryMismatch => {
        // Dictionary doesn't match compressed data
        std.debug.print("Wrong dictionary for this data\n", .{});
        return error.InvalidData;
    },
    error.InvalidData => {
        // Data corrupted or wrong format
        return error.InvalidData;
    },
    else => return err,
};
```

Performance Impact
| Operation | Without Dict | With Dict | Overhead |
|---|---|---|---|
| Compress | Baseline | +5-15% CPU | Dictionary lookup |
| Decompress | Baseline | +2-5% CPU | Dictionary copy |
| Memory | Baseline | +dict size | Dictionary storage |
The CPU overhead is typically worth the 30-60% size reduction for small data.