]>
Commit | Line | Data |
---|---|---|
11fdf7f2 TL |
1 | This is a compression algorithm focused on finding long distance matches. |
2 | ||
3 | It is based upon lz4 and uses nearly the same block format (github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md). The number of bytes to encode the offset is four instead of two in lz4 to reflect the longer distance matching. The block format is described in `ldm.h`. | |
4 | ||
5 | ### Build | |
6 | ||
7 | Run `make`. | |
8 | ||
9 | ### Compressing a file | |
10 | ||
11 | `ldm <filename>` | |
12 | ||
13 | Decompression and verification can be enabled by defining `DECOMPRESS_AND_VERIFY` in `main.c`. | |
14 | The output file names are as follows: | |
15 | - `<filename>.ldm` : compressed file | |
16 | - `<filename>.ldm.dec` : decompressed file | |
17 | ||
18 | ### Parameters | |
19 | ||
20 | There are various parameters that can be tuned. These parameters can be tuned in `ldm.h` or, alternatively if `ldm_params.h` is included, in `ldm_params.h` (for easier configuration). | |
21 | ||
22 | The parameters are as follows and must all be defined: | |
23 | - `LDM_MEMORY_USAGE` : the memory usage of the underlying hash table in bytes. | |
24 | - `HASH_BUCKET_SIZE_LOG` : the log size of each bucket in the hash table (used in collision resolution). | |
25 | - `LDM_LAG` : the lag (in bytes) in inserting entries into the hash table. | |
26 | - `LDM_WINDOW_SIZE_LOG` : the log maximum window size when searching for matches. | |
27 | - `LDM_MIN_MATCH_LENGTH` : the minimum match length. | |
28 | - `INSERT_BY_TAG` : insert entries into the hash table as a function of the hash. This increases speed by reducing the number of hash table lookups and match comparisons. Certain hashes will never be inserted. | |
29 | - `USE_CHECKSUM` : store a checksum with the hash table entries for faster comparison. This halves the number of entries the hash table can contain. | |
30 | ||
31 | The optional parameter `HASH_ONLY_EVERY_LOG` is the log inverse frequency of insertion into the hash table. That is, an entry is inserted approximately every `1 << HASH_ONLY_EVERY_LOG` times. If this parameter is not defined, the value is computed as a function of the window size and memory usage to approximate an even coverage of the window. | |
32 | ||
33 | ||
34 | ### Benchmark | |
35 | ||
36 | Below is a comparison of various compression methods on a tar of four versions of llvm (versions `3.9.0`, `3.9.1`, `4.0.0`, `4.0.1`) with a total size of `727900160` B. | |
37 | ||
38 | | Method | Size | Ratio | | |
39 | |:---|---:|---:| | |
40 | |lrzip -p 32 -n -w 1 | `369968714` | `1.97`| | |
41 | |ldm | `209391361` | `3.48`| | |
42 | |lz4 | `189954338` | `3.83`| | |
43 | |lrzip -p 32 -l -w 1 | `163940343` | `4.44`| | |
44 | |zstd -1 | `126080293` | `5.77`| | |
45 | |lrzip -p 32 -n | `124821009` | `5.83`| | |
46 | |lrzip -p 32 -n -w 1 & zstd -1 | `120317909` | `6.05`| | |
47 | |zstd -3 -o | `115290952` | `6.31`| | |
48 | |lrzip -p 32 -g -L 9 -w 1 | `107168979` | `6.79`| | |
49 | |zstd -6 -o | `102772098` | `7.08`| | |
50 | |zstd -T16 -9 | `98040470` | `7.42`| | |
51 | |lrzip -p 32 -n -w 1 & zstd -T32 -19 | `88050289` | `8.27`| | |
52 | |zstd -T32 -19 | `83626098` | `8.70`| | |
53 | |lrzip -p 32 -n & zstd -1 | `36335117` | `20.03`| | |
54 | |ldm & zstd -6 | `32856232` | `22.15`| | |
55 | |lrzip -p 32 -g -L 9 | `32243594` | `22.58`| | |
56 | |lrzip -p 32 -n & zstd -6 | `30954572` | `23.52`| | |
57 | |lrzip -p 32 -n & zstd -T32 -19 | `26472064` | `27.50`| | |
58 | ||
59 | The method marked `ldm` was run with the following parameters: | |
60 | ||
61 | | Parameter | Value | | |
62 | |:---|---:| | |
63 | | `LDM_MEMORY_USAGE` | `23`| | |
64 | |`HASH_BUCKET_SIZE_LOG` | `3`| | |
65 | |`LDM_LAG` | `0`| | |
66 | |`LDM_WINDOW_SIZE_LOG` | `28`| | |
67 | |`LDM_MIN_MATCH_LENGTH`| `64`| | |
68 | |`INSERT_BY_TAG` | `1`| | |
69 | |`USE_CHECKSUM` | `1`| | |
70 | ||
71 | The compression speed was `220.5 MB/s`. | |
72 | ||
73 | ### Parameter selection | |
74 | ||
75 | Below is a brief discussion of the effects of the parameters on the speed and compression ratio. | |
76 | ||
77 | #### Speed | |
78 | ||
79 | A large bottleneck in terms of speed is finding the matches and comparing to see if they are greater than the minimum match length. Generally: | |
80 | - The fewer matches found (or the lower the percentage of the literals matched), the slower the algorithm will behave. | |
81 | - Increasing `HASH_ONLY_EVERY_LOG` results in fewer inserts and, if `INSERT_BY_TAG` is set, fewer lookups in the table. This has a large effect on speed, as well as compression ratio. | |
82 | - If `HASH_ONLY_EVERY_LOG` is not set, its value is calculated based on `LDM_WINDOW_SIZE_LOG` and `LDM_MEMORY_USAGE`. Increasing `LDM_WINDOW_SIZE_LOG` has the effect of increasing `HASH_ONLY_EVERY_LOG` and increasing `LDM_MEMORY_USAGE` decreases `HASH_ONLY_EVERY_LOG`. | |
83 | - `USE_CHECKSUM` generally improves speed with hash table lookups. | |
84 | ||
85 | #### Compression ratio | |
86 | ||
87 | The compression ratio is highly correlated with the coverage of matches. As a long distance matcher, the algorithm was designed to "optimize" for long distance matches outside the zstd compression window. The compression ratio after recompressing the output of the long-distance matcher with zstd was a more important signal in development than the raw compression ratio itself. | |
88 | ||
89 | Generally, increasing `LDM_MEMORY_USAGE` will improve the compression ratio. However when using the default computed value of `HASH_ONLY_EVERY_LOG`, this increases the frequency of insertion and lookup in the table and thus may result in a decrease in speed. | |
90 | ||
91 | Below is a table showing the speed and compression ratio when compressing the llvm tar (as described above) using different settings for `LDM_MEMORY_USAGE`. The other parameters were the same as used in the benchmark above. | |
92 | ||
93 | | `LDM_MEMORY_USAGE` | Ratio | Speed (MB/s) | Ratio after zstd -6 | | |
94 | |---:| ---: | ---: | ---: | | |
95 | | `18` | `1.85` | `232.4` | `10.92` | | |
96 | | `21` | `2.79` | `233.9` | `15.92` | | |
97 | | `23` | `3.48` | `220.5` | `18.29` | | |
98 | | `25` | `4.56` | `140.8` | `19.21` | | |
99 | ||
100 | ### Compression statistics | |
101 | ||
102 | Compression statistics (and the configuration) can be enabled/disabled via `COMPUTE_STATS` and `OUTPUT_CONFIGURATION` in `ldm.h`. |