XZ

High compression ratio data compression library and tools

Overview

XZ is a free, general-purpose data compression library and set of command-line tools that provide a high compression ratio.

Warning

The version of XZ affected by the embedded backdoor code, as outlined in the CERT.EU advisory 2024-032, was never installed on Discoverer and will never be added to our software repository.

Note

We provide an XZ installation that is faster and more reliable than the system-wide one, so we recommend using it instead of the system-wide one (see below).

Available versions

To view available xz versions:

$ module avail xz

Build recipes and configuration details are maintained in our GitLab repository:

Build optimizations

Our XZ installations are optimized for maximum performance on Discoverer’s hardware. We build the XZ library code with recent LLVM Compiler Infrastructure compilers, which are the default compilers on the Discoverer Petascale Supercomputer.

Compiler optimizations:

  • Link Time Optimization (LTO): Full LTO (-flto=full) is enabled for both compilation and linking, allowing cross-module optimizations that significantly improve performance.
  • CPU-Specific Optimizations:
      • -march=native: Optimizes for the native CPU architecture, enabling all available instruction sets
      • -mtune=native: Tunes the generated code specifically for the target CPU
      • -mfma: Enables FMA (Fused Multiply-Add) instructions for improved floating-point performance
  • Position Independent Code: -fPIC is used to enable shared library support.

Linker optimizations:

  • LLD Linker: We use LLVM’s LLD linker (CMAKE_LINKER_TYPE=LLD) for faster linking and better optimization support.
  • LTO at Link Time: -flto=full -Wl,--lto-O3 enables full link-time optimization with optimization level 3, allowing the linker to perform whole-program optimizations.

Build configuration:

  • Release Build: CMAKE_BUILD_TYPE=Release ensures all optimizations are enabled.
  • Hardware-Accelerated CRC: XZ_CLMUL_CRC=ON enables CLMUL (Carry-less Multiplication) hardware acceleration for CRC32 and CRC64 checksums, providing significant performance improvements on modern CPUs.
  • Multi-threading: XZ_THREADS=yes enables multi-threaded compression and decompression support.
  • Match Finders: Multiple match finder algorithms are enabled (hc3;hc4;bt2;bt3;bt4) to provide the best compression ratio and speed trade-offs.
  • Checksum Support: All checksum types are enabled (crc32;crc64;sha256) for data integrity verification.
  • Memory Optimization: XZ_ASSUME_RAM="512" sets the amount of RAM (512 MiB) that XZ assumes when it cannot determine the real amount of memory at run time.
  • Full Feature Set: XZ_SMALL=OFF ensures all features are enabled, prioritizing performance over binary size.

Build system:

  • Build Tool: Ninja build system is used for fast parallel builds.
  • Parallel Compilation: Builds use 4 parallel compilation jobs for efficient resource utilization.
  • Testing: All builds are tested using the comprehensive test suite (ctest) before installation, ensuring correctness and reliability.
  • Dual Library Builds: Both shared (.so) and static (.a) libraries are built and installed, providing flexibility for different use cases.

These optimizations are intended to give our XZ installation the best compression and decompression performance for CPU-based applications on Discoverer, while maintaining full compatibility with the standard XZ API.

Compiler support

Warning

From now on, we will support only LLVM builds of XZ. No other builds will be officially supported.

Supported builds

Production builds:

module avail xz

Legacy builds (deprecated, retiring soon; do not use):

module avail xz/*/*llvm   # LLVM build (this is not default but we do not use the compiler name in the module name)
module avail xz/*/*gcc    # GCC build (deprecated, will be retired soon)
module avail xz/*/*intel  # Intel oneAPI build (deprecated, will be retired soon)
module avail xz/*/*aocc   # AMD AOCC build (deprecated, will be retired soon)

Available libraries

XZ provides the liblzma shared library that is installed by default:

liblzma.so - LZMA compression library

This library implements the LZMA (Lempel-Ziv-Markov chain Algorithm) compression algorithm, providing high compression ratios with good performance.

  • Header file: lzma.h
  • Link flag: -llzma
  • pkg-config: liblzma

Note

The library uses optimized implementations and can be used in both C and C++ applications.

Library variants

The liblzma library is available as both static (.a) and shared (.so) libraries. The Environment Modules automatically configure the appropriate paths for dynamic linking, which is the recommended approach for HPC environments.

Shared libraries (recommended):
  • liblzma.so is used by default
  • Automatically configured when loading the module
  • Recommended for HPC environments
Static libraries:
  • liblzma.a is also available
  • Use only if your application specifically requires static linking
  • Requires explicit -static flag during linking

Linking your application

After loading the xz module, the environment variables are automatically configured. You can link your application using one of the following methods:

Method 1: Using environment variables (recommended)

# Load the module first
module load xz/<version>

# Link against liblzma - C code
gcc -o myapp myapp.c $CFLAGS $LDFLAGS -llzma
clang -o myapp myapp.c $CFLAGS $LDFLAGS -llzma

# Link against liblzma - C++ code
g++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -llzma
clang++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -llzma

Method 2: Using pkg-config

# Load the module first
module load xz/<version>

# Link against liblzma - C code
gcc -o myapp myapp.c $(pkg-config --cflags --libs liblzma)
clang -o myapp myapp.c $(pkg-config --cflags --libs liblzma)

# Link against liblzma - C++ code
g++ -o myapp myapp.cpp $(pkg-config --cflags --libs liblzma)
clang++ -o myapp myapp.cpp $(pkg-config --cflags --libs liblzma)

Method 3: Manual linking

# Load the module first
module load xz/<version>

# Link against liblzma - C code
gcc -o myapp myapp.c -I$XZ_ROOT/include -L$XZ_ROOT/lib64 -llzma
clang -o myapp myapp.c -I$XZ_ROOT/include -L$XZ_ROOT/lib64 -llzma

# Link against liblzma - C++ code
g++ -o myapp myapp.cpp -I$XZ_ROOT/include -L$XZ_ROOT/lib64 -llzma
clang++ -o myapp myapp.cpp -I$XZ_ROOT/include -L$XZ_ROOT/lib64 -llzma

Static linking (if required):

If your application specifically requires static linking:

# C code
gcc -o myapp myapp.c $CFLAGS $LDFLAGS -llzma -static
clang -o myapp myapp.c $CFLAGS $LDFLAGS -llzma -static

# C++ code
g++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -llzma -static
clang++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -llzma -static

Note

The Environment Modules automatically set CFLAGS, CXXFLAGS, and LDFLAGS when you load the module. Using these variables is the recommended approach as they remain correct even if the module path changes.

Replacing the system-wide xz installation

To use the liblzma.so library from our installation instead of relying on the system-wide installation:

module load xz/<version>
./your_program   # will automatically use liblzma.so from xz installation

This works because loading the module configures the paths used for dynamic linking, so your executable resolves liblzma.so from our installation before the system-wide one.
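
To confirm which liblzma.so a dynamically linked program actually resolves, you can inspect the output of ldd after loading the module. The sketch below is illustrative only: liblzma_path_from_ldd is a hypothetical helper name, and the xz binary found in PATH is used merely as a convenient example of a program that normally links against liblzma.

```shell
# Hypothetical helper (not part of XZ or the module): read ldd output on
# stdin and print the path that the liblzma entry resolves to.
liblzma_path_from_ldd() {
    awk '$1 ~ /^liblzma\.so/ { print $3 }'
}

# Example: check the xz binary that is first in PATH.
# Replace it with ./your_program to check your own executable.
ldd "$(command -v xz)" 2>/dev/null | liblzma_path_from_ldd
```

If the printed path lies inside the module's installation directory rather than a system directory such as /usr/lib64, the program is using our build. An empty result means the binary does not link dynamically against liblzma.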

Command-line utilities

XZ provides a comprehensive set of command-line utilities for compression, decompression, and working with compressed files. After loading the xz module, these utilities are available in your PATH.

Main compression/decompression tools:

xz - Main compression and decompression tool

The primary utility for compressing and decompressing files in the .xz format. It can also handle .lzma files when invoked under different names (see polymorphism below).

  • Compresses files to .xz format
  • Decompresses .xz and .lzma files
  • Supports various compression levels (0-9)
  • Supports multi-threading for faster compression
xzdec - Simple decompressor
A lightweight decompression-only tool for .xz files. It is smaller than xz and does not support compression, making it useful for embedded systems or when only decompression is needed.
lzmadec - Simple LZMA decompressor
A lightweight decompression-only tool specifically for .lzma files. Similar to xzdec but for the legacy LZMA format.
lzmainfo - LZMA file information
Displays the information stored in the header of .lzma compressed files, such as the uncompressed size and the dictionary size.

Convenience tools (symlinks to xz):

Several tools are implemented as symlinks to the main xz binary. The xz program detects which name it was invoked under (using argv[0]) and adjusts its behavior accordingly. This polymorphism allows one binary to provide multiple interfaces:

xzcat -> xz
Decompresses .xz files to standard output (equivalent to xz -dc). Useful for piping decompressed data to other commands.
lzcat -> xz
Decompresses .lzma files to standard output. Provides compatibility with legacy LZMA format.
lzma -> xz
Compresses files to .lzma format (legacy format). When invoked as lzma, the tool uses the older LZMA format instead of the newer .xz format.
unxz -> xz
Decompresses .xz files (equivalent to xz -d). Provides an intuitive name for decompression operations.
unlzma -> xz
Decompresses .lzma files. Provides compatibility with legacy LZMA format.

Comparison tools:

xzdiff - Compare compressed files
Compares two compressed files by decompressing them and running diff. Useful for comparing versions of files stored in compressed format.
xzcmp -> xzdiff
Compares compressed files using cmp instead of diff. Useful for binary file comparisons.
lzdiff -> xzdiff
Compares .lzma compressed files using diff.
lzcmp -> xzdiff
Compares .lzma compressed files using cmp.

Search tools:

xzgrep - Search compressed files
Searches compressed files for patterns using grep. Decompresses files on-the-fly and searches the content without requiring manual decompression.
xzegrep -> xzgrep
Searches compressed files using egrep (extended regular expressions).
xzfgrep -> xzgrep
Searches compressed files using fgrep (fixed strings).
lzgrep -> xzgrep
Searches .lzma compressed files using grep.
lzegrep -> xzgrep
Searches .lzma compressed files using egrep.
lzfgrep -> xzgrep
Searches .lzma compressed files using fgrep.

Viewing tools:

xzless - View compressed files with less
Views compressed files using the less pager. Decompresses files on-the-fly for viewing.
xzmore - View compressed files with more
Views compressed files using the more pager. Decompresses files on-the-fly for viewing.
lzless -> xzless
Views .lzma compressed files using less.
lzmore -> xzmore
Views .lzma compressed files using more.

How polymorphism works:

The XZ utilities use a common Unix pattern called “name-based polymorphism” or “argv[0] polymorphism”. When a program is invoked, the operating system passes the program name as the first argument (argv[0]). The xz binary checks this name to determine its behavior:

  • If invoked as xz, it compresses/decompresses .xz files
  • If invoked as lzma, it compresses/decompresses .lzma files
  • If invoked as xzcat or lzcat, it decompresses to stdout
  • If invoked as unxz or unlzma, it forces decompression mode

This design allows:

  • Space efficiency: One binary provides multiple tools
  • Consistency: All tools share the same core implementation and behavior
  • Compatibility: Legacy tool names (like lzma) continue to work
  • Flexibility: Users can choose the most intuitive name for their task

All symlinks point to the same xz binary, which adapts its behavior based on how it was invoked. This is why you can use xzcat, lzcat, unxz, or unlzma and they all work correctly despite being the same underlying program.
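
The dispatch on the invocation name can be sketched in a few lines of shell. This is an illustration of the pattern only, not code taken from the xz sources: describe_invocation is a hypothetical function, and the format/mode/output labels it prints are invented for this example to mirror the list above.

```shell
# Illustrative only: mimics name-based dispatch for the names listed above.
# describe_invocation is a hypothetical helper, not part of XZ.
describe_invocation() {
    # Keep only the file name, dropping any directory part of the path,
    # as a program would do with argv[0].
    name=${1##*/}
    case "$name" in
        xzcat)  echo "format=xz mode=decompress output=stdout" ;;
        lzcat)  echo "format=lzma mode=decompress output=stdout" ;;
        unxz)   echo "format=xz mode=decompress output=file" ;;
        unlzma) echo "format=lzma mode=decompress output=file" ;;
        lzma)   echo "format=lzma mode=compress output=file" ;;
        *)      echo "format=xz mode=compress output=file" ;;
    esac
}

# In a real multi-call script the name would come from "$0".
describe_invocation /usr/bin/xzcat
describe_invocation unlzma
```

A script written this way could be installed once and symlinked under each name, which is the same arrangement the xz symlinks use.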

Example usage:

# Load the module
module load xz/<version>

# Compress a file
xz myfile.txt              # Creates myfile.txt.xz

# Decompress a file
unxz myfile.txt.xz         # Restores myfile.txt
# or
xzcat myfile.txt.xz        # Decompresses to stdout

# Search in compressed files
xzgrep "pattern" *.xz

# View compressed file
xzless archive.xz

# Compare compressed files
xzdiff file1.xz file2.xz
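
The compression levels and multi-threading mentioned earlier are selected with command-line options. The following sketch works on a throwaway file in a temporary directory (the name numbers.txt and the thread count of 2 are arbitrary choices for this example) and uses the standard xz options -k, -9, --threads, -t, and -l:

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
seq 1 20000 > "$workdir/numbers.txt"

# -k keeps the input file, -9 selects the highest preset level,
# --threads=2 allows up to two worker threads.
xz -k -9 --threads=2 "$workdir/numbers.txt"

# -t tests the integrity of the compressed file without writing output.
xz -t "$workdir/numbers.txt.xz" && echo "integrity OK"

# -l lists compressed size, uncompressed size, ratio, and check type.
xz -l "$workdir/numbers.txt.xz"
```

Higher preset levels generally trade compression time and memory for a smaller output, so it is worth trying a lower level first on large inputs.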

Warning

When processing large files or multiple files, use Slurm batch jobs to execute these utilities on compute nodes rather than login nodes.
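
A batch job for this purpose might look like the sketch below. It is a starting point under stated assumptions, not a site-approved template: the job name, core count, time limit, the *.dat file pattern, and the compress_all helper are all hypothetical, and site-specific options such as the partition and account are deliberately left out. The module is loaded without a version here, which selects the default one; pin a version with xz/<version> if you need a specific build.

```shell
#!/bin/bash
#SBATCH --job-name=xz-compress     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8          # assumed core count; adjust to your needs
#SBATCH --time=00:30:00            # assumed time limit

# Load the xz module when the module command is available.
if command -v module >/dev/null 2>&1; then
    module load xz
fi

# Use as many xz threads as Slurm granted; fall back to 1 outside a job.
threads=${SLURM_CPUS_PER_TASK:-1}

# Hypothetical helper: compress every *.dat file in the given directory.
compress_all() {
    for f in "$1"/*.dat; do
        [ -e "$f" ] || continue              # skip when nothing matches
        xz --threads="$threads" -k "$f"      # -k keeps the original file
    done
}

# Compress files in the directory given as the first argument,
# or in the current directory when no argument is supplied.
compress_all "${1:-.}"
```

Saved as, for example, compress.sh, the script would be submitted with sbatch compress.sh /path/to/data, where the path is a placeholder for your own data directory.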

Getting help

For additional assistance: