SQLite
======

Self-contained, serverless SQL database engine

.. toctree::
   :maxdepth: 1
   :caption: Contents:

Overview
--------

`SQLite <https://www.sqlite.org/>`_ is a C-language library that implements a small, fast,
self-contained, high-reliability, full-featured SQL database engine. SQLite is the most
widely deployed database engine in the world, used in countless applications from web
browsers to mobile devices and embedded systems.

Role in ML and AI Workflows:

SQLite plays a practical role in machine learning and artificial intelligence workflows,
particularly in the data pre-processing and post-processing stages. Its lightweight,
serverless architecture makes it well suited to managing metadata, feature stores,
experiment tracking, and intermediate results in ML pipelines.

Data Pre-processing:

- **Feature Engineering**: Store and query feature vectors, metadata, and transformations
- **Data Validation**: Track data quality metrics, missing values, and validation results
- **Data Sampling**: Efficiently sample datasets for training, validation, and testing
- **Metadata Management**: Store dataset statistics, schema information, and preprocessing
  parameters

Data Post-processing:

- **Results Storage**: Store model predictions, confidence scores, and evaluation metrics
- **Experiment Tracking**: Log hyperparameters, model versions, and performance metrics
- **Analysis and Reporting**: Query and aggregate results for performance analysis
- **Model Registry**: Track model versions, deployment status, and performance history

Example Use Cases:

Feature Store:

.. code-block:: sql

   -- Store extracted features for ML models
   CREATE TABLE features (
       sample_id INTEGER PRIMARY KEY,
       feature_vector BLOB,
       label REAL,
       extraction_timestamp TEXT
   );

   -- Query features for training
   SELECT feature_vector, label
   FROM features
   WHERE extraction_timestamp > '2024-01-01';

Experiment Tracking:

.. code-block:: sql

   -- Track ML experiments and results
   CREATE TABLE experiments (
       experiment_id INTEGER PRIMARY KEY,
       model_name TEXT,
       hyperparameters TEXT,
       accuracy REAL,
       training_time REAL,
       timestamp TEXT
   );

   -- Find best performing models
   SELECT model_name, accuracy, hyperparameters
   FROM experiments
   ORDER BY accuracy DESC
   LIMIT 10;

Prediction Storage:

.. code-block:: sql

   -- Store model predictions for analysis
   CREATE TABLE predictions (
       id INTEGER PRIMARY KEY,
       input_id INTEGER,
       predicted_label REAL,
       confidence REAL,
       model_version TEXT
   );

   -- Analyse prediction distribution
   SELECT model_version, AVG(confidence), COUNT(*)
   FROM predictions
   GROUP BY model_version;
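To make the experiment-tracking pattern concrete, here is a minimal Python sketch that
logs a single training run into the ``experiments`` table defined above. The model name,
hyperparameters, and metric values are illustrative placeholders, not output from a real
training run:

.. code-block:: python

   import json
   import sqlite3
   import time

   conn = sqlite3.connect('experiments.db')
   conn.execute('''
       CREATE TABLE IF NOT EXISTS experiments (
           experiment_id INTEGER PRIMARY KEY,
           model_name TEXT,
           hyperparameters TEXT,
           accuracy REAL,
           training_time REAL,
           timestamp TEXT
       )
   ''')

   # Placeholder values standing in for a real training run
   hyperparameters = {"learning_rate": 0.001, "batch_size": 64}
   start = time.time()
   accuracy = 0.93  # placeholder for the evaluated metric

   # Store hyperparameters as JSON text so they can be queried later
   conn.execute(
       "INSERT INTO experiments (model_name, hyperparameters, accuracy, "
       "training_time, timestamp) VALUES (?, ?, ?, ?, datetime('now'))",
       ("baseline-mlp", json.dumps(hyperparameters), accuracy, time.time() - start),
   )
   conn.commit()
   conn.close()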
Available versions
------------------

To view available ``sqlite`` versions:

.. code-block:: bash

   module avail sqlite

Build recipes and configuration details are maintained in our GitLab repository:

* `Build Recipes `_

Build optimizations
-------------------

Our SQLite installations are optimised for maximum performance on Discoverer's hardware.
We use the recent :doc:`llvm` compilers, which are the default compilers on the Discoverer
Petascale Supercomputer, to build the SQLite library code.

Compression Support:

SQLite uses zlib for compression of database pages and backup operations. Our build is
linked against the SIMD-accelerated zlib implementation from the :doc:`zlib-ng` package
(specifically the ``zlib-ng-compat`` module), which provides significant performance
improvements for compression and decompression operations compared to the standard
system-wide zlib.

.. note::

   The SQLite build requires the ``zlib-ng-compat`` module to be loaded during
   compilation. If this module is not available, SQLite falls back to the system-wide
   zlib, which lacks SIMD acceleration and may result in reduced performance for
   compression operations.

Compiler optimizations:

- **Link Time Optimization (LTO)**: Full LTO (``-flto=full``) is enabled for both
  compilation and linking, allowing cross-module optimisations that significantly improve
  performance.
- **CPU-Specific Optimizations**:

  - ``-march=native``: Optimises for the native CPU architecture, enabling all available
    instruction sets
  - ``-mtune=native``: Tunes the generated code specifically for the target CPU
  - ``-O3``: Maximum optimisation level for best performance

- **Position Independent Code**: ``-fPIC`` is used to enable shared library support.
- **Debug Symbols**: The ``-g`` flag is included for BOLT optimisation compatibility and
  debugging support.

Linker optimizations:

- **LLD Linker**: We use LLVM's LLD linker for faster linking and better optimisation
  support.
- **LTO at Link Time**: ``-flto=full -Wl,--lto-O3`` enables full link-time optimisation
  with optimisation level 3, allowing the linker to perform whole-program optimisations.

Build configuration:

- **Release Build**: All optimisations are enabled for production use.
- **Full-Text Search**: The FTS3 extension is enabled (``-DSQLITE_ENABLE_FTS3``) for text
  search capabilities.
- **Readline Support**: Command-line history and editing are enabled for the sqlite3 tool.
- **Dual Library Builds**: Both shared (``.so``) and static (``.a``) libraries are built
  and installed, providing flexibility for different use cases.
- **LLVM Toolchain**: The complete LLVM toolchain (llvm-ar, llvm-ranlib, llvm-nm,
  llvm-strip) is used instead of GNU binutils for better LTO support.
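A quick way to confirm these build options is to query the library's compile-time options
at run time: ``PRAGMA compile_options`` reports them with the ``SQLITE_`` prefix stripped.
A minimal sketch, assuming the ``sqlite`` module is loaded so that Python picks up this
build (see *Using with Python* below):

.. code-block:: python

   import sqlite3

   conn = sqlite3.connect(':memory:')
   options = [row[0] for row in conn.execute('PRAGMA compile_options')]
   conn.close()

   # ENABLE_FTS3 should be present in this build
   print('FTS3 enabled:', 'ENABLE_FTS3' in options)
   print('All options:', options)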
BOLT Binary Optimisation
------------------------

The ``sqlite3`` command-line tool is further optimised using BOLT (Binary Optimisation
and Layout Tool), a post-link optimiser that improves application performance by
optimising code layout based on execution profiles.

BOLT Optimisation Process:

1. **Profile Collection**: During the build process, a representative SQL workload is
   executed and profiled using ``perf`` to collect execution patterns.
2. **Profile Conversion**: The collected profile data is converted to BOLT format using
   ``perf2bolt``.
3. **Binary Optimisation**: The ``sqlite3`` binary is optimised using ``llvm-bolt`` with:

   - Block reordering using the Extended TSP algorithm for better cache utilisation
   - Function splitting to separate hot and cold code paths
   - Profile-guided layout optimisation

Benefits:

- **Performance**: 5-20% improvement in sqlite3 execution speed
- **Cache Efficiency**: Better instruction cache utilisation through code reordering
- **Branch Prediction**: Improved branch prediction through hot/cold path separation

.. note::

   BOLT optimisation is applied automatically during the build process. If the BOLT tools
   are not available, the build continues with the standard optimised binary.

These optimisations ensure that our SQLite installation provides the fastest possible
database performance for CPU-based applications on Discoverer, while maintaining full
compatibility with the standard SQLite API.

Available libraries
-------------------

SQLite provides the libsqlite3 shared library, which is installed by default:

``libsqlite3.so`` - SQLite database engine library

This library implements the SQLite database engine, providing a self-contained,
serverless SQL database with ACID transactions.

- **Header file**: ``sqlite3.h``
- **Link flag**: ``-lsqlite3``
- **pkg-config**: ``sqlite3``

.. note::

   The library uses optimised implementations and can be used in both C and C++
   applications. It is particularly effective for embedded databases, application data
   storage, and lightweight database needs.

Library variants
----------------

The libsqlite3 library is available as both a static (``.a``) and a shared (``.so``)
library. The Environment Modules automatically configure the appropriate paths for
dynamic linking, which is the recommended approach for HPC environments.

Shared libraries (recommended):

- ``libsqlite3.so`` is used by default
- Automatically configured when loading the module
- Recommended for HPC environments

Static libraries:

- ``libsqlite3.a`` is also available
- Use it only if your application specifically requires static linking
- Requires the explicit ``-static`` flag during linking

Linking your application
------------------------

After loading the ``sqlite`` module, the environment variables are configured
automatically. You can link your application using one of the following methods:

Method 1: Using environment variables (recommended)

.. code-block:: bash

   # Load the module first
   module load sqlite/3/

   # Link against libsqlite3 - C code
   gcc -o myapp myapp.c $CFLAGS $LDFLAGS -lsqlite3
   clang -o myapp myapp.c $CFLAGS $LDFLAGS -lsqlite3

   # Link against libsqlite3 - C++ code
   g++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -lsqlite3
   clang++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -lsqlite3

Method 2: Using pkg-config

.. code-block:: bash

   # Load the module first
   module load sqlite/3/

   # Link against libsqlite3 - C code
   gcc -o myapp myapp.c $(pkg-config --cflags --libs sqlite3)
   clang -o myapp myapp.c $(pkg-config --cflags --libs sqlite3)

   # Link against libsqlite3 - C++ code
   g++ -o myapp myapp.cpp $(pkg-config --cflags --libs sqlite3)
   clang++ -o myapp myapp.cpp $(pkg-config --cflags --libs sqlite3)

Method 3: Manual linking

.. code-block:: bash

   # Load the module first
   module load sqlite/3/

   # Link against libsqlite3 - C code
   gcc -o myapp myapp.c -I$SQLITE_ROOT/include -L$SQLITE_ROOT/lib64 -lsqlite3
   clang -o myapp myapp.c -I$SQLITE_ROOT/include -L$SQLITE_ROOT/lib64 -lsqlite3

   # Link against libsqlite3 - C++ code
   g++ -o myapp myapp.cpp -I$SQLITE_ROOT/include -L$SQLITE_ROOT/lib64 -lsqlite3
   clang++ -o myapp myapp.cpp -I$SQLITE_ROOT/include -L$SQLITE_ROOT/lib64 -lsqlite3

Static linking (if required):

If your application specifically requires static linking:

.. code-block:: bash

   # C code
   gcc -o myapp myapp.c $CFLAGS $LDFLAGS -lsqlite3 -static
   clang -o myapp myapp.c $CFLAGS $LDFLAGS -lsqlite3 -static

   # C++ code
   g++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -lsqlite3 -static
   clang++ -o myapp myapp.cpp $CXXFLAGS $LDFLAGS -lsqlite3 -static

.. note::

   The Environment Modules automatically set ``CFLAGS``, ``CXXFLAGS``, and ``LDFLAGS``
   when you load the module. Using these variables is the recommended approach, as they
   remain correct even if the module path changes.
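Before moving on to Python, a quick way to verify that the shared library resolves
correctly, without compiling anything, is to load it directly via ``ctypes``. This is a
minimal sketch; it assumes the ``sqlite`` module is loaded so that ``libsqlite3.so`` can
be found through ``LD_LIBRARY_PATH``:

.. code-block:: python

   import ctypes

   # Resolved via LD_LIBRARY_PATH, which the Environment Module sets
   lib = ctypes.CDLL('libsqlite3.so')

   # sqlite3_libversion() returns the library's version string (const char *)
   lib.sqlite3_libversion.restype = ctypes.c_char_p
   print('libsqlite3 version:', lib.sqlite3_libversion().decode())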
Using with Python
-----------------

Python's built-in ``sqlite3`` module can use our optimised SQLite library if that library
is on the library search path when Python starts. Python must be compiled with SQLite
support (which is standard in most Python distributions).

How It Works:

Python's ``sqlite3`` module dynamically loads the SQLite library at runtime. If the
optimised SQLite library from our installation is in ``LD_LIBRARY_PATH`` when Python
starts, Python uses that library. Otherwise, Python falls back to the system-wide SQLite
library.

Method 1: Load SQLite Module Before Python (Recommended)

Load the SQLite module before starting Python:

.. code-block:: bash

   # Load SQLite module first
   module load sqlite/3/

   # Start Python - it will automatically use our optimised SQLite
   python3

   # In Python, verify the SQLite version
   import sqlite3
   print(sqlite3.sqlite_version_info)
   # Should show: (3, 51, 1) or your installed version

Method 2: Set LD_LIBRARY_PATH Explicitly

If you need more control, set ``LD_LIBRARY_PATH`` explicitly:

.. code-block:: bash

   # Load SQLite module
   module load sqlite/3/

   # Verify the library path is set
   echo $LD_LIBRARY_PATH

   # Start Python
   python3

Verifying SQLite Version and Library Path in Python:

After importing the ``sqlite3`` module, check which SQLite version Python is using and
the path to the loaded shared library:

.. code-block:: python

   import sqlite3
   import os
   import subprocess

   # Get version as tuple
   print("SQLite version:", sqlite3.sqlite_version_info)
   # Output: (3, 51, 1) for version 3.51.1

   # Get version as string
   print("SQLite version string:", sqlite3.sqlite_version)
   # Output: '3.51.1'

   # Find the path to the loaded SQLite shared library.
   # Method 1: Check /proc/self/maps (Linux) - shows the libraries actually loaded
   if os.path.exists('/proc/self/maps'):
       with open('/proc/self/maps', 'r') as f:
           for line in f:
               if 'libsqlite3.so' in line:
                   # Extract the path (the last field on the line)
                   parts = line.strip().split()
                   if len(parts) >= 6:
                       lib_path = parts[-1]
                       print("SQLite library path:", lib_path)
                       # Verify it comes from our installation
                       if '/opt/software/sqlite' in lib_path:
                           print("  -> Using optimised SQLite from our installation")
                       else:
                           print("  -> Using system SQLite (not optimised)")
                   break

   # Method 2: Use ldd on the _sqlite3 extension module. Note that running ldd on
   # the python3 binary itself will not list libsqlite3, because the extension is
   # loaded at runtime rather than linked into the interpreter.
   import _sqlite3
   try:
       result = subprocess.run(['ldd', _sqlite3.__file__],
                               capture_output=True, text=True, timeout=5)
       for line in result.stdout.split('\n'):
           if 'libsqlite3.so' in line and '=>' in line:
               # Extract the path after '=>'
               path = line.split('=>')[1].strip().split()[0]
               print("SQLite library path (from ldd):", path)
               break
   except (FileNotFoundError, subprocess.TimeoutExpired):
       pass

   # Get compile-time options
   conn = sqlite3.connect(':memory:')
   cursor = conn.cursor()
   cursor.execute("PRAGMA compile_options")
   options = [row[0] for row in cursor.fetchall()]
   print("Compile options:", options)

Example: Using Optimised SQLite in Python Scripts

.. code-block:: python

   #!/usr/bin/env python3
   import sqlite3

   # Verify we're using the optimised version
   if sqlite3.sqlite_version_info < (3, 51, 0):
       print("Warning: Not using optimised SQLite version")

   # Use SQLite as normal
   conn = sqlite3.connect('mydatabase.db')
   cursor = conn.cursor()

   # Create table
   cursor.execute('''
       CREATE TABLE IF NOT EXISTS data (
           id INTEGER PRIMARY KEY,
           value REAL
       )
   ''')

   # Insert data
   cursor.execute('INSERT INTO data (value) VALUES (?)', (42.0,))
   conn.commit()

   # Query data
   cursor.execute('SELECT * FROM data')
   print(cursor.fetchall())

   conn.close()

SLURM Job Example:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=python_sqlite
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=1
   #SBATCH --time=01:00:00

   # Load modules
   module load sqlite/3/
   module load python/

   # Python will use our optimised SQLite (SQLite is single-threaded)
   srun --cpu-bind=cores python3 my_script.py
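For bulk loads in jobs like the one above, wrapping the inserts in a single transaction
with ``executemany()`` usually pays off, because SQLite otherwise commits each ``INSERT``
in its own implicit transaction. A minimal sketch; the table and data are illustrative:

.. code-block:: python

   import sqlite3

   rows = [(float(i),) for i in range(100_000)]  # illustrative data

   conn = sqlite3.connect('mydatabase.db')
   conn.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, value REAL)')

   # One transaction for the whole batch instead of one commit per row;
   # the context manager commits on success and rolls back on error
   with conn:
       conn.executemany('INSERT INTO data (value) VALUES (?)', rows)

   conn.close()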
Troubleshooting:

Issue: Python still uses the system SQLite

- **Solution:** Ensure the SQLite module is loaded before starting Python
- **Solution:** Verify that ``LD_LIBRARY_PATH`` includes the SQLite library path:
  ``echo $LD_LIBRARY_PATH | grep sqlite``
- **Solution:** Check which SQLite library is actually loaded using:
  ``python3 -c "import sqlite3; [print(p.split()[-1]) for p in open('/proc/self/maps') if 'libsqlite3.so' in p]"``

Issue: Python not compiled with SQLite support

- **Solution:** Use a Python installation that includes SQLite support (standard in most
  distributions)
- **Solution:** Recompile Python with SQLite support if you are using a custom build

.. note::

   The optimised SQLite library provides better performance for database operations,
   especially for large datasets and complex queries. Always verify the version using
   ``sqlite3.sqlite_version_info`` to ensure you're using the optimised library.

Loading external math functions
-------------------------------

SQLite does not include mathematical functions (such as ``sin()``, ``cos()``, ``sqrt()``,
``log()``, etc.) by default. These functions must be loaded as extensions. The SQLite
build on Discoverer has extension loading enabled, allowing you to add mathematical
functions to your SQLite sessions.

Prerequisites:

- SQLite module loaded: ``module load sqlite/3/``
- Extension loading enabled (the default in this build)
- Math extension library compiled and placed in your project or home directory

.. note::

   Extensions should be placed in user-writable locations (your home directory or project
   folders), not in the system directories, which are read-only.

Method 1: Using the ``.load`` command (recommended)

The simplest way to load math functions is using SQLite's ``.load`` command in the
sqlite3 shell:

.. code-block:: bash

   # Load the module first
   module load sqlite/3/

   # Start sqlite3
   sqlite3 mydatabase.db

   # In the sqlite3 shell, load the math extension from your project directory
   .load ~/myproject/lib/libsqlite_math.so

   # Or use a relative path:
   # .load ./lib/libsqlite_math.so

   # Verify that the math functions are available
   SELECT sin(1.0), cos(1.0), sqrt(4.0);

Method 2: Using the ``load_extension()`` SQL function

If extension loading is enabled, you can use the ``load_extension()`` SQL function:

.. code-block:: sql

   -- Load the extension from your project directory
   SELECT load_extension('~/myproject/lib/libsqlite_math.so');

   -- Use math functions
   SELECT sin(PI()/2), cos(0), sqrt(16), log(10);

Method 3: Loading at startup (automated)

To load math functions automatically when starting sqlite3, create a ``.sqliterc``
configuration file in your home directory:

.. code-block:: bash

   # Create ~/.sqliterc
   cat > ~/.sqliterc << 'EOF'
   .load ~/myproject/lib/libsqlite_math.so
   EOF

Math functions will now be available automatically whenever you start sqlite3.

Example: 3D geometric queries

This example demonstrates how to use math functions for geometric queries on 3D vectors.
We will find all vectors that lie inside the bounding box [0,100] × [0,100] × [0,100] and
inside a sphere centred at (50, 50, 50) with radius 14:

.. code-block:: sql

   -- Load math extension
   .load ~/myproject/lib/libsqlite_math.so

   -- Create table for 3D vectors
   CREATE TABLE vectors (
       id INTEGER PRIMARY KEY,
       x REAL,
       y REAL,
       z REAL,
       description TEXT
   );

   -- Insert sample 3D vectors
   INSERT INTO vectors (x, y, z, description) VALUES
       (45, 50, 55, 'Near sphere centre'),
       (50, 50, 50, 'At sphere centre'),
       (60, 50, 50, 'Inside sphere'),
       (65, 50, 50, 'Outside sphere'),
       (10, 20, 30, 'Inside box, far from sphere'),
       (50, 64, 50, 'On sphere boundary'),
       (25, 25, 25, 'Inside box, outside sphere');

   -- Select vectors inside both box and sphere
   SELECT
       id, x, y, z, description,
       sqrt(pow(x - 50, 2) + pow(y - 50, 2) + pow(z - 50, 2)) AS distance_from_centre
   FROM vectors
   WHERE
       -- Inside bounding box [0,100] × [0,100] × [0,100]
       x >= 0 AND x <= 100 AND
       y >= 0 AND y <= 100 AND
       z >= 0 AND z <= 100
       AND
       -- Inside sphere: distance from centre (50,50,50) <= radius (14)
       sqrt(pow(x - 50, 2) + pow(y - 50, 2) + pow(z - 50, 2)) <= 14;
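If you query from Python rather than from the sqlite3 shell, a compiled extension is not
strictly required: the standard ``sqlite3`` module can register Python callables as SQL
functions through ``Connection.create_function()``. Below is a minimal sketch that
reproduces the distance computation above; note that Python-level functions are
considerably slower than a native extension:

.. code-block:: python

   import math
   import sqlite3

   conn = sqlite3.connect(':memory:')

   # Register Python's math functions under the names used in the SQL above
   conn.create_function('sqrt', 1, math.sqrt)
   conn.create_function('pow', 2, math.pow)

   conn.execute('CREATE TABLE vectors (x REAL, y REAL, z REAL)')
   conn.execute('INSERT INTO vectors VALUES (45, 50, 55)')

   for row in conn.execute(
           'SELECT sqrt(pow(x - 50, 2) + pow(y - 50, 2) + pow(z - 50, 2)) '
           'FROM vectors'):
       print('distance from centre:', row[0])

   conn.close()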
Available math functions (typical extension):

Once loaded, a typical math extension provides:

- **Trigonometric**: ``sin(x)``, ``cos(x)``, ``tan(x)``, ``asin(x)``, ``acos(x)``,
  ``atan(x)``, ``atan2(y, x)``
- **Hyperbolic**: ``sinh(x)``, ``cosh(x)``, ``tanh(x)``
- **Exponential and Logarithmic**: ``exp(x)``, ``log(x)``, ``log10(x)``, ``log2(x)``
- **Power Functions**: ``pow(x, y)``, ``sqrt(x)``, ``cbrt(x)``
- **Rounding**: ``ceil(x)``, ``floor(x)``, ``round(x)``, ``trunc(x)``
- **Other**: ``abs(x)``, ``sign(x)``, ``degrees(x)``, ``radians(x)``, ``pi()``

Extension location recommendations:

- Place extensions in your project directory: ``~/myproject/lib/`` or ``$PROJECT/lib/``
- Use relative paths when working within a project: ``./lib/libsqlite_math.so``
- Store extensions in version control together with your project code
- Use environment variables for flexibility: ``$SQLITE_MATH_EXT``

Command-line utilities
----------------------

SQLite provides the ``sqlite3`` command-line tool for interactive database management.
After loading the ``sqlite`` module, this utility is available in your ``PATH``.

Main tool:

``sqlite3`` - SQLite command-line interface

The primary utility for creating, managing, and querying SQLite databases.

- Interactive SQL shell with command history (readline support)
- Batch mode for executing SQL scripts
- Supports all SQLite features, including FTS3 full-text search
- Optimised with BOLT for maximum performance

Example usage:

.. code-block:: bash

   # Load the module
   module load sqlite/3/

   # Create a new database
   sqlite3 mydatabase.db

   # In the sqlite3 shell:
   CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
   INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
   SELECT * FROM users;
   .quit

   # Execute SQL from a file
   sqlite3 mydatabase.db < script.sql

   # Execute a single command
   sqlite3 mydatabase.db "SELECT COUNT(*) FROM users;"

.. warning::

   When processing large databases or complex queries, use Slurm batch jobs to execute
   sqlite3 on the compute nodes rather than on the login nodes.
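The same batch mode is available from Python: ``Connection.executescript()`` executes a
multi-statement SQL string, which is a convenient equivalent of
``sqlite3 mydatabase.db < script.sql``. A minimal sketch (the file name is illustrative):

.. code-block:: python

   import sqlite3

   conn = sqlite3.connect('mydatabase.db')

   # Run a multi-statement SQL file, like `sqlite3 mydatabase.db < script.sql`
   with open('script.sql') as f:
       conn.executescript(f.read())

   conn.commit()
   conn.close()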
Using SQLite in SLURM Jobs
--------------------------

SQLite is single-threaded by design. For extensive data processing with SQLite, it is
therefore important to bind SQLite processes to specific CPU cores, to avoid process
migration and improve cache locality. This applies both to the ``sqlite3`` command-line
tool and to applications that use the SQLite library directly.

CPU Binding for sqlite3 Tool:

When using the ``sqlite3`` command-line tool in SLURM jobs, bind the process to a CPU
core using ``--cpu-bind``:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=sqlite_process
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=1
   #SBATCH --time=01:00:00

   module load sqlite/3/

   # Bind to a core for better performance (SQLite is single-threaded)
   srun --cpu-bind=cores sqlite3 mydatabase.db < process_data.sql

CPU Binding for Applications Using SQLite Library:

For applications that link against the SQLite library, use CPU binding in your SLURM job
script. Since SQLite is single-threaded, allocate one CPU per task:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=sqlite_app
   #SBATCH --ntasks=1
   #SBATCH --cpus-per-task=1
   #SBATCH --time=02:00:00

   module load sqlite/3/

   # Compile your application (if needed)
   clang -o myapp myapp.c $CFLAGS $LDFLAGS -lsqlite3

   # Run with CPU binding (SQLite is single-threaded)
   srun --cpu-bind=cores ./myapp

Why CPU Binding Matters:

- **Cache Locality**: Binding the process to a specific core improves L1/L2 cache hit
  rates
- **Reduced Migration**: Prevents process migration between cores, reducing overhead
- **Predictable Performance**: Consistent performance without interference from other
  processes
- **NUMA Awareness**: Better memory access patterns on NUMA systems

Best Practices:

- Use ``--cpu-bind=cores`` to bind to physical cores
- Always use ``--cpus-per-task=1``, since SQLite is single-threaded
- For parallel processing, use multiple tasks (``--ntasks=N``) to run multiple SQLite
  processes in parallel, each bound to its own core

Example: Parallel Processing with Multiple Single-Threaded Processes:

To process multiple databases in parallel, run several single-threaded SQLite processes,
each bound to its own core:

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=sqlite_parallel
   #SBATCH --ntasks=4
   #SBATCH --cpus-per-task=1
   #SBATCH --time=01:00:00

   module load sqlite/3/

   # Process multiple databases in parallel, each as a single-threaded process
   # bound to its own core; SLURM_PROCID selects the database for each task
   srun --cpu-bind=cores --ntasks=4 bash -c 'sqlite3 db_${SLURM_PROCID}.db < process.sql'

Getting help
------------

For additional assistance:

* See the :doc:`help` documentation