Move and copy files and folders
===============================

.. toctree::
   :maxdepth: 1
   :caption: Contents:

.. role:: underline
   :class: underline

About
-----

This document describes how to transfer files and folders to or from the project storage and home folders.

When to use ``mv`` vs ``rsync``
-------------------------------

**Use** ``mv`` **for moves within the same storage location:**

.. code-block:: bash

   # Moving within /home (same storage location)
   mv /home/username/old_folder /home/username/new_location/

   # Moving within /valhalla/projects (same storage location)
   mv /valhalla/projects/project1/data1 /valhalla/projects/project1/processing/

**Alternative: Using** ``rsync`` **for same-storage operations (with progress monitoring):**

.. code-block:: bash

   # Copy with progress monitoring and I/O limiting
   # (the trailing slash on the source copies the contents of data1 into processing/)
   rsync -vrtl --progress --bwlimit=50000 /valhalla/projects/project1/data1/ /valhalla/projects/project1/processing/

   # Move with progress monitoring
   # (removes source files after a successful copy; empty source directories remain)
   rsync -vrtl --progress --remove-source-files /valhalla/projects/project1/data1/ /valhalla/projects/project1/processing/

**Why** ``mv`` **is sufficient for same-storage moves:**

* **Faster operation** - A move within one file system is a rename; no file data is copied
* **Preserves all attributes** - Permissions and timestamps are maintained automatically
* **No bandwidth impact** - Negligible network or storage system load

**Examples of when to use** ``mv`` **:**

.. code-block:: bash

   # Small to medium directory moves (fast and efficient)
   mv /home/username/old_project /home/username/archived/
   mv /valhalla/projects/project1/raw_data /valhalla/projects/project1/processed/

   # Renaming directories
   mv /valhalla/projects/project1/data_backup /valhalla/projects/project1/data_final

   # Organizing files within same storage
   mv /home/username/results_*.txt /home/username/analysis_results/

**Use** ``rsync`` **for cross-storage transfers:**

* **Bandwidth limiting** is needed to protect system performance
* **Resume capability** makes large transfers reliable

**Why** ``rsync`` **is also useful for same-storage operations:**

* **Progress monitoring** - With ``--progress``, shows real-time transfer speed, percentage complete, and ETA
* **I/O throttling** - ``--bwlimit`` limits storage I/O to prevent system overload during large operations, even within the same storage system
* **Resume capability** - Can resume interrupted operations without starting over
* **Verification** - Checks each transferred file against a whole-file checksum
* **Better error handling** - More robust than ``cp`` for large directory structures
* **Better for large datasets** - More efficient than ``cp`` for directories with many files

**When to use** ``rsync`` **vs** ``mv`` **for same-storage:**

* **Use** ``mv`` - For simple, fast moves of small to medium directories
* **Use** ``rsync`` - For large datasets, when you need progress monitoring, or when I/O control is important

**Examples of when to use** ``rsync`` **for same-storage operations:**

.. code-block:: bash

   # Large dataset with progress monitoring
   rsync -vrtl --progress /valhalla/projects/project1/10GB_dataset/ /valhalla/projects/project1/backup/

   # I/O throttling during peak hours
   rsync -vrtl --progress --bwlimit=5000 /valhalla/projects/project1/large_files/ /valhalla/projects/project1/archive/

   # Resume interrupted transfer
   rsync -vrtl --progress --partial /valhalla/projects/project1/interrupted_transfer/ /valhalla/projects/project1/destination/

   # Move with verification (safer than mv for critical data)
   rsync -vrtl --progress --remove-source-files /valhalla/projects/project1/critical_data/ /valhalla/projects/project1/processed/

Why use ``rsync`` instead of ``cp``
-----------------------------------

When transferring files between different storage locations on the Discoverer cluster, it is **essential** to use ``rsync`` rather than ``cp``, for the reasons below.

**File System and Storage Perspective:**

* ``cp`` performs a plain copy: it reads each source file and writes it to the destination
* ``cp`` always copies entire files, even if the destination already holds an identical copy
* ``rsync`` first compares source and destination (by default using file size and modification time) and skips files that are already up to date
* ``rsync`` can transfer only the differences between source and destination files; for copies between local paths it copies changed files whole unless ``--no-whole-file`` is given

**Storage I/O Impact:**

* ``cp`` generates maximum I/O load - it reads the entire source and writes the entire destination on every run
* ``rsync`` reduces I/O on repeated transfers by copying only new or changed files
* ``cp`` has no bandwidth control and can overwhelm network and storage systems during large transfers
* ``rsync`` with ``--bwlimit`` caps the transfer rate and protects storage system performance

**Technical Differences:**

``cp`` **behavior:**

* **Sequential read-write** - Reads each source file in full and writes it to the destination
* **No comparison** - Always performs a full file copy, regardless of what already exists at the destination
* **Block-level copying** - Transfers data in fixed-size blocks without optimization
* **Single-threaded** - One file at a time, no parallel processing

``rsync`` **behavior:**

* **Delta synchronization** - Compares source and destination and transfers only what differs
* **Change detection** - Uses file size and modification time by default; ``--checksum`` compares file checksums instead
* **Block-level differences** - Can transfer only changed blocks within files (requires ``--no-whole-file`` for local paths)
* **Resume capability** - A repeated run skips files that were already transferred

**Practical Example:**

Consider re-synchronizing a 10 GB dataset in which files totalling 100 MB have changed since the previous transfer:

* ``cp`` - Reads 10 GB from the source and writes 10 GB to the destination (20 GB total I/O)
* ``rsync`` - Compares file sizes and modification times, then reads and writes only the 100 MB of changed files (about 0.2 GB total I/O)
* **Storage impact** - ``cp`` generates roughly 100x more storage I/O than ``rsync``
* **Time impact** - ``rsync`` can complete in roughly 1% of the time ``cp`` would take

A first-time transfer moves the full 10 GB with either tool; the savings apply to repeated transfers.

**File Integrity and Efficiency:**

* ``cp`` provides basic file copying without advanced features
* ``rsync`` offers better error handling and progress reporting
* ``rsync`` verifies each transferred file against a whole-file checksum
* ``rsync`` is more efficient for large directory structures

**Symbolic Links:**

* ``cp`` follows symbolic links when copying individual files, and its handling of links in recursive copies depends on the options used (``-L``, ``-P``, ``-a``) and on the implementation
* ``rsync`` with the ``-l`` flag always copies symbolic links as links, maintaining the original structure
* **Preserving symlinks** is important for maintaining software installations and avoiding duplicate files

**Resume Capability:**

* ``cp`` cannot resume interrupted transfers
* ``rsync`` skips files that were already transferred when the same command is run again, and can resume a partially transferred file using the ``--partial`` and ``--append`` options
* **Large file transfers** benefit from resumability on unreliable connections

**Bandwidth Control:**

* ``rsync`` with ``--bwlimit`` prevents overloading the I/O capacity of login nodes
* ``cp`` has no bandwidth limiting, potentially causing system-wide performance issues
* **Storage systems** can be overwhelmed by uncontrolled data transfers
* **Other users** may experience slowdowns if bandwidth is not managed properly

Cross-storage transfers: ``/home/`` ↔ ``/valhalla/projects/``
----------------------------------------------------------------

**Use** ``rsync`` **for cross-storage transfers:**

**Moving files (removes from source):**

.. code-block:: bash

   rsync -vrtl --remove-source-files /home/`whoami`/folder /valhalla/projects//

**Copying files (keeps original):**

.. code-block:: bash

   rsync -vrtl /home/`whoami`/folder /valhalla/projects//

**With bandwidth limiting (recommended for large transfers):**

.. code-block:: bash

   # Limit to about 10 MB/s to avoid overloading storage systems
   rsync -vrtl --bwlimit=10000 /home/`whoami`/large_folder /valhalla/projects//

   # Limit to about 5 MB/s during peak hours
   rsync -vrtl --bwlimit=5000 /home/`whoami`/folder /valhalla/projects//

**Explanation of flags:**

* ``-v`` - Verbose output (lists each file as it is transferred; add ``--progress`` for per-file progress)
* ``-r`` - Recursive (includes subdirectories)
* ``-t`` - Preserve modification times
* ``-l`` - Copy symbolic links as links (crucial for software installations)
* ``--remove-source-files`` - Remove files from the source after a successful transfer (move operation); empty source directories are left in place
* ``--bwlimit=KB/s`` - Limit bandwidth to prevent overloading storage systems (e.g., ``--bwlimit=10000`` for about 10 MB/s)

Common pitfalls and best practices
----------------------------------

**Avoid using** ``cp`` **for these reasons:**

* **Unpredictable symlink handling** - Depending on the options used, ``cp`` may follow symlinks instead of preserving them, breaking software installations
* **No resume capability** - Interrupted transfers cannot be resumed
* **No bandwidth control** - Can overload storage systems during large transfers
* **Limited progress reporting** - No visibility into transfer status

**Examples of** ``cp`` **problems:**

.. code-block:: bash

   # BAD: cp does not reliably preserve symlinks and timestamps (don't do this)
   cp -r /home/username/conda_env /valhalla/projects/project1/
   # Result: modification times are reset, and where links are followed,
   # large files are copied instead of symlinks, breaking the conda environment

   # BAD: cp has no progress monitoring
   cp -r /home/username/5GB_dataset /valhalla/projects/project1/
   # Result: No idea how long it will take or if it's working

   # BAD: cp cannot resume interrupted transfers
   cp -r /home/username/huge_folder /valhalla/projects/project1/
   # If interrupted, must start over from beginning

**Always use** ``rsync`` **with the** ``-l`` **flag:**

* **Preserves symbolic links** - Essential for conda environments, Python packages, and software installations
* **Maintains directory structure** - Keeps the original file system layout intact
* **Prevents duplicate files** - Avoids copying large files that are linked elsewhere

**Examples of proper** ``rsync`` **usage:**

.. code-block:: bash

   # GOOD: Preserves symlinks for conda environments
   rsync -vrtl /home/username/conda_env /valhalla/projects/project1/
   # Result: Symlinks are copied as symlinks

   # GOOD: Progress monitoring for large transfers
   rsync -vrtl --progress /home/username/5GB_dataset /valhalla/projects/project1/
   # Result: Shows transfer speed, percentage complete, ETA

   # GOOD: Resume interrupted transfers
   rsync -vrtl --progress --partial /home/username/huge_folder /valhalla/projects/project1/
   # Result: Can resume from where it left off if interrupted

**Use bandwidth limiting to protect system performance:**

* **Large transfers** - Always use ``--bwlimit`` for transfers > 1 GB
* **Peak hours** - Limit to 5-10 MB/s during business hours (9 AM - 5 PM)
* **Off-peak hours** - Can use higher limits (20-50 MB/s) during nights and weekends
* **Multiple users** - Reduce bandwidth if other users are transferring files simultaneously

**Examples of bandwidth limiting:**

.. code-block:: bash

   # Peak hours (9 AM - 5 PM): Conservative limits
   rsync -vrtl --progress --bwlimit=5000 /home/username/large_dataset /valhalla/projects/project1/
   # About 5 MB/s to avoid impacting other users

   # Off-peak hours (nights/weekends): Higher limits
   rsync -vrtl --progress --bwlimit=20000 /home/username/large_dataset /valhalla/projects/project1/
   # About 20 MB/s when the system is less busy

   # Multiple users transferring: Reduce bandwidth
   rsync -vrtl --progress --bwlimit=2000 /home/username/dataset /valhalla/projects/project1/
   # About 2 MB/s when others are also transferring

**Verification steps:**

* Check ownership after transfer: ``ls -la /destination/path``
* Verify symlinks are preserved: ``ls -la /destination/path | grep "^l"``
* Test that software still works in the new location

**Examples of verification:**

.. code-block:: bash

   # Check file ownership and permissions
   ls -la /valhalla/projects/project1/transferred_folder/
   # Should show the expected ownership and permissions

   # Verify symlinks are preserved (not broken)
   ls -la /valhalla/projects/project1/conda_env/ | grep "^l"
   # Should show symlinks (lines starting with 'l')

   # Test conda environment still works
   source /valhalla/projects/project1/conda_env/bin/activate
   python --version
   # Should work without errors

   # Check file integrity (compare file contents by checksum, change nothing)
   rsync -vrtl --checksum --dry-run /home/username/source/ /valhalla/projects/project1/destination/
   # Should list no files if the transfer was successful

Getting help
------------

See :doc:`help`
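
Example: choosing ``--bwlimit`` by time of day
----------------------------------------------

The bandwidth guidelines under "Common pitfalls and best practices" can be turned into a small shell helper. The following is only a sketch under stated assumptions: the function names ``pick_bwlimit`` and ``throttled_rsync`` are hypothetical, "peak hours" is interpreted as Monday to Friday, 09:00-17:00, and the values 5000 and 20000 are the example limits used in this document, not enforced site policy.

```shell
#!/bin/sh
# Hypothetical helper: print an rsync --bwlimit value (in KB/s) for a given
# hour (00-23) and ISO weekday (1 = Monday ... 7 = Sunday).
pick_bwlimit() {
    hour=${1#0}        # strip one leading zero, e.g. "09" -> "9"
    hour=${hour:-0}    # an input of "0" becomes empty above; restore it
    weekday=$2
    # Assumed peak window: Monday-Friday, 09:00 up to (not including) 17:00
    if [ "$weekday" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]; then
        echo 5000      # conservative limit during peak hours
    else
        echo 20000     # higher limit on nights and weekends
    fi
}

# Hypothetical wrapper: run rsync with the flags used in this document and
# a limit chosen from the current local time. Arguments are passed through.
throttled_rsync() {
    limit=$(pick_bwlimit "$(date +%H)" "$(date +%u)")
    rsync -vrtl --progress --bwlimit="$limit" "$@"
}
```

After sourcing the file, a call such as ``throttled_rsync /home/username/large_dataset /valhalla/projects/project1/`` would apply the lower limit during the assumed peak window and the higher one otherwise. Lower the values further by hand if other users are transferring at the same time.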