Move and copy files and folders¶
About¶
This document describes how to transfer files and folders to or from the project storage and home folders.
When to use mv vs rsync¶
Use mv for moves within the same storage location:
# Moving within /home (same storage location)
mv /home/username/old_folder /home/username/new_location/
# Moving within /valhalla/projects (same storage location)
mv /valhalla/projects/project1/data1 /valhalla/projects/project1/processing/
Alternative: Using rsync for same-storage operations (with progress monitoring):
# Copy with progress monitoring and I/O limiting
rsync -vrtl --progress --bwlimit=50000 /valhalla/projects/project1/data1/ /valhalla/projects/project1/processing/
# Move with progress monitoring (removes source files after a successful copy; empty source directories remain)
rsync -vrtl --progress --remove-source-files /valhalla/projects/project1/data1/ /valhalla/projects/project1/processing/
Why mv is sufficient for same-storage moves:
- Faster operation - A rename within one file system; file data is not copied
- Preserves all attributes - Permissions and timestamps maintained automatically
- No bandwidth impact - Negligible network or storage system load
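A move is only a cheap rename when source and destination sit on the same file system. The following is a minimal sketch for checking this, assuming GNU `stat` (as found on Linux); the helper name `same_filesystem` and the throwaway demo directory are illustrative and not part of the cluster setup.

```shell
# Report whether two existing paths live on the same file system by
# comparing their device numbers (GNU stat syntax).
same_filesystem() {
    [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Demo in a throwaway directory: the inode number is unchanged by mv,
# which shows that the file was renamed rather than copied.
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dst"
echo "sample" > "$demo/src/file.txt"
inode_before=$(stat -c %i "$demo/src/file.txt")
mv "$demo/src/file.txt" "$demo/dst/"
inode_after=$(stat -c %i "$demo/dst/file.txt")

if same_filesystem "$demo/src" "$demo/dst" && [ "$inode_before" = "$inode_after" ]; then
    echo "same file system: mv was a rename"
fi
```

If `same_filesystem` reports a difference for a source and destination pair, `mv` falls back to copying and deleting, and the rsync guidance for cross-storage transfers applies instead.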
Examples of when to use mv :
# Small to medium directory moves (fast and efficient)
mv /home/username/old_project /home/username/archived/
mv /valhalla/projects/project1/raw_data /valhalla/projects/project1/processed/
# Renaming directories
mv /valhalla/projects/project1/data_backup /valhalla/projects/project1/data_final
# Organizing files within same storage
mv /home/username/results_*.txt /home/username/analysis_results/
Use rsync for cross-storage transfers:
- Bandwidth limiting needed to protect system performance
- Resume capability for reliable large transfers
Why rsync is also useful for same-storage operations:
- Progress monitoring - rsync provides real-time progress reporting
- I/O control - Can limit I/O overhead even within the same storage system
- Resume capability - Can resume interrupted operations within same storage
- Better error handling - More robust than cp for large directory structures
Benefits of rsync for same-storage operations:
- Progress visibility - See real-time transfer speed, percentage complete, and ETA
- I/O throttling - Limit storage I/O to prevent system overload during large operations
- Resume capability - Can resume interrupted operations without starting over
- Verification - Ensures data integrity during transfer
- Better for large datasets - More efficient than cp for directories with many files
When to use rsync vs mv for same-storage:
- Use mv - For simple, fast moves of small to medium directories
- Use rsync - For large datasets, when you need progress monitoring, or when I/O control is important
Examples of when to use rsync for same-storage operations:
# Large dataset with progress monitoring
rsync -vrtl --progress /valhalla/projects/project1/10GB_dataset/ /valhalla/projects/project1/backup/
# I/O throttling during peak hours
rsync -vrtl --progress --bwlimit=5000 /valhalla/projects/project1/large_files/ /valhalla/projects/project1/archive/
# Resume interrupted transfer
rsync -vrtl --progress --partial /valhalla/projects/project1/interrupted_transfer/ /valhalla/projects/project1/destination/
# Move with verification (safer than mv for critical data)
rsync -vrtl --progress --remove-source-files /valhalla/projects/project1/critical_data/ /valhalla/projects/project1/processed/
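Note that `--remove-source-files` deletes the transferred files but leaves the (now empty) source directories in place. A minimal sketch for cleaning those up afterwards, assuming GNU `find` (for `-empty` and `-delete`); the helper name `prune_empty_dirs` and the demo tree are hypothetical.

```shell
# Remove empty directories left under a source tree after its files have
# been moved away. Directories that still contain files are kept.
prune_empty_dirs() {
    find "$1" -depth -type d -empty -delete
}

# Demo: simulate a source tree where one branch has been fully emptied
# and another still holds a file that was not transferred.
demo=$(mktemp -d)
mkdir -p "$demo/source/run1/logs" "$demo/source/run2"
echo "still here" > "$demo/source/run2/keep.txt"

prune_empty_dirs "$demo/source"
find "$demo/source" | sort
```

Running the cleanup only after confirming that the transfer completed keeps any file that rsync did not move, since non-empty directories are never touched.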
Why use rsync instead of cp¶
When transferring files between different storage locations on the Discoverer cluster, it is essential to use rsync rather than cp for several critical reasons:
File System and Storage Perspective:
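- cp performs simple file system operations - reads source file, writes to destination
- rsync compares source and destination first to minimize data transfer and storage I/O
- cp always copies entire files, even if only small parts have changed
- rsync skips files that are already up to date and can transfer only the differences within a file (for two local paths it defaults to --whole-file, so changed files are copied whole unless --no-whole-file is given)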
Storage I/O Impact:
- cp generates maximum I/O load - reads entire source, writes entire destination
- rsync minimizes I/O by comparing files first and transferring only changes
- cp can overwhelm storage systems during large transfers
- rsync with bandwidth limiting protects storage system performance
Technical Differences:
cp behavior:
- Sequential read-write - Reads the source and writes the destination chunk by chunk, from start to end
- No comparison - Always performs full file copy regardless of existing destination
- Block-level copying - Transfers data in fixed-size blocks without optimization
- Single-threaded - One file at a time, no parallel processing
rsync behavior:
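- Delta synchronization - Compares source and destination and skips files that are already up to date
- Quick check - By default compares file size and modification time; add -c (--checksum) to compare file contents by checksum instead
- Block-level differences - Can transfer only changed blocks within files; for two local paths rsync defaults to --whole-file, so this applies mainly to transfers over a network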
- Resume capability - Can resume interrupted transfers from the point of failure
Storage System Impact:
- cp - High I/O load, can cause storage system bottlenecks
- rsync - Optimized I/O, reduces storage system stress
- cp - No bandwidth control, can overwhelm network and storage
- rsync - Configurable bandwidth limits protect system resources
Practical Example:
Consider transferring a 10 GB dataset where only 100 MB has changed:
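- cp - Reads 10 GB from source, writes 10 GB to destination (20 GB total I/O)
- rsync - With the default quick check (file size and modification time), skips unchanged files without reading their contents; if the 100 MB of changes are confined to whole files, it reads about 100 MB and writes about 100 MB (about 0.2 GB total I/O)
- Storage impact - cp uses roughly 100x more storage bandwidth than rsync in this scenario
- Time impact - rsync completes in roughly 1% of the time cp would take, plus the time needed to scan the directory tree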
File Integrity:
- cp provides basic file copying without advanced features
- rsync offers better error handling and progress reporting
- Data verification ensures files are transferred correctly
Symbolic Links:
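- cp follows symbolic links unless told otherwise: a plain cp copies the target file instead of the link, and a recursive copy is only guaranteed to keep links with -P or -a (GNU cp -r keeps them by default, other implementations may differ)
- rsync with the -l flag preserves symbolic links as links, maintaining the original structure
- Preserving symlinks is important for maintaining software installations and avoiding duplicate files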
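How a copy treats symbolic links depends on the flags in effect, so it can help to test on the system at hand. The sketch below contrasts the two explicit modes, `cp -RP` (keep links) and `cp -RL` (follow links), in a throwaway directory; the directory and file names are made up for the demonstration.

```shell
# Build a tiny tree containing one regular file and one symbolic link to it.
demo=$(mktemp -d)
mkdir "$demo/env"
echo "payload" > "$demo/env/real.txt"
ln -s real.txt "$demo/env/link.txt"

# -P copies links as links; -L follows them and copies the target's content.
cp -RP "$demo/env" "$demo/preserved"
cp -RL "$demo/env" "$demo/followed"

# Report what each copy ended up with.
for copy in preserved followed; do
    if [ -L "$demo/$copy/link.txt" ]; then
        echo "$copy: link.txt is still a symbolic link"
    else
        echo "$copy: link.txt became a regular file"
    fi
done
```

The `followed` copy holds two full copies of the same data, which is the duplication that `rsync -l` avoids.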
Resume Capability:
- cp cannot resume interrupted transfers
- rsync can resume interrupted transfers using --partial and --append options
- Large file transfers benefit from resumability on unreliable connections
Integrity and Efficiency:
- rsync provides better error handling and progress reporting
- rsync can verify file integrity during transfer
- rsync is more efficient for large directory structures
Bandwidth Control:
- rsync with --bwlimit prevents overloading the I/O capacity of login nodes
- cp has no bandwidth limiting, potentially causing system-wide performance issues
- Storage systems can be overwhelmed by uncontrolled data transfers
- Other users may experience slowdowns if bandwidth is not managed properly
Cross-storage transfers: /home/<username> ↔ /valhalla/projects/<your_slurm_project_account_name>¶
Use rsync for cross-storage transfers:
Moving files (removes from source):
rsync -vrtl --remove-source-files /home/`whoami`/folder /valhalla/projects/<your_slurm_project_account_name>/
Copying files (keeps original):
rsync -vrtl /home/`whoami`/folder /valhalla/projects/<your_slurm_project_account_name>/
With bandwidth limiting (recommended for large transfers):
# Limit to 10 MB/s to avoid overloading storage systems
rsync -vrtl --bwlimit=10000 /home/`whoami`/large_folder /valhalla/projects/<your_slurm_project_account_name>/
# Limit to 5 MB/s during peak hours
rsync -vrtl --bwlimit=5000 /home/`whoami`/folder /valhalla/projects/<your_slurm_project_account_name>/
Explanation of flags:
* -v - Verbose output (lists each file as it is transferred; add --progress for per-file progress)
* -r - Recursive (includes subdirectories)
* -t - Preserve modification times
* -l - Preserve symbolic links (crucial for software installations)
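* --remove-source-files - Remove files from source after successful transfer (move operation); empty source directories are left in place
* --bwlimit=RATE - Limit bandwidth to prevent overloading storage systems; the value is in units of 1024 bytes per second unless a suffix is given (e.g., --bwlimit=10000 for roughly 10 MB/s)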
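When choosing a limit, it helps to estimate how long the transfer will then take. A rough sketch, assuming the whole data set has to be copied and the limit is the bottleneck; the function name `eta_seconds` is illustrative only.

```shell
# Estimate how long a transfer takes at a given --bwlimit value.
# Arguments: size in MiB, limit in the units rsync uses by default (KiB/s).
eta_seconds() {
    size_mib=$1
    limit_kib_per_s=$2
    echo $(( size_mib * 1024 / limit_kib_per_s ))
}

# Example: a 10 GiB folder (10240 MiB) at --bwlimit=10000
seconds=$(eta_seconds 10240 10000)
echo "about $(( seconds / 60 )) minutes"
```

Halving the limit to 5000 doubles the estimate, which is worth weighing against session timeouts on the login node.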
Common pitfalls and best practices¶
Avoid using cp for these reasons:
- Broken symlinks - Unless run with -P or -a, cp may follow symlinks instead of preserving them, breaking software installations
- No resume capability - Interrupted transfers cannot be resumed
- No bandwidth control - Can overload storage systems during large transfers
- Limited progress reporting - No visibility into transfer status
Examples of cp problems:
# BAD: cp may break symlinks and resets modification times (don't do this)
cp -r /home/username/conda_env /valhalla/projects/project1/
# Result: Where cp -r follows links, large files are copied instead of symlinks, breaking conda environment
# BAD: cp has no progress monitoring
cp -r /home/username/5GB_dataset /valhalla/projects/project1/
# Result: No idea how long it will take or if it's working
# BAD: cp cannot resume interrupted transfers
cp -r /home/username/huge_folder /valhalla/projects/project1/
# If interrupted, must start over from beginning
Always use rsync with the -l flag:
- Preserves symbolic links - Essential for conda environments, Python packages, and software installations
- Maintains directory structure - Keeps the original file system layout intact
- Prevents duplicate files - Avoids copying large files that are linked elsewhere
Examples of proper rsync usage:
# GOOD: Preserves symlinks for conda environments
rsync -vrtl /home/username/conda_env /valhalla/projects/project1/
# Result: Symlinks preserved, conda environment works correctly
# GOOD: Progress monitoring for large transfers
rsync -vrtl --progress /home/username/5GB_dataset /valhalla/projects/project1/
# Result: Shows transfer speed, percentage complete, ETA
# GOOD: Resume interrupted transfers
rsync -vrtl --progress --partial /home/username/huge_folder /valhalla/projects/project1/
# Result: Can resume from where it left off if interrupted
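Resuming in practice means running the same rsync command again after an interruption. The following is a small sketch of a retry wrapper that automates this; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary example values.

```shell
# Re-run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Intended use (not executed here), with the flags from the example above:
#   retry 5 60 rsync -vrtl --progress --partial /home/username/huge_folder /valhalla/projects/project1/
```

Because each attempt repeats the identical command, files that already arrived are skipped and `--partial` keeps partially transferred files instead of deleting them.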
Use bandwidth limiting to protect system performance:
- Large transfers - Always use --bwlimit for transfers > 1 GB
- Peak hours - Limit to 5-10 MB/s during business hours (9 AM - 5 PM)
- Off-peak hours - Can use higher limits (20-50 MB/s) during nights and weekends
- Multiple users - Reduce bandwidth if other users are transferring files simultaneously
Examples of bandwidth limiting:
# Peak hours (9 AM - 5 PM): Conservative limits
rsync -vrtl --progress --bwlimit=5000 /home/username/large_dataset /valhalla/projects/project1/
# 5 MB/s limit to avoid impacting other users
# Off-peak hours (nights/weekends): Higher limits
rsync -vrtl --progress --bwlimit=20000 /home/username/large_dataset /valhalla/projects/project1/
# 20 MB/s limit when system is less busy
# Multiple users transferring: Reduce bandwidth
rsync -vrtl --progress --bwlimit=2000 /home/username/dataset /valhalla/projects/project1/
# 2 MB/s limit when others are also transferring
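The time-of-day guidance can be turned into a small helper so that scripts pick a limit automatically. This is only a sketch: the function name `pick_bwlimit` is made up, the 5000 and 20000 values mirror the examples above, and treating Saturday and Sunday as off-peak is an assumption based on the "nights and weekends" wording.

```shell
# Pick a --bwlimit value (KiB/s) from the hour (00-23) and the ISO weekday
# (1 = Monday ... 7 = Sunday), as printed by `date +%H` and `date +%u`.
pick_bwlimit() {
    hour=${1#0}          # normalize "09" to "9"
    hour=${hour:-0}      # a bare "0" was stripped to empty: midnight
    weekday=$2
    if [ "$weekday" -le 5 ] && [ "$hour" -ge 9 ] && [ "$hour" -lt 17 ]; then
        echo 5000        # weekday business hours: conservative limit
    else
        echo 20000       # nights and weekends: higher limit
    fi
}

limit=$(pick_bwlimit "$(date +%H)" "$(date +%u)")
echo "suggested flag: --bwlimit=$limit"
```

The result can be passed straight to rsync as `--bwlimit="$limit"`; lowering it further when other users are transferring remains a manual decision.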
Verification steps:
- Check ownership after transfer: ls -la /destination/path
- Verify symlinks are preserved: ls -la /destination/path | grep "^l"
- Test that software still works in the new location
Examples of verification:
# Check file ownership and permissions
ls -la /valhalla/projects/project1/transferred_folder/
# Should show correct ownership and permissions
# Verify symlinks are preserved (not broken)
ls -la /valhalla/projects/project1/conda_env/ | grep "^l"
# Should show symlinks (lines starting with 'l')
# Test conda environment still works
source /valhalla/projects/project1/conda_env/bin/activate
python --version
# Should work without errors
# Check file integrity (compares file contents by checksum; nothing is copied)
rsync -vrtl --checksum --dry-run /home/username/source/ /valhalla/projects/project1/destination/
# Should list no files if the transfer was successful
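Listing links with `ls` shows that they exist but not whether their targets still resolve in the new location. A minimal sketch that reports dangling links under a directory, using only POSIX `find` and `test`; the helper name `broken_symlinks` and the demo files are illustrative.

```shell
# List symbolic links under a directory whose targets do not exist.
# "test -e" follows the link, so it fails exactly when the target is missing.
broken_symlinks() {
    find "$1" -type l ! -exec test -e {} \; -print
}

# Demo in a throwaway directory with one valid and one dangling link.
demo=$(mktemp -d)
echo "data" > "$demo/target.txt"
ln -s target.txt "$demo/good_link"
ln -s missing.txt "$demo/dangling_link"

broken_symlinks "$demo"
```

An empty result for the transferred folder suggests that every link still points at something; links with absolute targets back into the old location will still resolve until the source is removed, so they are worth checking separately.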
Getting help¶
See Getting help