[HPC 101] Data Transfer: How to Move Files In and Out

5 minute read

We have the computing power. Now we need the data.

Moving files between your local machine (laptop/workstation) and the HPC cluster is a daily routine for researchers. You have your code, input data, and eventually, the results. This guide covers the best practices for file transfer, from “packing” your files to handling massive datasets.

(Video tutorial available on YouTube)


> 1. The Golden Rule: Pack Before You Move

Think of this process like moving into a new house.

In the previous post, we compared the HPC cluster to a Hotel. Your laptop is your old house. Now, you need to move your belongings (data) to the new place.

Imagine you have 10,000 pairs of socks (small data files). Would you carry them one by one to the moving truck? No, that would take forever. You would put them in a box first.

In HPC, transferring thousands of small files individually kills performance: every file pays its own per-file overhead (connection setup, metadata operations), so the network spends more time on bookkeeping than on moving data. Always archive your folder first.

Choose Your Box: Tar vs. Zip

Tar (the standard on Linux/HPC systems):

# Packing (create archive)
$ tar -czf my_data.tar.gz my_folder
# -c: Create
# -z: Gzip compression
# -f: File name

# Unpacking (extract archive)
$ tar -xf my_data.tar.gz
# -x: Extract
# -f: File name
# (On most modern systems, tar detects the compression automatically)

Zip (handy when sharing with Windows users):

# Packing (create archive)
$ zip -r my_data.zip my_folder
# -r: Recursive (includes all subdirectories)

# Unpacking (extract archive)
$ unzip my_data.zip
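As a quick sanity check, here is the full round-trip on a throwaway folder (the folder and file names below are just placeholders):

```shell
# Create a small test folder with a few files
mkdir -p my_folder
echo "hello" > my_folder/a.txt
echo "world" > my_folder/b.txt

# Pack it
tar -czf my_data.tar.gz my_folder

# Peek inside the archive WITHOUT extracting (-t: list contents)
tar -tzf my_data.tar.gz

# Unpack into a separate directory to verify (-C: extract there)
mkdir -p restore
tar -xf my_data.tar.gz -C restore

# diff exits 0 only if both trees are identical
diff -r my_folder restore/my_folder && echo "OK: archive round-trip matched"
```

Listing with `-t` before extracting is a good habit: it tells you whether the archive unpacks into a single folder or scatters files into your current directory.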


> 2. Direct Download (Web to HPC)

Scenario: Your data is hosted on a website.

Do not download it to your laptop just to upload it again to the cluster. That is double work. Just order your “delivery” directly to the hotel (Cluster)!

Use wget or curl on the login node (or a designated data transfer node, if your cluster provides one).

# Option 1: Using wget
$ wget https://example.com/dataset.tar.gz

# Option 2: Using curl (-o: output file name; -L: follow redirects)
$ curl -L -o dataset.tar.gz https://example.com/dataset.tar.gz
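Whichever tool you use, verify the download before computing on it. Many data hosts publish a checksum file next to the download (the `dataset.tar.gz.sha256` name below is hypothetical; the stand-in file makes this sketch self-contained):

```shell
# Stand-in for a real downloaded file, so this sketch runs anywhere
echo "pretend data" > dataset.tar.gz

# The data host would normally publish this checksum file alongside the download
sha256sum dataset.tar.gz > dataset.tar.gz.sha256

# Verify: prints "dataset.tar.gz: OK" if the file is intact
sha256sum -c dataset.tar.gz.sha256
```

A checksum mismatch usually means a truncated or corrupted download; just re-run wget/curl.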


> 3. Transfer Tools: SCP vs. Rsync

Scenario: The files are on your laptop. (Note: Run these commands on your Local Terminal, not inside the cluster.)

SCP (The “Throw”)

If you have a small file or a single packed archive, use scp (Secure Copy). It is simple and quick.

# Upload: Laptop -> Cluster
$ scp my_data.tar.gz <USER>@<HOST_NAME>:~/
# Example: scp data.tar.gz user123@login.university.edu:~/

# Download: Cluster -> Laptop
$ scp <USER>@<HOST_NAME>:~/results.tar.gz ./
# Example: scp user123@login.university.edu:~/results.tar.gz ./

Rsync (The “Smart Mover”)

What if your file is huge (e.g., 100 GB)? And what if your WiFi disconnects at 99%? scp will fail, and you have to start over from 0%. That is a nightmare.

Use rsync. It checks the difference between source and destination. If the connection drops, it resumes from where it left off.

$ rsync -azP my_big_data <USER>@<HOST_NAME>:~/
# Example: rsync -azP data.tar.gz user123@login.university.edu:~/

Understanding the flags (-azP):

  • -a: Archive mode. Preserves permissions, timestamps, and symbolic links.
  • -z: Compress file data during the transfer for faster speed.
  • -P: Shorthand for --progress (show a progress bar) and --partial (keep partially transferred files so an interrupted transfer can resume).

Rule of Thumb:

  • Small file or Simple transfer? Use SCP.
  • Big file or Unstable network? Use Rsync.


> 4. GUI Clients (WinSCP & FileZilla)

“I hate the terminal. Can I just drag and drop?”

Yes, you can! If you are not comfortable with command-line tools yet, or if you just want to browse files visually, use an SFTP Client.

How to Connect

The settings are exactly the same as your SSH connection.

  1. File Protocol: SFTP
  2. Host name: Your cluster address (e.g., login.university.edu)
  3. Port number: 22 (Default SSH port)
  4. User/Password: Your credentials

Once connected, you will see your laptop’s files on the left and the cluster’s files on the right. Just drag and drop to transfer!

Note for Globus Users: If you need to transfer massive datasets (Terabytes/Petabytes) between institutions, ask your system administrator about Globus. It is a high-performance transfer service often supported by research centers. It’s much faster and more reliable than SCP/SFTP for massive data.


> 5. Code Management with Git

Scenario: Moving your Python/C++ scripts.

Should you use rsync for your code? You can, but please don’t. Treat your code like books in a library. Use Git.

  1. Laptop: Push your code to GitHub/GitLab.

     # On your machine after commit
     $ git push
    
  2. Cluster: Clone or Pull the repository.

     # On the Cluster
     $ git clone https://github.com/username/my-project.git
    

This keeps your version history safe and makes collaboration much easier.
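
The whole push/clone loop can be sketched on one machine; the bare `origin.git` repository below plays the role of GitHub/GitLab (all names are placeholders):

```shell
# A bare repository standing in for GitHub/GitLab
git init --bare origin.git

# "Laptop" side: create a project, commit, and push
git init laptop
cd laptop
git config user.email "you@example.com"
git config user.name "You"
echo "print('hello')" > run.py
git add run.py
git commit -m "first commit"
git remote add origin ../origin.git
git push origin HEAD
cd ..

# "Cluster" side: clone the same repository
git clone origin.git cluster
ls cluster   # run.py is now on the "cluster"
```

On the real cluster you would `git clone` the HTTPS or SSH URL once, then `git pull` to fetch new commits after each push from your laptop.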


> 6. Storage Quota

Warning: Remember the “Hotel Room” analogy? Your room has an occupancy limit. We call it Quota.

If you fill up your disk space, your jobs will crash, and you may be unable to save files or even log in.

How do you check? Commands vary by institution. Common examples include:

  • $ quota -s
  • $ lfs quota -u user123 /home/user123
  • $ check_usage

Please check your user documentation or ask your help desk for the specific command. Always check your available space before transferring a massive dataset.
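
Even without a site-specific quota command, two standard tools report actual usage (note: they show what you use and what the filesystem holds, not your personal quota limit):

```shell
# How much space does my home directory use? (-s: summary, -h: human-readable)
du -sh ~

# How full is the filesystem that my home directory lives on?
df -h ~
```

If `du` reports you are close to the limit your site documents, clean up old archives and results before starting the next transfer.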


Summary

  1. Pack your small files (tar or zip).
  2. Use wget for web data.
  3. Use scp for quick, small transfers.
  4. Use rsync -azP for large, robust transfers.
  5. Use git for code.

Nice job! You have learned how to prepare your data. In the next post, we will learn how to manage software environments using Conda.

Happy Computing!