HPC & ML Infrastructure Engineering
New to HPC and feeling lost?
From SSH basics to Slurm job debugging
Practical HPC tutorials that go beyond "Hello World"
Not sure where to start?
📺 Latest Video: HPC 101
📰 Recent Posts
| Post | Good for you if… |
|---|---|
| Special Topic: Cloud Storage | Transferring files to and from cloud storage |
| Lesson 4: Slurm Job Debugging | Your job is stuck in PENDING |
| Lesson 3: Environment Management | Package conflicts & venv confusion |
| Lesson 2: Data Transfer | Moving files to/from the cluster |
| Lesson 1: SSH, Modules, Slurm | Complete HPC beginner |
| Linux 101: Don’t Fear the Terminal | The black screen intimidates you |
About Me
Hi, I’m Will Paik. Welcome to The Login Node.
I specialize in scaling AI/ML models on High-Performance Computing (HPC) systems. In supercomputing, there’s always a natural tension between system administrators (“Keep it stable!”) and researchers (“Run it faster!”). My job is to find the technical sweet spot that makes both of them happy.
Currently, I work as an HPC Machine Learning Performance Engineer. By day, I optimize large-scale clusters for training massive AI models. At night, I build (and occasionally break) my own mini-supercomputer to teach you how it all works.
CORE STACK: Slurm, Linux, Docker/Apptainer, PyTorch Distributed, Ansible
"Function over Form. The physical cluster building process documented on The Login Node."
My Home Cluster
“If you can’t log in, you can’t compute.”
Hardware Specs
| Role | Hardware Model | Specs |
|---|---|---|
| Login Node | Lenovo IdeaPad 1 | Ryzen 5 7520U, 8GB RAM |
| Management | Lenovo ThinkCentre M715q | Ryzen 5 2400GE, 16GB RAM |
| Visualization | Lenovo ThinkCentre M715q | Ryzen 5 2400GE, 16GB RAM |
| Worker Nodes | Lenovo ThinkCentre M715q | Ryzen 5 2400GE, 16GB RAM |
| GPU Node | HP Envy TE01 | Core i7-10700F, 32GB RAM, GTX 1660 Super (6GB) |
| Storage | (Shared via Mgmt) | 1TB NVMe SSD (NFS Share) |
| Network | Gigabit Managed Switch | 8-Port, VLAN Support |
Software Stack
- OS: Rocky Linux 10
- Scheduler: Slurm 25
- Provisioning: Ansible
- Container: Apptainer
- Monitoring: Prometheus + Grafana (In Progress)