You can SSH into a cluster. But what happens between the login prompt and a job running across 100 GPUs?
That gap is what I want to document.
I’m Will Paik. I work as an HPC Machine Learning Performance Engineer at Northeastern University, where my job is to manage the tension between sysadmin priorities (“keep it stable”) and researcher priorities (“run it faster, right now”). I’ve been doing some version of that for nine years, first at Penn State and now at Northeastern.
I’m also building a 6-node HPC cluster at home from consumer hardware. Not because I need one. Because building a system from scratch teaches you things that years of using other people’s systems do not.
1. Why Start This Blog
Most HPC documentation lives at one of two extremes: enterprise guides that assume a dedicated IT staff and a six-figure budget, or Stack Overflow threads that solve one specific problem without explaining why it happened. There is not much for someone who wants to understand how the whole system fits together.
The second gap is at the intersection of HPC and ML. Most ML practitioners know how to write a training script. Most HPC practitioners know how to manage a cluster. Very few resources explain the space in between: what you need to understand when you are running distributed training on real infrastructure and debugging problems that span the scheduler, the network, the filesystem, and the model at the same time.
2. What I Am Thinking of Doing
For now, I am thinking about practical guides for researchers who are new to shared HPC clusters. Not “here is the sbatch man page.” More like: here is what you need to know to actually get work done without accidentally wiping your home directory or getting your account suspended.
After that, I plan to document the home cluster build end to end. Every hardware decision, every config file, every mistake. Networking, storage, job scheduling, Ansible automation, and eventually GPU workloads. I want the posts to read like a real build log, not a tutorial written after everything already worked.
Where it goes from there, whether ML infrastructure, distributed training, cluster benchmarking, or something else, I will figure out as I go.
3. How I Am Approaching It
Real hardware. Real output. Not just the final working version, but the failures and the fixes that got there. The cluster I am building has real constraints: consumer CPUs, consumer networking, a gaming PC as a GPU node. The solutions have to work in that context, which means they should be useful for researchers working with similarly constrained infrastructure.
Everything will be bilingual where possible. English and Korean posts will cover the same content.
4. Where to Follow Along
Videos for the cluster build episodes will go on The Login Node YouTube channel. Code and config files will go on GitHub as the projects develop.
Happy Computing!