


[{"content":" 🔧 HPC From Scratch \u0026ndash; Building a real 6-node cluster from consumer hardware under $1,300. Hardware selection, OS install, networking, Slurm, Ansible, and GPU workloads. Start here. 🎓 HPC 101 \u0026ndash; SSH, module systems, Slurm fundamentals, and job debugging. For researchers new to HPC. Start here. 🐧 Linux 101 \u0026ndash; Terminal basics for people who find the command line intimidating. Start here. ","date":"14 June 2026","externalUrl":null,"permalink":"/","section":"","summary":"","title":"","type":"page"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/a100/","section":"Tags","summary":"","title":"A100","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/benchmarking/","section":"Tags","summary":"","title":"Benchmarking","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/ddp/","section":"Tags","summary":"","title":"DDP","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/distributed-training/","section":"Tags","summary":"","title":"Distributed Training","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/gpu/","section":"Tags","summary":"","title":"GPU","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/hpc/","section":"Tags","summary":"","title":"HPC","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/series/hpc-special-topics/","section":"Series","summary":"","title":"HPC Special Topics","type":"series"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/portfolio/","section":"Portfolio","summary":"","title":"Portfolio","type":"portfolio"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":"","date":"14 June 
2026","externalUrl":null,"permalink":"/tags/pytorch/","section":"Tags","summary":"","title":"PyTorch","type":"tags"},{"content":"A reproducible benchmark suite for characterizing PyTorch Distributed Data Parallel (DDP) training performance on NVIDIA GPU clusters. Built for pre-production validation of large-scale HPC infrastructure, and generalized for broader use on any Slurm-based system or cloud instance.\nWhat It Does # The benchmark runs training on synthetic on-device data to isolate GPU compute and NCCL communication from storage I/O. It measures two complementary scaling modes:\nWeak scaling keeps per-GPU batch size fixed while adding GPUs. This answers whether each GPU stays productive as the system grows. Throughput per GPU should stay flat; a drop reveals communication overhead.\nStrong scaling keeps the global batch fixed while adding GPUs. This answers how much faster the same workload runs with more resources, expressed as speedup and parallel efficiency relative to a single-GPU baseline.\nResults are collected as JSON per run and aggregated into tables and plots. GPU activity is sampled during measurement via nvidia-smi, and low-utilization configurations are flagged automatically.\nResults: V100 vs A100 # Full analysis is in the blog post.\nWeak scaling efficiency (fp16, 1 to 8 GPUs):\nGPU Model Per-GPU BS 1 GPU img/s 8 GPU efficiency V100 SXM2 16GB ResNet-152 128 503 95.4% V100 SXM2 16GB ViT-B/16 128 381 95.9% A100 SXM4 80GB ResNet-152 512 1190 97.4% A100 SXM4 80GB ViT-B/16 1024 1030 98.0% Generation comparison (fp16, each GPU at its max batch size):\nModel V100 img/s/GPU A100 img/s/GPU Ratio ResNet-152 503 1190 2.37x ViT-B/16 381 1030 2.70x A100 SXM4 80GB V100 SXM2 16GB Why It Was Built # Statewide AI research infrastructure serving hundreds of researchers needs to be validated before opening to users. 
This benchmark was developed to stress-test GPU compute and inter-node communication on B200 and RTX Pro 6000 Blackwell hardware before cluster launch, and to produce numbers that can be reported to stakeholders in a reproducible way.\nTechnical Details # Language: Python 3.10+, Bash Framework: PyTorch with torch.distributed / NCCL Scheduler: Slurm (torchrun + srun for multi-node) or direct instance execution Models: ResNet-152, ViT-B/16, ResNet-50, ResNet-101 Precision: fp16 and bf16 Utilities: find_max_bs.py for per-GPU batch size calibration Outputs: per-run JSON, terminal tables, matplotlib PNG plots Repository # github.com/willgpaik/pytorch-ddp-scaling-benchmark\n","date":"14 June 2026","externalUrl":null,"permalink":"/portfolio/pytorch-ddp-bench/","section":"Portfolio","summary":"","title":"PyTorch DDP Scaling Benchmark","type":"portfolio"},{"content":"The A100 is faster than the V100. But how much faster, and where does the gap come from?\nBoth generations support NVLink, both run PyTorch DDP, and both show up in academic HPC clusters. Getting clean numbers across the two is harder than it sounds: VRAM differences force different batch sizes, CUDA build constraints limit which PyTorch wheel you can install, and the comparison shifts depending on whether you care about peak throughput or multi-GPU scaling efficiency.\nThis post covers measured results from the same benchmark code on 8xV100 SXM2 16GB and 8xA100 SXM4 80GB, using two model architectures that stress the GPU and interconnect differently: ResNet-152 (compute-heavy CNN) and ViT-B/16 (communication-heavier transformer). The benchmark tool is open source and described in the portfolio entry. All results and code are at GitHub repository.\n1. Why This Comparison Is Harder Than It Looks # The obvious approach is to run the same batch size on both GPUs and compare throughput. That breaks immediately: V100 has 16 GB VRAM, A100 has 80 GB. 
The largest batch size that fits on V100 is far below what A100 can handle. Forcing A100 to use V100-constrained batch sizes would suppress its throughput artificially, and running V100 at OOM produces no result at all.\nThis benchmark separates the comparison into two parts.\nWeak scaling efficiency compares how well each system scales from 1 to 8 GPUs, with per-GPU batch size fixed. This metric is normalized: it measures what fraction of 1-GPU throughput survives at N GPUs. A result of 95% at 8 GPUs means the system loses 5% of per-GPU throughput to DDP communication overhead. This comparison is valid even when the two systems use different batch sizes.\nAbsolute throughput compares raw images per second per GPU at each system\u0026rsquo;s maximum safe batch size. This is a \u0026ldquo;peak performance\u0026rdquo; comparison. It is not normalized, so the batch size difference is a real confounding variable. Treat it as \u0026ldquo;A100 at its best vs V100 at its best\u0026rdquo; rather than a controlled experiment.\nStrong scaling is the most sensitive to batch size differences. The per-GPU batch shrinks as you add GPUs, so a GPU with less VRAM runs out of headroom faster. At V100\u0026rsquo;s memory limit (GBS=64), the 8-GPU run processes only 8 images per GPU per step, which puts the workload deep in the communication-dominated regime. The A100 results use GBS=512 and reach 64 images per GPU at 8 GPUs. These are not directly comparable, so strong scaling results are reported separately for each system.\n2. Setup # Hardware:\n8xNVIDIA V100 SXM2 16GB (NVLink 2.0) on Lambda Cloud 8xNVIDIA A100 SXM4 80GB (NVLink 3.0) on Lambda Cloud Software:\nV100: PyTorch 2.4.1, CUDA 12.6, Python 3.10 A100: PyTorch 2.12.0, CUDA 13.0, Python 3.12 Note on PyTorch versions: Recent PyTorch wheels (cu130) no longer include sm_70, which is V100\u0026rsquo;s compute capability. The V100 requires an older wheel. 
This means throughput differences include a small PyTorch version component that cannot be fully controlled. Weak scaling efficiency comparisons are less sensitive to this because they are normalized within each system.\nAll runs use synthetic on-device data, torchrun for DDP, and the SGD optimizer. Each job runs a 60-second warmup followed by 300 to 600 seconds of measurement. Maximum safe batch sizes were found using find_max_bs.py:\nGPU ResNet-152 fp16 ViT-B/16 fp16 V100 SXM2 16GB 128 128 A100 SXM4 80GB 512 1024 3. Weak Scaling Results # Weak scaling holds per-GPU batch size fixed. Throughput per GPU should stay constant as more GPUs are added. Any drop is DDP communication overhead from NCCL gradient allreduce.\nV100 SXM2 16GB, fp16, per-GPU BS=128:\nModel 1 GPU 2 GPU eff. 4 GPU eff. 8 GPU eff. ResNet-152 503 img/s 95.6% 95.8% 95.4% ViT-B/16 381 img/s 96.5% 96.5% 95.9% A100 SXM4 80GB, fp16:\nModel Per-GPU BS 1 GPU 2 GPU eff. 4 GPU eff. 8 GPU eff. ResNet-152 512 1190 img/s 98.9% 98.1% 97.4% ViT-B/16 1024 1030 img/s 98.9% 98.6% 98.0% A100 SXM4 80GB, bf16:\nModel Per-GPU BS 1 GPU 2 GPU eff. 4 GPU eff. 8 GPU eff. ResNet-152 512 1012 img/s 99.3% 98.9% 98.5% ViT-B/16 1024 1033 img/s 99.6% 99.3% 98.9% Both systems scale well. V100 reaches 95 to 96% at 8 GPUs; A100 reaches 97 to 99%. The 2 to 3 percentage point difference reflects NVLink 3.0 (600 GB/s bidirectional) vs NVLink 2.0 (300 GB/s bidirectional). At the batch sizes used here, V100 spends a slightly larger fraction of each step waiting on gradient synchronization.\nAbsolute throughput comparison (fp16, each GPU at its max batch size):\nModel V100 1-GPU A100 1-GPU Ratio ResNet-152 503 img/s 1190 img/s 2.37x ViT-B/16 381 img/s 1030 img/s 2.70x A100 delivers roughly 2.4x more throughput for ResNet-152 and 2.7x for ViT-B/16. The larger gap for ViT reflects A100\u0026rsquo;s improved Tensor Core throughput for GEMM-heavy operations and better memory bandwidth (2.0 TB/s vs 0.9 TB/s on V100).\n4. 
fp16 vs bf16 on A100 # A100 supports both fp16 and bf16 in hardware. Both formats use 2 bytes per value, so VRAM usage is identical. The performance difference comes from the compute path.\nModel fp16 bf16 Gap ResNet-152 1190 img/s 1012 img/s fp16 is 17% faster ViT-B/16 1030 img/s 1033 img/s no meaningful difference ResNet-152 runs 17% faster in fp16. ViT-B/16 shows no difference. The reason is memory layout: ResNet uses channels_last (NHWC) format, which cuDNN\u0026rsquo;s convolution kernels handle differently per precision. The fp16 NHWC path on A100 is more optimized than bf16. ViT-B/16 does not use channels_last because attention is not a convolution, so the precision difference does not interact with memory layout.\nThis means the \u0026ldquo;use bf16 on Ampere\u0026rdquo; advice common in transformer training literature does not automatically transfer to CNN workloads. For ResNet-scale models, fp16 is the faster choice on A100.\n5. Strong Scaling Results # Strong scaling fixes global batch size and measures how much faster the job finishes with more GPUs. Ideal speedup is N for N GPUs.\nDirect comparison is not valid here. V100 and A100 use different GBS values because V100\u0026rsquo;s VRAM constrains the 1-GPU baseline. The tables below show each system independently.\nV100 SXM2 16GB, fp16, GBS=64 (per-GPU shrinks from 64 to 8 at 8 GPUs):\nModel 1 GPU 8 GPU speedup 8 GPU efficiency ResNet-152 461 img/s 1.47x 18.4% ViT-B/16 367 img/s 2.81x 35.1% A100 SXM4 80GB, fp16, GBS=512 (per-GPU shrinks from 512 to 64 at 8 GPUs):\nModel 1 GPU 8 GPU speedup 8 GPU efficiency ResNet-152 961 img/s 2.55x 31.9% ViT-B/16 1061 img/s 6.42x 80.3% Strong scaling efficiency collapses when per-GPU batch size gets small. At V100 GBS=64 with 8 GPUs, each card processes 8 images per step. The gradient allreduce takes longer than the forward and backward pass combined at that size. 
ResNet-152 on V100 reaches 18.4% at 8 GPUs: adding 7 more GPUs delivers only 1.47x speedup on a job that ideally would be 8x faster.\nViT-B/16 fares better (35.1% on V100, 80.3% on A100) because ViT\u0026rsquo;s per-step compute is heavier than ResNet at the same batch size. More computation overlaps with communication, so the synchronization wait is a smaller fraction of each step.\n6. What the Numbers Mean # Both systems scale cleanly in weak scaling (above 95% at 8 GPUs). The question is whether the cost difference justifies the throughput difference.\nA100 delivers 2.4 to 2.7x more throughput per GPU. If Lambda\u0026rsquo;s A100 pricing is less than 2.4x the V100 price, A100 gives better cost efficiency per trained sample. If your workload requires a large global batch size and you need strong scaling to help it finish faster, A100 has a significant structural advantage: its larger VRAM maintains efficient per-GPU batch sizes at higher GPU counts, while V100 runs out of headroom quickly.\nIf your workload fits in 16 GB per GPU and you are running weak scaling across many nodes, V100 at 95% efficiency may be cost-effective depending on price.\n7. Troubleshooting # CUDA error: no kernel image is available for execution on the device on V100\nRecent PyTorch wheels (cu130, cu128) do not include sm_70 (V100). Install cu126 instead:\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu126 python -c \u0026#34;import torch; print(torch.cuda.get_arch_list())\u0026#34; # Confirm sm_70 appears in the output OOM on 1-GPU strong scaling baseline\nThe 1-GPU strong scaling run uses per-GPU BS = GBS. If GBS exceeds your GPU\u0026rsquo;s safe batch size, it will OOM before the multi-GPU runs. 
Use find_max_bs.py to find the ceiling and set GBS at or below it:\npython find_max_bs.py --model resnet152 --precision fp16 High step-time variance warning from analyze_results.py\nThis is expected when per-GPU batch size is very small (strong scaling at high GPU counts). The GPU is communication-bound and step timing becomes irregular. It can be safely ignored for the communication-overhead analysis; it is only a concern if variance appears in weak scaling results at large batch sizes.\n8. Summary # Three takeaways from this benchmark.\nBoth V100 and A100 scale well in weak scaling, exceeding 95% efficiency at 8 GPUs. The 2 to 3 percentage point gap between the two reflects NVLink 3.0 vs NVLink 2.0 bandwidth, not a fundamental difference in scaling behavior.\nA100 delivers 2.4x more throughput for ResNet-152 and 2.7x for ViT-B/16 at each system\u0026rsquo;s maximum batch size. Memory bandwidth and Tensor Core throughput both contribute, with the larger gap for ViT reflecting A100\u0026rsquo;s advantage in GEMM-heavy workloads.\nOn A100, fp16 outperforms bf16 by 17% for ResNet-152 due to more optimized cuDNN NHWC convolution kernels. ViT-B/16 shows no difference between the two precisions. The common advice to \u0026ldquo;use bf16 on Ampere\u0026rdquo; does not transfer to CNN workloads without verification.\nAll benchmark scripts, raw JSON results, and plots are in the GitHub repository.\nHappy Computing!\n","date":"14 June 2026","externalUrl":null,"permalink":"/posts/hpc-special-topics-02/","section":"Posts","summary":"V100 and A100 both scale past 95% efficiency across 8 GPUs, but A100 delivers 2.4 to 2.7x more throughput per GPU. 
This post covers measured PyTorch DDP scaling results on 8xV100 SXM2 and 8xA100 SXM4, using ResNet-152 and ViT-B/16 with fp16 and bf16, and explains what the numbers actually mean for system selection.","title":"PyTorch DDP Scaling: V100 vs A100 on 8 GPUs with ResNet-152 and ViT-B/16","type":"posts"},{"content":"Each series is a self-contained progression: start from part 1 and follow through in order. Posts within a series link to each other automatically.\n","date":"14 June 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"14 June 2026","externalUrl":null,"permalink":"/tags/v100/","section":"Tags","summary":"","title":"V100","type":"tags"},{"content":"","date":"13 June 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"AI","type":"tags"},{"content":"","date":"13 June 2026","externalUrl":null,"permalink":"/tags/monte-carlo/","section":"Tags","summary":"","title":"Monte Carlo","type":"tags"},{"content":" Overview # A text-based Texas Hold\u0026rsquo;em poker game built in Python using only the standard library. The AI opponent uses Monte Carlo simulation to estimate win probability from incomplete information and decides whether to raise, call, fold, or bluff.\nGitHub: github.com/willgpaik/montecarlo-poker\nGoals # The initial goal was to implement a working poker game using OOP without external packages, relying only on random, math, copy, and collections from the standard library. 
The rules were implemented from scratch based on Texas Hold\u0026rsquo;em specs, which meant writing the full hand evaluator, betting logic, and blind structure by hand.\nHow the AI Works # Each AI turn triggers a Monte Carlo simulation:\nDeep copy the remaining deck and shuffle it Randomly complete the community cards up to 5 Randomly assign hole cards to each simulated opponent Evaluate all hands and check if the AI wins Repeat 1000 times and compute win rate = wins / 1000 Each AI is assigned a random personality at game start that determines its decision thresholds. The personality is not revealed to the player.\nPersonality Raise threshold Fold threshold Bluff chance aggressive 0.50 0.25 15% passive 0.75 0.45 5% bluffer 0.60 0.30 20% Thresholds also have ±0.4 random noise applied per decision to prevent predictable behavior. A separate 30% random action layer runs on top regardless of win rate.\nWin rate above raise threshold → raise. Win rate above fold threshold → call. Otherwise → fold. Win rate above 0.9 triggers all-in at 30% chance.\nDevelopment Notes # The initial version had the structure and classes in place but several broken pieces.\nHand evaluator bugs # The royalflush() function compared the full card tuple instead of the value field (card == 1 instead of card[1] == 1), so royal flush never triggered. The straightflush() function checked the correct suit in the outer condition but then filtered for 'heart' cards in all four suit branches, a copy-paste error that broke straight flush detection for club, diamond, and spade entirely.\nStraight detection # cards.sort() on (suit, value) tuples sorts alphabetically by suit first, so straight() received values like [3, 7, 2, 5, 4, 8, 6] instead of a numerically sorted list. Straights were rarely detected. The fix was sorted(set(card[1] for card in cards)) to get deduplicated, numerically sorted values.\nMonte Carlo was deterministic # The simulate loop used copy.deepcopy(deck) but never shuffled the copy. 
Every simulation drew the same cards in the same order, so the win rate was always 0 or 1. Adding remainingDeck.shuffle() inside the loop fixed this.\nAction Prompt # Flop, turn, and river each start a new callAll() call with betHigh=0. Since roundBet is also initialized to 0, every player immediately satisfied roundBet[idx] == betHigh and was skipped. The game auto-played through all rounds without asking the human for input. The fix was a hasActed[] boolean array that tracks whether each player has acted in the current round, separate from whether their bet amount matches the current high.\nHandling re-raise # The original callCnt counter incremented per call but never reset when a raise occurred. If player A called and player B raised, player A was not prompted again. The rewrite uses hasActed[] and resets it for all other players when a raise occurs.\nMoney deducted twice # callHuman() and callAI() deducted from player.money directly. callAll() then tracked the same amounts in roundBet and added them to the pot. The fix restructured all helper functions to return amounts only, with a single deduction point in callAll(). This touched callHuman, callAI, raiseHuman, and raiseAI.\nFirst player wins on a tie # The function used player.score, which is set by getScore(). That method was never called during gameplay, so every player\u0026rsquo;s score stayed at 0. The fix was to compare the playerScore tuples (already computed by think()) directly, using the high card and low card fields for tiebreaking.\nHand evaluator refactor # The original evaluator had 10 separate functions, one per hand rank, totaling around 250 lines. straightflush() alone was 90 lines of the same logic copy-pasted for each of the four suits. The rewrite uses a single evaluate_hand() function with collections.Counter for value and suit frequency, and a nested find_straight() helper shared by both straight and straight flush detection. 
The result is around 50 lines and removes all the repetition.\nKnown Limitations # Side pot not implemented. When a player goes all-in for less than the current bet, the correct behavior is to split the pot. This version gives the full pot to the winner regardless of all-in amounts. No opponent modeling. The AI has a fixed personality per game but does not adapt to observed player behavior across hands. Text interface only. A browser version using the same Monte Carlo logic running in JavaScript is a future direction. Stack # Python 3.10+ · Standard library only (random, math, copy, collections)\n","date":"13 June 2026","externalUrl":null,"permalink":"/portfolio/montecarlo-poker/","section":"Portfolio","summary":"","title":"montecarlo-poker","type":"portfolio"},{"content":"","date":"13 June 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"13 June 2026","externalUrl":null,"permalink":"/tags/side-project/","section":"Tags","summary":"","title":"Side Project","type":"tags"},{"content":"","date":"13 June 2026","externalUrl":null,"permalink":"/tags/simulation/","section":"Tags","summary":"","title":"Simulation","type":"tags"},{"content":"The scheduler is running. Now teach it who gets what.\nIn Episode 5, we installed Slurm, wired up slurmdbd to MariaDB, and submitted the first jobs. The cluster works.\nBut right now it is a free-for-all. There are no time limits on jobs. One user can flood the queue with a hundred jobs and starve everyone else. A user who has been running non-stop for a week looks identical to someone submitting their very first job. For a single-user home cluster this is fine. For anything shared, it falls apart fast.\nThis episode builds the accounting layer from scratch: account hierarchy in sacctmgr, QOS policies, and fair share scheduling. 
We will also add time limits to the partitions, which currently accept jobs with no wall time limit at all.\n*(Click the image to watch the tutorial on YouTube)* Prerequisites: This episode assumes you have completed Episode 5. You need a working Slurm installation with slurmdbd connected to MariaDB, and at least one compute node reporting as idle. All commands below assume the cluster is healthy.\n1. How Slurm Accounting Is Organized # Slurm accounting has three levels.\nCluster sits at the top. This is what we registered in Episode 5 with sacctmgr -i add cluster cluster.\nAccounts are groupings below the cluster. Think departments, research groups, or PI labs. Accounts hold the fair share budget. If research holds 80% of the cluster\u0026rsquo;s share allocation and demo holds 20%, that ratio controls how Slurm prioritizes their jobs when the queue is contested.\nUsers belong to one or more accounts. When a user submits a job, Slurm charges usage against their account, which affects that account\u0026rsquo;s fair share standing.\nOur current state from Episode 5:\ncluster └── root ├── root (user) └── wpaik Everything is under the root account with no structure. We will add two sub-accounts and move users into the right place:\ncluster └── root ├── research (share=80) │ ├── wpaik │ └── testuser1 └── demo (share=20) └── testuser2 2. Building the Account Tree # Create the two sub-accounts under root. The parent=root flag places them in the hierarchy below the existing root account.\n[wpaik@arbiter ~]$ sudo sacctmgr -i add account research parent=root \\ Description=\u0026#34;Research Group\u0026#34; Organization=\u0026#34;Cluster\u0026#34; fairshare=80 [wpaik@arbiter ~]$ sudo sacctmgr -i add account demo parent=root \\ Description=\u0026#34;Demo Group\u0026#34; Organization=\u0026#34;Cluster\u0026#34; fairshare=20 Note: sacctmgr write operations (add, modify, delete) require admin access. 
Since wpaik has AdminLevel=None in Slurm, use sudo for any command that modifies the database. Read-only commands like sacctmgr show do not need sudo.\nBefore adding users to Slurm accounting, make sure testuser1 and testuser2 exist as actual system users. Since this cluster uses FreeIPA, add them there first. For demo purposes, minimal accounts with no home directory are enough. Slurm will accept users in sacctmgr that do not have home directories, but jobs will only run if the OS can resolve the username.\n[wpaik@arbiter ~]$ ipa user-add testuser1 --first=Test --last=User1 [wpaik@arbiter ~]$ ipa user-add testuser2 --first=Test --last=User2 Now add users to the accounting database:\n# wpaik already has an association under root from Episode 5. # Adding to research creates a second association and sets it as default. [wpaik@arbiter ~]$ sudo sacctmgr -i add user wpaik account=research defaultaccount=research [wpaik@arbiter ~]$ sudo sacctmgr -i add user testuser1 account=research defaultaccount=research [wpaik@arbiter ~]$ sudo sacctmgr -i add user testuser2 account=demo defaultaccount=demo A user can belong to multiple accounts simultaneously. wpaik now has associations under both root and research. Their default account (what gets charged when no --account flag is specified) is now research.\nNote on admin accounts: In this series, wpaik handles all cluster administration, including sacctmgr commands, slurm.conf changes, and sudo tasks. That is a common simplification for a home lab. In production HPC environments, sysadmin work typically runs under a dedicated service account, keeping administrative activity out of the fair share calculation. What matters here is that wpaik has AdminLevel=None in sacctmgr, so fair share applies to it exactly like any regular user. 
Linux sudoer privilege is invisible to the scheduler.\nVerify the tree:\n[wpaik@arbiter ~]$ sacctmgr show associations format=cluster,account,user,share,qos,defaultqos Cluster Account User Share QOS DefQOS ---------- ---------- ---------- --------- -------------------- ------ cluster root 1 cluster root root 1 cluster research 80 normal,high,gpu normal cluster research wpaik 1 normal,high,gpu normal cluster research testuser1 1 normal,high,gpu normal cluster demo 20 normal normal cluster demo testuser2 1 normal normal Note on wpaik appearing twice in sshare: Because wpaik has associations in both root (from Episode 5) and research (new), sshare -l will show wpaik under both accounts. The root entry carries the usage history from Episode 5 jobs. The research entry starts at zero. This is expected and normal. Jobs submitted going forward will be charged to the research account by default.\n3. QOS: Giving Jobs Different Weights # Quality of Service (QOS) lets you attach rules to jobs: how long they can run, how many resources they can request, and how much priority they carry in the queue. Without QOS, every job competes on the same terms.\nWe will create three:\nQOS Priority MaxWall Purpose normal 0 24 hours Default for all jobs high 100 4 hours Short, urgent jobs that jump the queue gpu 50 8 hours GPU partition jobs Create and configure them:\n[wpaik@arbiter ~]$ sudo sacctmgr -i add qos normal [wpaik@arbiter ~]$ sudo sacctmgr -i modify qos normal set Priority=0 MaxWallDurationPerJob=1-00:00:00 [wpaik@arbiter ~]$ sudo sacctmgr -i add qos high [wpaik@arbiter ~]$ sudo sacctmgr -i modify qos high set Priority=100 MaxWallDurationPerJob=04:00:00 [wpaik@arbiter ~]$ sudo sacctmgr -i add qos gpu [wpaik@arbiter ~]$ sudo sacctmgr -i modify qos gpu set Priority=50 MaxWallDurationPerJob=08:00:00 Assign valid QOS to each account. The research group gets access to all three. 
demo gets normal only.\n[wpaik@arbiter ~]$ sudo sacctmgr -i modify account name=research set qos=normal,high,gpu defaultqos=normal [wpaik@arbiter ~]$ sudo sacctmgr -i modify account name=demo set qos=normal defaultqos=normal To use a non-default QOS in a job script:\n#SBATCH --qos=high If a user from demo tries to submit with --qos=high, Slurm rejects it at submission before the job ever enters the queue:\nsbatch: error: Batch job submission failed: Invalid qos specification Note: The high QOS carries a hard 4-hour wall limit. A user cannot request --qos=high together with --time=08:00:00. The priority boost comes with the cost of a shorter runtime cap. This is intentional.\n4. Fair Share: Making Heavy Usage Cost Something # Without fair share, Slurm schedules by submission order (FIFO). The first job in queue runs first, regardless of whether that user submitted one job or a hundred in the past week.\nFair share changes this by tracking historical usage and adjusting priority. Users who have consumed more than their entitlement get lower priority. Users who have consumed less get higher priority. The effect is self-correcting: heavy usage today means lower priority tomorrow.\nEnabling Fair Share # Add these lines to /etc/slurm/slurm.conf on arbiter:\n[wpaik@arbiter ~]$ sudo vim /etc/slurm/slurm.conf # Priority / Fair Share PriorityType=priority/multifactor PriorityWeightFairShare=100000 PriorityWeightAge=1000 PriorityDecayHalfLife=5-0 PriorityMaxAge=7-0 AccountingStorageEnforce=associations,qos PriorityType=priority/multifactor switches Slurm from FIFO to a weighted multi-factor priority model. This line activates everything else in this section.\nPriorityWeightFairShare=100000 makes fair share the dominant factor in priority calculation. Other factors like job age still count, but usage history drives most of the scheduling decision.\nPriorityWeightAge=1000 adds a small, steadily increasing bonus to jobs that have been waiting longer. 
This prevents starvation: even a heavy user with a low fair share score will eventually see their job run as the age bonus accumulates.\nPriorityDecayHalfLife=5-0 controls how long the scheduler remembers past usage. Every 5 days, accumulated usage counts as half. A CPU-hour consumed today carries twice the weight of the same CPU-hour from 5 days ago.\nPriorityMaxAge=7-0 caps the age bonus at 7 days. A job stuck in queue for two weeks does not keep accumulating priority indefinitely.\nAccountingStorageEnforce=associations,qos makes Slurm actually enforce the accounting rules at submission time. associations rejects jobs from users not in the accounting database. qos rejects jobs that request a QOS the user\u0026rsquo;s account does not have access to. Without this line, sacctmgr QOS assignments are recorded in the database but never checked. A user in the demo account could still submit with --qos=high and it would run.\nDecay vs. Reset # There are two ways to handle historical usage:\nApproach Parameter Behavior Gradual decay PriorityDecayHalfLife=5-0 Exponential fade, old usage gradually loses weight Hard reset PriorityUsageResetPeriod=MONTHLY All usage zeroes out on a fixed calendar interval Hard reset is conceptually simple: everyone starts clean on the 1st of each month. But it creates a cliff. Usage from the 2nd of the month carries full weight until the reset, then drops to zero overnight. A user who over-consumed in the first week has no incentive to back off for the rest of the month.\nDecay avoids the cliff. With a 5-day half-life, usage from a week ago counts as roughly 38% of today\u0026rsquo;s, and usage from 10 days ago as one-quarter. There is no sudden reset moment. Priority adjusts continuously. The 5-day half-life is a reasonable starting point: short enough that a burst job does not penalize you for weeks, long enough that the scheduler actually remembers it. 
Production HPC sites typically land in the 1-7 day range depending on how quickly they want heavy users to recover their standing.\nHow It Works in Practice # With research at share=80 and demo at share=20:\nIf wpaik has been running jobs for three days straight, their FairShare score drops well below 1.0. testuser2 in demo has not run anything. Their FairShare score stays at 1.0. If both submit jobs at the same moment, testuser2\u0026rsquo;s job may run first despite their account having the smaller share allocation, because testuser2 has consumed nothing. This is intentional. Fair share is about actual usage relative to entitlement, not raw entitlement. research having share=80 means they get 80% of the cluster when everyone competes simultaneously. It does not mean their jobs always run first.\n5. Partition Limits # Both partitions currently have MaxTime=UNLIMITED and no DefaultTime. A job submitted without --time gets unlimited wall time, which means it can block resources indefinitely if it gets stuck.\nUpdate the partition definitions in slurm.conf on arbiter. Replace the existing PartitionName= lines with:\nPartitionName=cpu Nodes=interceptor-[01-02] Default=YES MaxTime=1-00:00:00 DefaultTime=01:00:00 State=UP PartitionName=gpu Nodes=corsair-01 Default=NO MaxTime=08:00:00 DefaultTime=01:00:00 AllowQos=normal,gpu State=UP DefaultTime=01:00:00 assigns a 1-hour limit to any job submitted without --time. This is the more important of the two parameters. Without a default, forgetting --time silently requests unlimited runtime.\nMaxTime=1-00:00:00 caps all CPU jobs at 24 hours. Anything legitimately longer should be checkpointing at the 24-hour mark anyway.\nAllowQos=normal,gpu on the GPU partition prevents high QOS jobs from landing on the GPU. The priority shortcut is for short CPU work, not for jumping the GPU queue.\n6. Applying the Changes # The sacctmgr changes (accounts, users, QOS assignments) are already live in the database. 
No restart required.\nThe slurm.conf changes (priority settings, partition limits) need to be distributed to all nodes and slurmctld needs to restart.\n# Distribute updated slurm.conf to all nodes [wpaik@arbiter ~]$ ansible all_nodes -b -m copy \\ -a \u0026#34;src=/etc/slurm/slurm.conf dest=/etc/slurm/slurm.conf owner=slurm group=slurm mode=0644\u0026#34; # Restart slurmctld on arbiter [wpaik@arbiter ~]$ sudo systemctl restart slurmctld # Tell all slurmd daemons to re-read their config and recompute the hash [wpaik@arbiter ~]$ sudo scontrol reconfigure # Verify [wpaik@arbiter ~]$ sudo systemctl status slurmctld [wpaik@arbiter ~]$ tail -n 20 /var/log/slurm/slurmctld.log The scontrol reconfigure step is important. After distributing slurm.conf and restarting slurmctld, the controller computes a new config hash from the updated file. Without scontrol reconfigure, the slurmd daemons on compute nodes are still holding the old hash, and Slurm will log config mismatch warnings. scontrol reconfigure sends a signal to all slurmd daemons to re-read their copy of slurm.conf and resync.\n7. Verification # Account and QOS structure # [wpaik@arbiter ~]$ sacctmgr show associations format=cluster,account,user,share,qos,defaultqos [wpaik@arbiter ~]$ sacctmgr show qos format=name,priority,maxwall,flags Fair share tree # [wpaik@carrier ~]$ sshare -l Account User RawShares NormShares RawUsage EffectvUsage FairShare -------------------- ---------- ---------- ----------- --------- ------------- ---------- root 1 0.000000 0 0.000000 1.000000 root wpaik 1 0.009804 0 0.000000 1.000000 research 80 0.784314 0 0.000000 inf research wpaik 1 0.500000 0 0.000000 1.000000 research testuser1 1 0.500000 0 0.000000 1.000000 demo 20 0.196078 0 0.000000 inf demo testuser2 1 1.000000 0 0.000000 1.000000 A few things to notice in the output. wpaik appears under both root and research because they have associations in both accounts. This is expected. 
The inf values for the research and demo account rows mean those accounts have zero usage so far and Slurm cannot compute a normalized ratio. Once jobs run under those accounts, inf is replaced by a real number. Submit a few jobs as wpaik and re-run sshare to watch the scores change.\nJob priority breakdown # [wpaik@carrier ~]$ sprio -l This shows each queued job\u0026rsquo;s priority broken down by factor: FairShare contribution, Age contribution, QOS contribution. Reach for this when you want to understand why one job is ahead of another.\nJob history # [wpaik@carrier ~]$ sacct -u wpaik --format=JobID,JobName,Partition,Account,AllocCPUS,State,Elapsed Partition limits # [wpaik@carrier ~]$ scontrol show partition cpu [wpaik@carrier ~]$ scontrol show partition gpu The GPU partition should show:\nPartitionName=gpu AllowGroups=ALL AllowAccounts=ALL AllowQos=normal,gpu DefaultTime=01:00:00 MaxTime=08:00:00 Nodes=corsair-01 ... Key fields to check: AllowQos=normal,gpu (not ALL), MaxTime=08:00:00, DefaultTime=01:00:00. If AllowQos=ALL or MaxTime=UNLIMITED is still showing, see the troubleshooting section below.\n8. Troubleshooting # slurm.conf hash mismatch warnings in slurmctld.log\nAfter restarting slurmctld, you may see errors like:\nerror: Node interceptor-01 appears to have a different slurm.conf than the slurmctld. This happens even when the file content is identical on all nodes. The cause: slurmctld restarted with the new config and computed a new hash, but the slurmd daemons on compute nodes are still holding the hash from the old file. The files match but the hashes in memory do not.\n[wpaik@arbiter ~]$ sudo scontrol reconfigure This signals all slurmd daemons to re-read their slurm.conf and recompute the hash. 
The warnings should stop appearing in subsequent log entries.\nscontrol show partition gpu still shows AllowQos=ALL after reconfigure\nFirst confirm the PartitionName=gpu line in slurm.conf was actually updated on arbiter:\n[wpaik@arbiter ~]$ grep \u0026#34;PartitionName=gpu\u0026#34; /etc/slurm/slurm.conf The line should contain AllowQos=normal,gpu. If it does, verify the file was distributed to all nodes:\n[wpaik@arbiter ~]$ ansible all_nodes -b -m shell \\ -a \u0026#34;grep \u0026#39;PartitionName=gpu\u0026#39; /etc/slurm/slurm.conf\u0026#34; If any node has the old version, re-run the ansible copy task. Then restart and reconfigure:\n[wpaik@arbiter ~]$ sudo systemctl restart slurmctld [wpaik@arbiter ~]$ sudo scontrol reconfigure [wpaik@carrier ~]$ scontrol show partition gpu | grep AllowQos slurmctld fails to start after adding PriorityType\nCheck the controller log first:\n[wpaik@arbiter ~]$ tail -n 50 /var/log/slurm/slurmctld.log The most common cause is slurmdbd being unreachable when slurmctld starts. PriorityType=priority/multifactor requires the accounting database to be available at startup.\n[wpaik@arbiter ~]$ sudo systemctl status slurmdbd [wpaik@arbiter ~]$ sudo systemctl restart slurmdbd [wpaik@arbiter ~]$ sudo systemctl restart slurmctld Always start slurmdbd before slurmctld.\nsacctmgr -i add account returns an error saying the account already exists\nThe setup script is not idempotent. If you ran part of it before, some accounts may already be in the database.\n[wpaik@arbiter ~]$ sacctmgr show account If the account already exists, skip the add and use modify instead. If the account exists but with wrong fairshare values:\n[wpaik@arbiter ~]$ sudo sacctmgr -i modify account name=research set fairshare=80 sshare shows all zeros or no FairShare data\nThis means PriorityType=priority/multifactor is not active yet. 
Either the slurm.conf change was not applied or slurmctld was not restarted after the change.\n# Confirm the setting is live [wpaik@arbiter ~]$ scontrol show config | grep PriorityType If it still shows basic, restart slurmctld and check again.\nJob rejected: \u0026ldquo;Invalid qos specification\u0026rdquo;\nThe user\u0026rsquo;s account does not have that QOS in its allowed list.\n[wpaik@arbiter ~]$ sacctmgr show associations format=account,user,qos where user=testuser2 If the QOS is missing, add it:\n[wpaik@arbiter ~]$ sudo sacctmgr -i modify account name=demo set qos=normal,high 9. What is Next # The cluster now has a proper multi-user accounting structure. Jobs run under accounts with defined share weights, users who consume more yield priority over time, and partitions have time limits that protect other users from runaway jobs.\nNext episode: Lmod. We will install the Lmod module system and set up real environment modules for software installed on the cluster.\nAll configuration files and sacctmgr setup scripts from this episode are in the GitHub repository.\nHappy Computing!\n","date":"7 June 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-06/","section":"Posts","summary":"Without accounting, Slurm treats every user the same regardless of how much they have already consumed. Episode 6 of HPC From Scratch builds the full accounting layer: account hierarchy with sacctmgr, QOS policies with wall-time limits, and fair share scheduling with exponential decay. 
The result is a scheduler that adapts to actual usage and prevents any single user from dominating the queue.","title":"[HPC From Scratch] Episode 6: Slurm Accounting, QOS, and Fair Share","type":"posts"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/cluster/","section":"Tags","summary":"","title":"Cluster","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/home-lab/","section":"Tags","summary":"","title":"Home Lab","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/series/hpc-from-scratch/","section":"Series","summary":"","title":"HPC From Scratch","type":"series"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/linux/","section":"Tags","summary":"","title":"Linux","type":"tags"},{"content":"","date":"7 June 2026","externalUrl":null,"permalink":"/tags/slurm/","section":"Tags","summary":"","title":"Slurm","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/ansible/","section":"Tags","summary":"","title":"Ansible","type":"tags"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/bash/","section":"Tags","summary":"","title":"Bash","type":"tags"},{"content":"A command-line utility for detecting and remediating RPM package inconsistencies across HPC cluster nodes. Built to address a real operational problem: nodes that silently diverge over time cause hard-to-diagnose job failures, and tracking down the cause by hand does not scale.\nWhat it does # The tool compares installed packages between a baseline node and one or more target nodes, separating results into three distinct categories:\nMissing packages are present in the baseline but absent on the target. A dnf install quick-fix command is included in the report.\nExtra packages exist on the target but not in the baseline. 
These are reported separately and left untouched by default, since they are often installed intentionally (GPU-specific tools, local debugging utilities).\nVersion mismatches are packages present on both sides but at different versions. Each mismatch includes an action field (upgrade or downgrade) derived from RPM\u0026rsquo;s own version comparison logic, so downstream automation knows exactly what to do.\nIn partition sweep mode, the tool queries Slurm for all active nodes, prompts for an interactive baseline selection, audits every target node, and writes per-node report files only for nodes with differences. A separate extras_summary.txt groups all extra packages by node across the full sweep.\nWhy it was built # Managing a multi-node HPC cluster means nodes drift. A one-off dnf install here, a skipped update there, and the environment across nodes is no longer consistent. The existing approach of SSHing into nodes individually and comparing rpm -qa output by hand does not work at scale and misidentifies version differences as missing packages. 
This tool was built to replace that workflow with something repeatable and automation-friendly.\nSample output # Partition sweep # [INFO] Fetching node list for partition: cpu Found 4 up node(s): compute-[01-04] SSH : ssh Format : text Select a baseline node: [ 1] compute-01 [ 2] compute-02 [ 3] compute-03 [ 4] compute-04 Enter node number or hostname: 1 [INFO] Starting Partition Sweep Partition : cpu Baseline : compute-01 Targets : 3 node(s) Parallel : 1 job(s) ====================================================== Summary: [OK] compute-02 [DIFF] compute-03 (2 issue(s)) [DIFF] compute-04 (1 issue(s)) ====================================================== Results: Clean : 1 / 4 Diffs : 2 / 4 Reports saved to: ./pkg_audit_reports/ ./pkg_audit_reports/audit_compute-03.txt ./pkg_audit_reports/audit_compute-04.txt Extras summary: ./pkg_audit_reports/extras_summary.txt ====================================================== Per-node report (text) # ====================================================== Package Audit Report Baseline : compute-01 Target : compute-03 Generated: Thu May 22 10:30:01 EDT 2026 ====================================================== [MISSING] 1 package(s) in baseline but NOT in compute-03: ------------------------------------------------------ nvtop (baseline: 3.3.1-2.el10_1) \u0026gt;\u0026gt; Quick Fix: ssh compute-03 \u0026#39;sudo dnf install -y nvtop\u0026#39; [VERSION MISMATCH] 1 package(s) with different versions: ------------------------------------------------------ curl baseline: 8.12.1-2.el10_1.2 target: 8.12.1-1.el10_1 action: upgrade ====================================================== Per-node report (JSON) # { \u0026#34;node\u0026#34;: \u0026#34;compute-03\u0026#34;, \u0026#34;baseline\u0026#34;: \u0026#34;compute-01\u0026#34;, \u0026#34;generated\u0026#34;: \u0026#34;2026-05-22T14:30:01Z\u0026#34;, \u0026#34;missing\u0026#34;: [ {\u0026#34;name\u0026#34;: \u0026#34;nvtop\u0026#34;, \u0026#34;baseline_ver\u0026#34;: 
\u0026#34;3.3.1-2.el10_1\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;install\u0026#34;} ], \u0026#34;extra\u0026#34;: [], \u0026#34;version_mismatch\u0026#34;: [ { \u0026#34;name\u0026#34;: \u0026#34;curl\u0026#34;, \u0026#34;baseline_ver\u0026#34;: \u0026#34;8.12.1-2.el10_1.2\u0026#34;, \u0026#34;target_ver\u0026#34;: \u0026#34;8.12.1-1.el10_1\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;upgrade\u0026#34; } ] } Ansible integration # JSON output is structured for direct use with the included Ansible playbooks:\nremediate.yml reads each node\u0026rsquo;s JSON report and installs missing packages, upgrading or downgrading version mismatches as the action field specifies. remove_extra.yml is kept as a separate file to require a deliberate choice before removing anything. It supports --check dry runs and is designed to be reviewed against extras_summary.txt before execution. Technical details # Language: Bash 4+ Package query: rpm --queryformat for clean name/version separation Version comparison: python3-rpm for RPM-native version ordering Scheduler integration: Slurm (sinfo, scontrol) for partition sweep mode Output formats: text (human-readable), JSON (Ansible-ready), CSV (scripting/spreadsheet) Parallelism: GNU Parallel with xargs -P fallback; sequential by default for login node safety SSH: plain ssh by default, optional sudo mode via -s flag for clusters with restricted inter-node access Target platform: RPM-based Linux (Rocky Linux 9/10, RHEL, CentOS) Repository # github.com/willgpaik/pkg_audit\n","date":"25 May 2026","externalUrl":null,"permalink":"/portfolio/pkg-audit/","section":"Portfolio","summary":"","title":"pkg_audit: Cluster Package Consistency Audit Tool","type":"portfolio"},{"content":"","date":"25 May 2026","externalUrl":null,"permalink":"/tags/sysadmin/","section":"Tags","summary":"","title":"SysAdmin","type":"tags"},{"content":"The cluster has storage and authentication. 
Now it needs a brain.\nIn Episode 4, we set up NFS shared storage, FreeIPA centralized authentication, and Ansible for cluster management. Every node shares the same home directory and user accounts work everywhere.\nBut right now, if you want to run a job, you SSH into a compute node and run it directly. That is fine for one person on one node. It falls apart the moment two people try to use the same node at the same time, or when you need to coordinate work across multiple nodes. That is what a job scheduler solves.\nThis episode covers Slurm: why we build it from source, how Munge handles authentication between nodes, what slurm.conf actually controls, and how to submit your first real cluster job.\n*(Click the image to watch the tutorial on YouTube)* 1. What Slurm Actually Does # Without a job scheduler, a shared cluster works like a kitchen with no coordination. Everyone grabs resources when they want them. One person\u0026rsquo;s job starves another. There is no way to ask for two nodes at once and have them guaranteed to be free at the same time.\nSlurm is the receptionist from the HPC 101 series, at scale. It tracks every CPU, every gigabyte of memory, and every GPU across all nodes. When you submit a job, Slurm holds it in a queue until the requested resources are available, then assigns it to the right nodes and runs it.\nThe three components we need:\nslurmctld runs on the management node (arbiter). It is the controller: maintains the queue, makes scheduling decisions, and talks to the compute nodes.\nslurmd runs on each compute node. It receives job assignments from the controller, runs the actual work, and reports back.\nslurmdbd also runs on arbiter. It connects Slurm to a MariaDB database and records every job: who ran it, how long it took, how much CPU and memory it used. This powers seff, sacct, and fair share scheduling.\nOur cluster layout:\n2. Why Build from Source # The obvious question is why not just dnf install slurm. 
There are two reasons.\nVersion control. When you run dnf upgrade on all nodes, Slurm gets upgraded too. A version mismatch between slurmctld and slurmd breaks the cluster. The controller and compute nodes must run identical versions. Building from source and distributing RPMs means you control exactly when Slurm gets updated, separate from the rest of the system.\nFeature support. Rocky Linux 10 runs cgroup v2 by default. Older Slurm builds default to cgroup v1, which causes job accounting and memory tracking to fail silently. Building from source lets you pass --with cgroupv2 explicitly. Similarly, PMIx support for MPI job launching requires build flags that are not included in the standard distribution packages.\nThe build process compiles Slurm on the management node (arbiter) and packages it as RPMs, which then get distributed to all other nodes via Ansible.\n# Build on arbiter, targeting Slurm 25.11.1 rpmbuild -ta slurm-25.11.1.tar.bz2 \\ --define \u0026#34;_slurm_sysconfdir /etc/slurm\u0026#34; \\ --with cgroupv2 \\ --with pmix EPEL for runtime dependencies # The build pulls in gtk2-devel as a development dependency, which causes the resulting slurm base RPM to depend on the GTK2 runtime libraries libgdk-x11-2.0.so.0 and libgtk-x11-2.0.so.0 (used by sview, Slurm\u0026rsquo;s GUI viewer). On Rocky Linux 10 these libraries are not in the default repositories. They live in EPEL, so EPEL must be enabled on every node before the install step in section 4, or dnf rejects the local RPMs with a depsolve error.\n[wpaik@arbiter ansible]$ ansible all_nodes -b -m dnf -a \u0026#34;name=epel-release state=present\u0026#34; If you prefer to avoid the GTK2 dependency entirely, pass --without gtk to rpmbuild and sview gets dropped from the build. HPC compute nodes never run sview anyway, so this is the cleaner option for a headless cluster.\nAll build dependencies, the full build playbook, and the RPM distribution playbook are in the GitHub repository.\n3. 
Munge: The Authentication Layer # Before Slurm can communicate between nodes, it needs a way to verify that messages are actually coming from the cluster and not from somewhere else. That is Munge\u0026rsquo;s job.\nMunge generates encrypted tokens using a shared secret key. Every node in the cluster has the same key at /etc/munge/munge.key. When slurmctld sends a message to slurmd, it attaches a Munge token. The compute node decrypts it with the shared key and verifies the message is legitimate.\nThe key is generated once on arbiter and distributed to all nodes by Ansible:\n# Generate key on arbiter dd if=/dev/urandom bs=1 count=1024 \u0026gt; /etc/munge/munge.key chmod 400 /etc/munge/munge.key chown munge:munge /etc/munge/munge.key Critical: Slurm UID must match across all nodes.\nMunge verifies not just the key but also the UID of the process that created the token. If the slurm user has UID 386 on arbiter and UID 990 on interceptor-01, Munge will reject the token with a security violation error. The cluster will appear to start but jobs will never run.\nWe set a fixed UID of 1111 for the Slurm user on every node before installing Slurm:\ngroupadd -g 1111 slurm useradd -u 1111 -g slurm -s /bin/bash -d /var/lib/slurm slurm Verify all nodes have matching UIDs:\n[wpaik@arbiter ansible]$ ansible all_nodes -m shell -a \u0026#34;id slurm\u0026#34; -b arbiter.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) interceptor-01.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) interceptor-02.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) corsair-01.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) carrier.cluster.local | rc=0 \u0026gt;\u0026gt; uid=1111(slurm) gid=1111(slurm) groups=1111(slurm) All matching. 
Verify Munge is running and the shared key works:\n# Test Munge authentication locally $ munge -n | unmunge # Test across nodes $ munge -n | ssh interceptor-01.cluster.local unmunge STATUS: Success (0) ENCODE_HOST: arbiter.cluster.local (192.168.50.50) DECODE_HOST: interceptor-01.cluster.local (192.168.50.15) MUNGE_UID: slurm (1111) Note on firewall: Worker nodes have firewalld disabled. The login node (carrier) has its internal interface in the trusted zone. If you are running firewalld on compute nodes, open ports 6817 (slurmctld), 6818 (slurmd), and 6819 (slurmdbd).\n4. Installing Slurm # After building the RPMs on arbiter, Ansible distributes and installs them across the cluster. Each node gets a different set of packages depending on its role.\nNode type Packages Management (arbiter) slurm, slurmctld, slurmdbd, mariadb Compute (interceptor, corsair) slurm, slurmd, slurm-libpmi Login (carrier) slurm, slurm-contribs (includes seff) slurm-libpmi on the compute nodes provides the PMI2 and PMIx libraries that MPI implementations use to launch parallel processes via srun. Without it, MPI jobs fail with PMI version errors when trying to use srun as the launcher.\nslurm-contribs on the login node includes seff, the job efficiency tool. It reads accounting data from slurmdbd and shows you exactly how much CPU and memory your job actually used versus what you requested.\nThe install playbook expects two things to already be true: EPEL is enabled on every node (section 2), and the Ansible controller\u0026rsquo;s remote_tmp points to a local path on the target nodes (set in Episode 4\u0026rsquo;s ansible.cfg). The second one matters because the install copies RPMs through Ansible\u0026rsquo;s staging directory. If that directory lives on NFS (the default location on this cluster, since /home is NFS-mounted), the RPMs inherit the nfs_t SELinux context, and dnf rejects them with a confusing No match for argument error even though the file is plainly on disk. 
The remote_tmp = /var/tmp/.ansible-${USER}/tmp line in ansible.cfg keeps the staging area on local disk and avoids the trap.\nAfter installation completes successfully, pin the Slurm version in dnf so a future dnf upgrade does not pull a different build (most notably from EPEL, which ships its own slurm packages without our cgroup v2 and PMIx flags). The install playbook handles this as its last step:\nansible all_nodes -b -m shell -a \u0026#34;echo \u0026#39;exclude=slurm*\u0026#39; \u0026gt;\u0026gt; /etc/dnf/dnf.conf\u0026#34; # Verify ansible all_nodes -b -m shell -a \u0026#34;grep slurm /etc/dnf/dnf.conf\u0026#34; The order matters: pin after the install succeeds, never before. Pinning before install causes dnf to refuse to install slurm at all, again with a No match for argument error. When you eventually need to upgrade Slurm, remove the line first, rebuild, reinstall, and the playbook re-adds the pin at the end.\nThe complete installation playbooks are in the GitHub repository under ep05-slurm/playbooks/.\n5. Configuring Slurm # All Slurm configuration lives in /etc/slurm/slurm.conf on every node. The file must be identical across the cluster. 
We generate it on arbiter and distribute it via Ansible.\nHere is the complete slurm.conf for this cluster:\n# Cluster identity ClusterName=cluster SlurmctldHost=arbiter SlurmUser=slurm AuthType=auth/munge # Scheduling SchedulerType=sched/backfill SelectType=select/cons_tres SelectTypeParameters=CR_Core_Memory # Logging SlurmctldDebug=info SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdDebug=debug SlurmdLogFile=/var/log/slurm/slurmd.log # State and PID files StateSaveLocation=/var/spool/slurmctld SlurmdSpoolDir=/var/spool/slurmd SlurmctldPidFile=/run/slurm/slurmctld.pid SlurmdPidFile=/run/slurm/slurmd.pid # Cgroup (v2) ProctrackType=proctrack/cgroup TaskPlugin=task/cgroup,task/affinity # Job accounting JobAcctGatherType=jobacct_gather/cgroup JobAcctGatherFrequency=30 AccountingStorageType=accounting_storage/slurmdbd AccountingStorageHost=arbiter.cluster.local AccountingStoragePort=6819 JobCompType=jobcomp/none AccountingStorageTRES=gres/gpu AccountingStoreFlags=job_comment,job_env,job_script # GPU support ReturnToService=1 GresTypes=gpu # MPI default MpiDefault=pmix # Nodes NodeName=interceptor-01 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15413 State=UNKNOWN NodeName=interceptor-02 CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=15413 State=UNKNOWN NodeName=corsair-01 CPUs=16 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=30802 Gres=gpu:nvidia_geforce_gtx_1660_super:1 State=UNKNOWN # Partitions PartitionName=cpu Nodes=interceptor-01,interceptor-02 Default=YES MaxTime=INFINITE State=UP PartitionName=gpu Nodes=corsair-01 Default=NO MaxTime=INFINITE State=UP A few things worth noting:\nRealMemory values come from running free -m on each node, same as in Episode 2 for the iGPU memory trap. The values here reflect what the OS actually reports after hardware reservations. Do not use the installed RAM number.\nThe M715q nodes each have 16GB installed, but the integrated Vega GPU reserves a portion as VRAM. 
The exact amount depends on the BIOS UMA Frame Buffer Size setting. If this is left on Auto, different nodes may end up with slightly different values even with identical hardware. In Episode 2 we pinned arbiter\u0026rsquo;s UMA setting to 256MB explicitly. If your compute nodes still show different free -m totals, check the UMA setting in each node\u0026rsquo;s BIOS and pin them to the same value. The slurm.conf RealMemory for each node should match that node\u0026rsquo;s actual free -m total output.\nMpiDefault=pmix sets PMIx as the default MPI process management interface for srun. Without this, srun uses Slurm\u0026rsquo;s built-in default of none unless --mpi is passed explicitly, which leads to PMI errors with OpenMPI when launching parallel jobs. If you see MPI jobs hanging or failing with PMI version errors, this is the first thing to check.\nSelectTypeParameters=CR_Core_Memory tells Slurm to track both cores and memory when allocating resources. This is required for seff to report memory usage accurately.\nThe cgroup configuration lives in a separate file:\n# /etc/slurm/cgroup.conf ConstrainCores=yes ConstrainRAMSpace=yes ConstrainSwapSpace=no ConstrainDevices=yes ConstrainCores and ConstrainRAMSpace enforce the resource limits you request in your job script. If your job tries to use more memory than requested, Slurm kills it with an out-of-memory error rather than letting it consume resources silently. This requires cgroup v2, which is confirmed on this cluster:\n$ stat -fc %T /sys/fs/cgroup cgroup2fs MariaDB and slurmdbd store accounting data. The setup creates a slurm_acct_db database and a slurm database user, then configures slurmdbd to connect to it. The slurmdbd configuration in /etc/slurm/slurmdbd.conf must have mode 600 and be owned by the slurm user, or slurmdbd will refuse to start.\n6. Disabling Swap on Compute Nodes # Swap needs to be disabled on compute nodes before running Slurm jobs. When ConstrainRAMSpace=yes is set in cgroup.conf, Slurm enforces memory limits via cgroup. 
If swap is active, a process that hits the RAM limit can spill into swap instead of being killed, which defeats the memory constraint and makes seff memory reporting inaccurate.\nThe login node (carrier) and management node (arbiter) can keep swap enabled since they do not run compute jobs.\nDisable swap permanently on compute nodes via systemd:\nansible workers,gpu -b -m systemd \\ -a \u0026#34;name=swap.target state=stopped enabled=no\u0026#34; Verify after the next reboot:\n$ cat /proc/swaps Filename Type Size Used Priority # Empty output means swap is off Note: The swap UUID may still appear in /etc/fstab. This is fine as long as swap.target is disabled in systemd. The unit will fail to activate on boot with a dependency error, which is the expected behavior.\n7. Starting the Cluster # Services must start in order. slurmdbd must be running before slurmctld tries to connect to it.\n# On arbiter $ sudo systemctl start mariadb $ sudo systemctl start slurmdbd $ sudo systemctl start slurmctld # On each compute node $ sudo systemctl start slurmd After services are up, initialize the accounting database:\n$ sacctmgr -i add cluster cluster $ sacctmgr -i add account root Description=\u0026#34;Root\u0026#34; Organization=\u0026#34;Cluster\u0026#34; $ sacctmgr -i add user wpaik Account=root Check cluster status:\n[wpaik@carrier ~]$ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST cpu* up infinite 2 idle interceptor-[01-02] gpu up infinite 1 idle corsair-01 All nodes idle and ready. If nodes show as down or drain instead of idle, resume them:\n$ scontrol update NodeName=ALL State=RESUME 8. 
Submitting Your First Jobs # Interactive Job # [wpaik@carrier ~]$ srun --pty bash [wpaik@interceptor-01 ~]$ hostname interceptor-01 [wpaik@interceptor-01 ~]$ exit srun assigned you to interceptor-01 because it is the first node in the default cpu partition.\nBatch Job # Create a simple batch script:\n#!/bin/bash #SBATCH --job-name=hello #SBATCH --partition=cpu #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --mem=500M #SBATCH --time=00:05:00 #SBATCH --output=hello_%j.out echo \u0026#34;Running on: $(hostname)\u0026#34; echo \u0026#34;Job ID: $SLURM_JOB_ID\u0026#34; date sleep 10 echo \u0026#34;Done.\u0026#34; Submit and monitor:\n$ sbatch hello.sh Submitted batch job 1 $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST 1 cpu hello wpaik R 0:03 1 interceptor-01 $ cat hello_1.out Running on: interceptor-01 Job ID: 1 Fri May 9 21:00:00 EDT 2026 Done. Multi-Node Job # #!/bin/bash #SBATCH --job-name=multinode #SBATCH --partition=cpu #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 #SBATCH --mem-per-cpu=1G #SBATCH --output=multinode_%j.out srun hostname $ sbatch multinode.sh Submitted batch job 2 $ cat multinode_2.out interceptor-01 interceptor-01 interceptor-01 interceptor-01 interceptor-02 interceptor-02 interceptor-02 interceptor-02 Eight tasks across two physical machines, coordinated by Slurm.\nGPU Job # #!/bin/bash #SBATCH --job-name=gpu_test #SBATCH --partition=gpu #SBATCH --nodes=1 #SBATCH --gres=gpu:1 #SBATCH --output=gpu_%j.out nvidia-smi Checking Efficiency with seff # After a job completes, check how efficiently it used the requested resources:\n$ seff 1 Job ID: 1 Cluster: cluster User/Group: wpaik/wpaik State: COMPLETED (exit code 0) Cores: 1 CPU Utilized: 00:00:01 CPU Efficiency: 10.00% of 00:00:10 core-walltime Job Wall-clock time: 00:00:10 Memory Utilized: 1.20 MB Memory Efficiency: 0.24% of 500.00 MB CPU efficiency is low because sleep 10 does nothing. Memory efficiency is low because we requested 500MB but the script barely used any. 
This is exactly the kind of feedback seff is designed to give. Right-size your resource requests based on what jobs actually use.\n9. Common Issues # Nodes stuck in down or drain state after startup\n$ scontrol update NodeName=ALL State=RESUME If they keep going back to down, check the slurmd log on the affected node:\n$ ssh interceptor-01 \u0026#34;sudo tail -n 50 /var/log/slurm/slurmd.log\u0026#34; Slurm UID mismatch (Security violation)\nIf srun hangs or you see authentication errors in the logs, check that the slurm user has the same UID on every node:\n$ ansible all_nodes -m shell -a \u0026#34;id slurm\u0026#34; -b If UIDs differ, use 08_sync_slurm_uid.yaml from the GitHub repository to fix them. Note that if the target UID is occupied by another system user on a particular node, you will need to reassign that user to a different UID first before moving slurm into place.\nMPI jobs fail with PMI errors\nCheck that MpiDefault=pmix is in slurm.conf and that slurm-libpmi is installed on compute nodes. Also verify that the PMIx security mode is set:\n$ cat /etc/profile.d/pmix.sh export PMIX_MCA_psec=native slurmdbd fails to start\nCheck permissions on /etc/slurm/slurmdbd.conf. It must be mode 600 and owned by the slurm user:\n$ ls -la /etc/slurm/slurmdbd.conf -rw------- 1 slurm slurm 312 Apr 27 09:00 /etc/slurm/slurmdbd.conf Also verify MariaDB is running before starting slurmdbd:\n$ sudo systemctl status mariadb seff shows no memory data\nseff requires JobAcctGatherType=jobacct_gather/cgroup in slurm.conf and ConstrainRAMSpace=yes in cgroup.conf. Both require cgroup v2. Verify with stat -fc %T /sys/fs/cgroup.\ndnf install fails with No match for argument even though the RPM is on disk\nTwo distinct causes both surface as this same error:\nSELinux context inherited from NFS. Ansible\u0026rsquo;s per-task staging directory defaults to ~/.ansible/tmp/, which on this cluster lives on NFS-mounted /home. 
Files copied through it pick up the nfs_t SELinux context, and dnf silently refuses to handle them as local RPMs. Confirm with ls -lZ /tmp/slurm_rpms/ — if the context is nfs_t, this is it. The permanent fix is the remote_tmp = /var/tmp/.ansible-${USER}/tmp line in ansible.cfg from Episode 4. As an immediate workaround:\nsudo restorecon -Rv /tmp/slurm_rpms/ dnf exclude pinning was added before install. If /etc/dnf/dnf.conf already contains exclude=slurm* from a previous run, dnf strips the matching argument and reports it as missing. Check with grep slurm /etc/dnf/dnf.conf. For a reinstall, either remove the line first or pass --disableexcludes=all:\nsudo dnf install -y --disableexcludes=all /tmp/slurm_rpms/slurm-*.rpm dnf install fails with nothing provides libgdk-x11-2.0.so.0 or libgtk-x11-2.0.so.0\nEPEL is not enabled on the failing node. The Slurm base RPM depends on GTK2 runtime libraries that are not in Rocky 10\u0026rsquo;s default repositories. Install EPEL on the affected node and retry:\nsudo dnf install -y epel-release Or rebuild Slurm with --without gtk so the GTK2 dependency is removed entirely.\n10. What is Next # The cluster is now a real HPC system. Jobs are scheduled, resources are tracked, and seff shows efficiency data after each run.\nThe next episode covers Slurm accounting in depth: setting up accounts and users in slurmdbd, configuring partitions with resource limits, and fair share scheduling so heavy users do not monopolize the cluster.\nAll Ansible playbooks, configuration files, and the Slurm build scripts from this episode are in the GitHub repository.\nHappy Computing!\n","date":"20 May 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-05/","section":"Posts","summary":"Getting Slurm running from source is more involved than installing from a package manager, but it gives you full control over version and build options. 
Episode 5 covers the complete setup: MUNGE for authentication, slurmctld on the controller, slurmd on compute nodes, and verifying everything works with a real batch job.","title":"[HPC From Scratch] Episode 5: How to Install Slurm from Source on Rocky Linux","type":"posts"},{"content":"","date":"20 May 2026","externalUrl":null,"permalink":"/tags/munge/","section":"Tags","summary":"","title":"Munge","type":"tags"},{"content":"One drive. One login. Every node sees the same home directory.\nIn Episode 3, we set up the network, installed Rocky Linux on all six nodes, configured DHCP and NAT, and hardened SSH. The cluster is networked and secured. Now it needs two things before Slurm makes any sense: shared storage and centralized authentication.\nWithout these two pieces, you are manually copying files to every node and creating the same user account six times. This episode fixes both problems.\n*(Click the image to watch the tutorial on YouTube)* 1. Why Shared Storage Matters # Without NFS, submitting an MPI job across two nodes means your input data has to exist on both nodes. You either copy it manually or write a script to sync it. Neither is sustainable.\nWith NFS, the Samsung 990 Pro on arbiter (the management node) exports a single /home directory. Every node in the cluster mounts it. Write a script on the login node, run it from any compute node. The file is already there.\nThis also matters for Slurm. When a job writes output files, they land in /home on the NFS share. You do not need to SSH into compute nodes to retrieve results.\nPrerequisites\nBefore starting this episode:\nAll nodes are running Rocky Linux 10 with network configured (Episode 3) arbiter has the Samsung 990 Pro NVMe drive installed (Episode 2) SSH key-based login is working from arbiter to all other nodes 2. Ansible Setup # From this episode onward, we use Ansible to apply configuration across all nodes at once. 
Without it, every change means SSHing into six machines individually.\nAnsible runs from arbiter. We keep it in /opt/ansible rather than a home directory so it stays off the NFS share. Ansible configuration files contain SSH keys and vault passwords that should not be visible to every node in the cluster.\nInstall Ansible # [wpaik@arbiter ~]$ sudo dnf install ansible-core [wpaik@arbiter ~]$ sudo mkdir -p /opt/ansible [wpaik@arbiter ~]$ sudo chown wpaik:wpaik /opt/ansible [wpaik@arbiter ~]$ cd /opt/ansible SSH Key # Generate a dedicated key for Ansible and distribute it to all nodes:\n[wpaik@arbiter ansible]$ mkdir .ssh [wpaik@arbiter ansible]$ ssh-keygen -t ed25519 -f .ssh/worker_ed25519 -N \u0026#34;\u0026#34; [wpaik@arbiter ansible]$ for node in 192.168.50.1 192.168.50.15 192.168.50.32 192.168.50.11 192.168.50.19; do ssh-copy-id -i .ssh/worker_ed25519.pub wpaik@$node done Inventory and Config # Create hosts.ini:\n[head] carrier.cluster.local ansible_host=192.168.50.1 [management] arbiter.cluster.local ansible_host=192.168.50.50 ansible_connection=local [workers] interceptor-01.cluster.local ansible_host=192.168.50.15 interceptor-02.cluster.local ansible_host=192.168.50.32 [gpu] corsair-01.cluster.local ansible_host=192.168.50.11 [visualization] observer.cluster.local ansible_host=192.168.50.19 [compute:children] workers gpu [all_nodes:children] head management workers gpu visualization [all_nodes:vars] ansible_user=wpaik cluster_network=192.168.50.0/24 cluster_domain=cluster.local cluster_realm=CLUSTER.LOCAL Note that arbiter uses ansible_connection=local since it is the Ansible controller itself.\nCreate ansible.cfg:\n[defaults] private_key_file = /opt/ansible/.ssh/worker_ed25519 inventory = ./hosts.ini host_key_checking = False log_path = ./log/ansible.log vault_password_file = /opt/ansible/.ansible_vault_pw remote_tmp = /var/tmp/.ansible-${USER}/tmp The last line, remote_tmp, deserves a note since it is the one setting that bites you only later. 
By default Ansible writes its per-task staging files into ~/.ansible/tmp/ on the remote node. After we set up NFS in section 3, every node\u0026rsquo;s /home lives on the NFS share, so that staging directory ends up on NFS. Files written there get the nfs_t SELinux context, which dnf refuses to handle when installing local RPMs in later episodes. The failure mode is misleading as dnf reports No match for argument for an RPM file that visibly exists on disk. Pinning remote_tmp to a local path on each node (/var/tmp is always local) sidesteps this entirely. It costs nothing now and saves a long debugging session in Episode 5.\nVerify connectivity:\n[wpaik@arbiter ansible]$ ansible all -m ping carrier.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } arbiter.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } interceptor-01.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } interceptor-02.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } corsair-01.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } observer.cluster.local | SUCCESS =\u0026gt; { \u0026#34;ping\u0026#34;: \u0026#34;pong\u0026#34; } All six nodes responding. From here on, playbooks handle the repetitive work.\n3. NFS Server Setup # All commands in this section run on arbiter.\nPartition the NVMe Drive with LVM # A single large partition works, but LVM gives us the flexibility to allocate separate volumes for home directories, work storage, shared software, and scratch space. 
This mirrors how storage is typically organized on a real HPC cluster.\nFirst, verify the NVMe drive:\n[wpaik@arbiter ~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 223.6G 0 disk ├─sda1 8:1 0 600M 0 part /boot/efi ├─sda2 8:2 0 1G 0 part /boot └─sda3 8:3 0 222G 0 part ├─rl-root 253:0 0 70G 0 lvm / └─rl-swap 253:1 0 7.7G 0 lvm [SWAP] nvme0n1 259:0 0 931.5G 0 disk The SATA boot drive is sda. The NVMe is nvme0n1. Create a physical volume, volume group, and four logical volumes:\n# Install LVM tools $ sudo dnf install -y lvm2 # Create physical volume and volume group $ sudo pvcreate /dev/nvme0n1 $ sudo vgcreate vg_nfs /dev/nvme0n1 # Create logical volumes $ sudo lvcreate -L 167G -n lv_home vg_nfs $ sudo lvcreate -L 251G -n lv_work vg_nfs $ sudo lvcreate -L 84G -n lv_shared vg_nfs $ sudo lvcreate -L 251G -n lv_scratch vg_nfs # Format as XFS $ sudo mkfs.xfs /dev/vg_nfs/lv_home $ sudo mkfs.xfs /dev/vg_nfs/lv_work $ sudo mkfs.xfs /dev/vg_nfs/lv_shared $ sudo mkfs.xfs /dev/vg_nfs/lv_scratch Create mount points and mount:\n$ sudo mkdir -p /nfsdata/{home,work,shared,scratch} $ sudo mount /dev/vg_nfs/lv_home /nfsdata/home $ sudo mount /dev/vg_nfs/lv_work /nfsdata/work $ sudo mount /dev/vg_nfs/lv_shared /nfsdata/shared $ sudo mount /dev/vg_nfs/lv_scratch /nfsdata/scratch Add to /etc/fstab for persistence:\n$ echo \u0026#39;/dev/vg_nfs/lv_home /nfsdata/home xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_work /nfsdata/work xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_shared /nfsdata/shared xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab $ echo \u0026#39;/dev/vg_nfs/lv_scratch /nfsdata/scratch xfs defaults 0 0\u0026#39; | sudo tee -a /etc/fstab Bind mount /nfsdata/home to /home on arbiter itself, so the management node also uses the NFS storage:\n$ echo \u0026#39;/nfsdata/home /home none bind 0 0\u0026#39; | sudo tee -a /etc/fstab $ sudo mount -a Verify the final layout:\n[wpaik@arbiter 
~]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS sda 8:0 0 223.6G 0 disk ├─sda1 8:1 0 600M 0 part /boot/efi ├─sda2 8:2 0 1G 0 part /boot └─sda3 8:3 0 222G 0 part ├─rl-root 253:0 0 70G 0 lvm / ├─rl-swap 253:1 0 7.7G 0 lvm [SWAP] └─rl-home 253:6 0 144.3G 0 lvm nvme0n1 259:0 0 931.5G 0 disk ├─vg_nfs-lv_home 253:2 0 167G 0 lvm /home │ /nfsdata/home ├─vg_nfs-lv_work 253:3 0 251G 0 lvm /nfsdata/work ├─vg_nfs-lv_shared 253:4 0 84G 0 lvm /nfsdata/shared └─vg_nfs-lv_scratch 253:5 0 251G 0 lvm /nfsdata/scratch The bind mount makes lv_home appear twice: once at /nfsdata/home (the actual mount point) and once at /home (the bind mount that arbiter itself uses). The other three volumes only mount at their /nfsdata paths on arbiter. Client nodes will mount them at /work, /shared, and /scratch via NFS.\nConfigure the NFS Server # $ sudo dnf install -y nfs-utils $ sudo systemctl enable --now nfs-server Configure /etc/exports:\n/nfsdata/home 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/work 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/shared 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) /nfsdata/scratch 192.168.50.0/24(rw,sync,no_root_squash,no_subtree_check) A quick note on the options: rw allows read and write, sync commits writes to disk before responding (safer), no_subtree_check avoids a performance penalty when exporting subdirectories, and no_root_squash lets root on client nodes act as root on the share, which Slurm will need later.\nNote on no_root_squash: This is appropriate for a trusted internal cluster network. Our cluster is physically isolated on the 192.168.50.x subnet. 
On a shared cluster with untrusted users, use root_squash instead.\nApply and open the firewall:\n$ sudo exportfs -ra $ sudo firewall-cmd --permanent --add-service={nfs,rpc-bind,mountd} $ sudo firewall-cmd --reload # Verify $ sudo showmount -e localhost Export list for localhost: /nfsdata/scratch 192.168.50.0/24 /nfsdata/shared 192.168.50.0/24 /nfsdata/work 192.168.50.0/24 /nfsdata/home 192.168.50.0/24 4. NFS Client Setup # Rather than SSHing into each node manually, use Ansible. Run from /opt/ansible on arbiter:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/nfs_setup.yaml -K What the playbook does on each client node: installs nfs-utils, sets the SELinux boolean for NFS home directories, creates mount points for /work, /shared, and /scratch, adds all four NFS mounts to /etc/fstab with _netdev, and mounts them.\nThe _netdev option marks a mount as network-dependent, so the system waits until the network is up before attempting it. Without it, a node can try to mount before its own network is ready, fail, and potentially hang at boot. It does not help if arbiter itself is still down, so bring arbiter up first.\nThe playbook also enables XFS quota on arbiter and reboots it to apply. This is covered in the full playbook in the GitHub repository.\nVerify from carrier after rebooting:\n[wpaik@carrier ~]$ df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/rl-root 70G 5.4G 65G 8% / arbiter.cluster.local:/nfsdata/home 167G 8.2G 159G 5% /home arbiter.cluster.local:/nfsdata/work 251G 4.9G 247G 2% /work arbiter.cluster.local:/nfsdata/shared 84G 23G 62G 27% /shared arbiter.cluster.local:/nfsdata/scratch 251G 22G 230G 9% /scratch Note: The playbook reboots worker and GPU nodes automatically. carrier (the head node) requires a manual reboot after the playbook completes since it is the SSH entry point into the cluster. 
After rebooting carrier, verify mounts with df -h.\nBefore moving on to FreeIPA, run the Chrony playbook to synchronize time across all nodes:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/chrony_setup.yaml -K This sets up carrier as the NTP server for the cluster and configures all other nodes to sync from it. FreeIPA uses Kerberos for authentication, and Kerberos will reject tickets if the time difference between nodes exceeds 5 minutes. Running Chrony before FreeIPA avoids that problem.\nTest that the share works:\n# Create a test file from interceptor-01 [wpaik@interceptor-01 ~]$ touch /home/nfs_test.txt # Verify it appears on interceptor-02 [wpaik@interceptor-02 ~]$ ls /home/nfs_test.txt /home/nfs_test.txt One file, visible everywhere.\n5. Time Synchronization (Chrony) # Before setting up FreeIPA, all nodes need to be synchronized to the same time source. FreeIPA uses Kerberos for authentication, and Kerberos will reject tickets if the clock difference between nodes exceeds 5 minutes. On a fresh cluster this is usually fine, but it is better to set it up explicitly.\ncarrier acts as the NTP server for the cluster. It syncs from external sources (time.cloudflare.com, pool.ntp.org) and serves time to all internal nodes. The other nodes sync from carrier.\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/chrony_setup.yaml -K Verify sync status on any node after the playbook completes:\n$ chronyc tracking Reference ID : C0A83201 (carrier.cluster.local) Stratum : 3 System time : 0.000123456 seconds fast of NTP time Last offset : +0.000045678 seconds RMS offset : 0.000089012 seconds Reference ID pointing to carrier.cluster.local confirms the node is syncing from carrier.\n6. The Problem with Local Users # NFS solves the file sharing problem. But it creates a new one.\nNFS uses UID (User ID) and GID (Group ID) numbers to handle file permissions, not usernames. 
When user will on interceptor-01 has UID 1001, and user will on interceptor-02 has UID 1002 (because you created the accounts in a different order), they see different permissions on the same NFS files.\n# On interceptor-01 $ id will uid=1001(will) gid=1001(will) # On interceptor-02 $ id will uid=1002(will) gid=1002(will) # The NFS file owned by will on interceptor-01 (uid=1001) # looks like it belongs to a different user on interceptor-02 You can work around this by manually synchronizing UIDs across every node. On a six-node cluster with a few users, that is tedious but manageable. On a real cluster with hundreds of users, it is not viable.\nThe proper solution is centralized authentication: one place where user accounts are defined, and every node pulls from that source. This is what FreeIPA provides.\nPre-flight: UID Alignment # NFS does not compare usernames. It compares the numeric UID and GID stamped on every file. If wpaik has UID 1000 on arbiter but UID 1001 on interceptor-01, every file written from interceptor-01 lands on the share owned by UID 1001, and arbiter cannot find a matching user. Reads and writes silently misbehave or fail outright.\nFor a fresh six-node build done in one sitting, this usually does not bite. Rocky\u0026rsquo;s installer assigns UID 1000 to the first user created during installation, so as long as wpaik was the first user on every node, the numbers line up by themselves. The hazard appears later: a node reinstalled out of band, a kickstart that differs between machines, or an extra account created during install before wpaik. The UID drifts, NFS quietly breaks, and the failure mode is confusing because everything else looks fine.\nCheck before mounting anything:\n[wpaik@arbiter ansible]$ ansible all_nodes -a \u0026#34;id wpaik\u0026#34; Every node should report the same uid= and gid=. 
If one differs, align it against arbiter\u0026rsquo;s value (typically 1000, but verify) before continuing.\nThe fix runs on the misaligned node, as a different sudoer or as root, with no active wpaik session. The example below assumes arbiter has wpaik at UID 1000 and the misaligned node currently has 1001. Substitute your actual values.\n# On the misaligned node, as root or another sudoer [root@interceptor-01 ~]# who | grep wpaik # confirm no live session [root@interceptor-01 ~]# pkill -KILL -u wpaik # kill any leftovers # If NFS is already mounted, unmount first [root@interceptor-01 ~]# umount /home # use -l if busy # Renumber the account [root@interceptor-01 ~]# groupmod -g 1000 wpaik [root@interceptor-01 ~]# usermod -u 1000 -g 1000 wpaik # Fix ownership of files under the old UID. # -xdev keeps find on the local filesystem, so other partitions # and NFS mounts (if any are still present) are not touched. [root@interceptor-01 ~]# find / -xdev -uid 1001 -exec chown -h 1000 {} + [root@interceptor-01 ~]# find / -xdev -gid 1001 -exec chgrp -h 1000 {} + # Verify [root@interceptor-01 ~]# id wpaik uid=1000(wpaik) gid=1000(wpaik) groups=1000(wpaik),10(wheel) If wpaik belonged to extra groups before (wheel, for example), check with groups wpaik and re-add anything that got dropped during the usermod.\nThis is a stopgap. FreeIPA in Section 7 replaces local accounts with centralized identity and the question stops mattering. Until then, UID alignment is something you manage by hand whenever a node joins the cluster out of cycle.\n7. FreeIPA Server Installation # FreeIPA bundles several services into one package: LDAP (directory), Kerberos (authentication), DNS, and a certificate authority. The installation is opinionated and sets everything up together.\nAll commands in this section run on arbiter.\nPrerequisites # FreeIPA requires a fully qualified domain name (FQDN). 
Verify it resolves correctly before proceeding:\n[wpaik@arbiter ~]$ hostname -f arbiter.cluster.local [wpaik@arbiter ~]$ ping -c 1 arbiter.cluster.local PING arbiter.cluster.local (192.168.50.50) 56(84) bytes of data. Also verify at least 1.5GB of free RAM. The installer is memory-hungry:\n$ free -h total used free Mem: 15Gi 800Mi 14Gi Install and Run the Server Setup # $ sudo dnf install -y freeipa-server freeipa-server-dns $ sudo ipa-server-install \\ --domain=cluster.local \\ --realm=CLUSTER.LOCAL \\ --ds-password=\u0026lt;your_directory_manager_password\u0026gt; \\ --admin-password=\u0026lt;your_admin_password\u0026gt; \\ --hostname=arbiter.cluster.local \\ --ip-address=192.168.50.50 \\ --no-ntp \\ --unattended A few things to note: --realm must be uppercase, --no-ntp skips NTP configuration since we manage time sync with Chrony separately, and --unattended skips interactive prompts. The installer takes 5-10 minutes and configures LDAP, Kerberos, and the CA.\nAfter completion, open the required firewall ports:\n$ sudo firewall-cmd --permanent --add-service={freeipa-ldap,freeipa-ldaps,kerberos,dns,http,https} $ sudo firewall-cmd --reload Verify the Installation # $ kinit admin Password for admin@CLUSTER.LOCAL: $ klist Ticket cache: KCM:0 Default principal: admin@CLUSTER.LOCAL Valid starting Expires Service principal 04/27/26 09:00:00 04/28/26 09:00:00 krbtgt/CLUSTER.LOCAL@CLUSTER.LOCAL $ ipa user-find --------------- 0 users matched --------------- No users yet. We will add them after enrollment.\nSet the default shell to bash (the FreeIPA default is /bin/sh):\n$ ipa config-mod --defaultshell=/bin/bash 8. FreeIPA Client Enrollment # Before enrolling, add arbiter to /etc/hosts on every node. The enrollment process needs to resolve arbiter.cluster.local, and at this point SSSD is not yet configured. 
Doing this beforehand ensures enrollment does not fail on DNS resolution.\nThe Ansible playbook handles this automatically:\n[wpaik@arbiter ansible]$ ansible-playbook playbooks/freeipa_setup.yaml -K If you prefer to do it manually on each node:\n# Add arbiter to /etc/hosts $ echo \u0026#34;192.168.50.50 arbiter.cluster.local arbiter\u0026#34; | sudo tee -a /etc/hosts # Install and enroll $ sudo dnf install -y freeipa-client oddjob-mkhomedir $ sudo ipa-client-install \\ --server=arbiter.cluster.local \\ --domain=cluster.local \\ --realm=CLUSTER.LOCAL \\ --principal=admin \\ --password=\u0026lt;your_admin_password\u0026gt; \\ --mkhomedir \\ --no-ntp \\ --unattended The --mkhomedir flag tells the system to create a home directory on first login. Since /home is NFS-mounted from arbiter, the directory lands on the NFS share and is immediately visible from all nodes.\nAfter enrollment, confirm each node can reach the IPA server:\n[wpaik@interceptor-01 ~]$ ipa user-find --------------- 0 users matched --------------- If this returns a response (even 0 users), the client is enrolled and talking to the server.\nCreate a Test User # Back on arbiter:\n[wpaik@arbiter ~]$ kinit admin $ ipa user-add testuser \\ --first=Test \\ --last=User \\ --password $ ipa user-find testuser -------------- 1 user matched -------------- User login: testuser First name: Test Last name: User Home directory: /home/testuser Login shell: /bin/bash UID: 99100XXXX GID: 99100XXXX Notice the UID range. FreeIPA assigns UIDs starting well above the range used by local system accounts, avoiding any collision. 
The exact starting range depends on how FreeIPA was configured during installation, but whatever it assigns will be identical on every node in the cluster.\nFor ongoing user management, the scripts/user_creation.sh script in the GitHub repository handles the full process: FreeIPA account creation, home directory setup with correct NFS ownership, XFS quota, and Slurm accounting entry.\nAccessing the FreeIPA Web UI # The FreeIPA web interface is reachable from outside the cluster using sshuttle, a VPN-over-SSH tool that routes traffic through the login node.\nOn your local machine:\n# Install sshuttle $ sudo dnf install sshuttle # Fedora/RHEL # or: pip install sshuttle # Add arbiter to your local /etc/hosts $ echo \u0026#34;192.168.50.50 arbiter arbiter.cluster.local\u0026#34; | sudo tee -a /etc/hosts # Open the tunnel (keep this terminal open) $ sshuttle -r wpaik@carrier.cluster.local 192.168.50.0/24 --dns Then open a browser and go to https://arbiter.cluster.local/ipa/ui/. Accept the self-signed certificate warning and log in with the admin credentials.\n9. Verification # SSH as the new user from the login node to a compute node:\n[wpaik@carrier ~]$ ssh testuser@interceptor-01 Password: Creating home directory for testuser. [testuser@interceptor-01 ~]$ pwd /home/testuser [testuser@interceptor-01 ~]$ id uid=99100XXXX(testuser) gid=99100XXXX(testuser) groups=99100XXXX(testuser) Now check the same user from a different node:\n[testuser@interceptor-02 ~]$ id uid=99100XXXX(testuser) gid=99100XXXX(testuser) groups=99100XXXX(testuser) Same UID on both nodes. Files written on interceptor-01 have correct permissions on interceptor-02. The home directory is the same NFS path regardless of which node you land on.\nOne account. Every node. One home directory.\nTroubleshooting Common Issues # Enrollment fails with DNS error: The playbook adds arbiter.cluster.local to /etc/hosts before enrollment. 
If it still fails, verify the entry exists on the failing node:\n$ getent hosts arbiter.cluster.local 192.168.50.50 arbiter.cluster.local arbiter If missing, add it manually:\n$ echo \u0026#34;192.168.50.50 arbiter.cluster.local arbiter\u0026#34; | sudo tee -a /etc/hosts NFS mount fails after FreeIPA enrollment: FreeIPA updates /etc/nsswitch.conf. Confirm that both sss and files are listed for passwd and group:\n$ grep -E \u0026#34;^(passwd|group)\u0026#34; /etc/nsswitch.conf passwd: sss files systemd group: sss files systemd If NFS mounts hang after enrollment:\n$ sudo setsebool -P use_nfs_home_dirs 1 Home directory not created on first login:\n$ sudo systemctl enable --now oddjobd Node freezes on boot after NFS setup: A stale resume=UUID in GRUB can cause boot hangs. From the GRUB menu, press e, remove the resume=UUID=... argument, then Ctrl+X to boot. Once up, remove it permanently (grubby needs root):\n$ sudo grubby --update-kernel=ALL --remove-args=\u0026#34;resume=UUID=\u0026lt;UUID\u0026gt;\u0026#34; 10. What is Next # The cluster now has shared storage and centralized authentication. Every node shares the same home directory and every user has a consistent identity across all nodes.\nNext episode we install Slurm, the job scheduler. With NFS and FreeIPA already in place, Slurm has everything it needs to schedule jobs across nodes and write output files back to a shared location.\nAll configuration files and Ansible playbooks from this episode are in the GitHub repository.\nHappy Computing!\n","date":"5 May 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-04/","section":"Posts","summary":"A cluster where each node has its own home directory and user list is not really a cluster. 
Episode 4 fixes that by installing NFS for shared storage and FreeIPA for centralized authentication, so every node sees the same files and every user logs in with the same credentials from anywhere.","title":"[HPC From Scratch] Episode 4: NFS Storage \u0026 FreeIPA: One Drive, One Login","type":"posts"},{"content":"","date":"5 May 2026","externalUrl":null,"permalink":"/tags/freeipa/","section":"Tags","summary":"","title":"FreeIPA","type":"tags"},{"content":"","date":"5 May 2026","externalUrl":null,"permalink":"/tags/nfs/","section":"Tags","summary":"","title":"NFS","type":"tags"},{"content":"Date: April 28, 2026 Venue: Northeastern University, Boston, MA\nOverview # A hands-on workshop for university researchers who want to scale computation beyond a single CPU core. This session walks through core parallel computing concepts, real benchmark results, and working code examples that can be run directly on the cluster.\nTopics Covered # Serial vs. parallel execution: pipelining and data parallelism Flynn\u0026rsquo;s Taxonomy: SISD, SIMD, MISD, MIMD Shared vs. distributed memory models and when to use each Amdahl\u0026rsquo;s Law, Gustafson\u0026rsquo;s Law, and strong vs. weak scaling CPU parallelism in practice: Conway\u0026rsquo;s Game of Life (serial, OpenMP, MPI+OpenMP) GPU computing fundamentals: CUDA workflow and memory model Scaling ML workloads with PyTorch: single GPU, multi-GPU, and multi-node DDP Parallel tools for Python, R, and MATLAB Mapping parallelism to Slurm: --ntasks vs. 
--cpus-per-task Materials # Workshop Slides \u0026amp; Materials (GitHub) Workshop Recordings (Spring 2026) ","date":"28 April 2026","externalUrl":null,"permalink":"/talks/neu-talk-02/","section":"Talks \u0026 Workshops","summary":"","title":"Introduction to Parallel Computing","type":"talks"},{"content":"","date":"28 April 2026","externalUrl":null,"permalink":"/tags/parallel-computing/","section":"Tags","summary":"","title":"Parallel Computing","type":"tags"},{"content":"","date":"28 April 2026","externalUrl":null,"permalink":"/talks/","section":"Talks \u0026 Workshops","summary":"","title":"Talks \u0026 Workshops","type":"talks"},{"content":"","date":"28 April 2026","externalUrl":null,"permalink":"/tags/workshop/","section":"Tags","summary":"","title":"Workshop","type":"tags"},{"content":"A laptop, a home router, and a gigabit switch. One isolated cluster subnet.\nIn Episode 2, we upgraded the four M715q nodes with dual-channel RAM and an NVMe drive, and fixed the iGPU memory trap that can crash Slurm jobs. This episode brings the cluster online: installing Rocky Linux, designing the network, and turning a laptop into a DHCP server, NAT gateway, and SSH bastion for the internal cluster subnet.\n*(Click the image to watch the tutorial on YouTube)* 1. The Topology Decision # In production HPC, management and compute networks are strictly wired, physically separated, and connected through managed switches with VLANs. A single enterprise managed switch can cost more than this entire cluster.\nFor a home build, there are two realistic paths:\nFlat home network. Plug every node into the home router. Easy, but every node is exposed to the same network as phones, TVs, and IoT devices. No isolation, and one compromised device can reach the whole cluster. Physical isolation with a dedicated switch. All cluster nodes live on their own subnet behind a cheap unmanaged switch. The login node bridges the two worlds. I went with option 2. 
The Netgear GS308E provides the isolation. The login node sits at the boundary, handling DHCP, DNS, and NAT for the internal cluster subnet. Worker nodes never see the home network directly.\nThe result is the same pattern used in production HPC: the login node at the edge, an internal fabric behind it, and no direct external exposure for compute nodes. The difference is scale. Gigabit Ethernet instead of InfiniBand. An unmanaged consumer switch instead of a spine-leaf topology. Same architecture, different order of magnitude.\nNote: The HP Envy GPU node (corsair-01) connects to the same switch and gets the same base OS and network setup as every other node. The GPU side of that box will be configured in a later episode.\n2. OS Installation # Every node runs Rocky Linux 10, minimal install. I used the NanoKVM to mount the ISO and drive the installer over my browser, rotating it between machines. A monitor and keyboard work the same way if you do not have a NanoKVM.\nThe installation itself is unremarkable: boot the ISO, select minimal install, pick the boot drive, let it run, reboot.\nTwo things worth pre-planning while installing:\nCreate a sudoer user on every node. This is the account you will SSH into later. Root SSH will be disabled, so without this account you will lock yourself out.\nUse the same username across all nodes. When you run ssh-copy-id later, the local username is assumed by default, so ssh-copy-id arbiter works if the user matches on both sides. When FreeIPA comes in a later episode, it will replace these local accounts with centralized identity, but consistency makes the transition smoother.\n3. Login Node on WiFi # The login node is carrier, a refurbished Lenovo IdeaPad 1 laptop. It has WiFi and one Ethernet port. Most build guides would tell you that a login node should be wired. I put this one on WiFi on purpose.\nWhy WiFi for the external side? 
The login node needs internet access for package updates, pulling datasets, and remote SSH from outside the home. Running Ethernet to the home router would work, but it would consume one of the eight switch ports I need for cluster nodes, and it would require an extra cable across the room. WiFi removes that constraint at the cost of bandwidth the login node does not actually need.\nWhy Ethernet for the internal side? All heavy traffic (NFS reads, MPI messages, scheduler heartbeats) has to stay on the wired switch at full gigabit. The login node\u0026rsquo;s Ethernet port is the gateway into that fabric.\nThere are three laptop-specific steps to cover before anything else works.\nEssential packages. Throughout the series we will need a compiler, git, and a reasonable editor:\nsudo dnf upgrade -y sudo dnf install -y epel-release sudo dnf install -y vim git wget tree curl gcc-c++ cmake m4 Lid-close fix. By default, closing a laptop lid triggers systemd-logind to suspend the machine. For a login node this is catastrophic: the cluster loses its DHCP server, NAT gateway, and SSH entry point the moment you close the lid. The fix is a one-line setting in the [Login] section of logind.conf. /usr/lib/systemd/logind.conf is the vendor copy, which package updates can overwrite, so the durable place for the override is /etc/systemd/logind.conf or a drop-in under /etc/systemd/logind.conf.d/:\nHandleLidSwitch=ignore After sudo systemctl restart systemd-logind, the laptop can live closed on top of the cluster stack without suspending.\nRouting priority. With two active interfaces (WiFi and Ethernet), Linux has to decide which one handles outbound internet traffic. It picks based on the route metric: lower metric wins. By default NetworkManager assigns wired connections a lower metric than WiFi (typically 100 versus 600), which means internet traffic would be routed out through the cluster switch, which has no path to the home router. 
The fix is to force WiFi to have the lowest metric:\nnmcli connection modify \u0026lt;WIFI NAME\u0026gt; ipv4.route-metric 10 nmcli connection down \u0026lt;WIFI NAME\u0026gt; \u0026amp;\u0026amp; nmcli connection up \u0026lt;WIFI NAME\u0026gt; Find the connection name with nmcli connection show. After this, ip route show default should list WiFi as the first (primary) default route.\n4. DHCP: Handing Out IPs # The worker nodes need IP addresses. They have no connection to the home router, so the home router\u0026rsquo;s DHCP cannot reach them. The login node has to become their DHCP server.\nFirst, give the login node a fixed address on the cluster side. Workers will use this as their gateway:\nnmcli connection modify \u0026lt;WIRED NAME\u0026gt; ipv4.addresses 192.168.50.1/24 ipv4.method manual nmcli connection up \u0026lt;WIRED NAME\u0026gt; Now install and configure dnsmasq. I picked it over isc-dhcp-server because it is lightweight, single-binary, and handles both DHCP and DNS. For a six-node cluster, anything more is overkill.\nsudo dnf install -y dnsmasq sudo mv /etc/dnsmasq.conf /etc/dnsmasq.conf.bak The replacement /etc/dnsmasq.conf is about ten lines:\ninterface=\u0026lt;WIRED INTERFACE\u0026gt; dhcp-range=192.168.50.10,192.168.50.50,12h dhcp-option=3,192.168.50.1 dhcp-option=6,1.1.1.1,8.8.8.8 log-queries log-dhcp Find the interface name with nmcli device. Each line does one job:\ninterface= restricts dnsmasq to the wired side only. Without this, dnsmasq would try to answer DHCP requests on WiFi too, which would fight with the home router. dhcp-range= defines the pool of IPs dnsmasq will hand out, and the lease duration (12 hours). dhcp-option=3,192.168.50.1 advertises the login node as the default gateway. This is how workers learn where to send traffic destined for the internet. dhcp-option=6,1.1.1.1,8.8.8.8 tells workers which DNS servers to use (Cloudflare and Google as public fallbacks). log-queries and log-dhcp turn on verbose logging. 
Invaluable during initial bring-up. Turn them off once the cluster is stable. Open the firewall for DHCP and DNS, then start the service:\nsudo firewall-cmd --permanent --add-service=dhcp sudo firewall-cmd --permanent --add-service=dns sudo firewall-cmd --reload sudo systemctl enable --now dnsmasq Tip: journalctl -u dnsmasq -f on the login node during worker boot shows the full DHCP handshake as it happens (DHCPDISCOVER, DHCPOFFER, DHCPREQUEST, DHCPACK). Very useful for diagnosing why a worker is not getting an address.\n5. NAT: Getting Workers to the Internet # DHCP handed out IPs in the 192.168.50.x range. Those are private addresses, defined by RFC 1918 as non-routable on the public internet. If a worker sends a packet to dnf.rocky.example.com, it reaches the login node, its default gateway, and dies there. It has no path out.\nThe fix is Network Address Translation (NAT). The login node rewrites the source address on every outbound packet to its own WiFi-side IP (itself a private address on the home network). Reply packets come back to the WiFi IP, and the login node looks up which internal source the packet belongs to and forwards it back. This is the same trick your home router does for every device in the house.\nTwo pieces are needed.\nIP forwarding. By default, a Linux machine will not forward packets between interfaces. It has to be explicitly allowed:\nsudo sysctl -w net.ipv4.ip_forward=1 echo \u0026#34;net.ipv4.ip_forward = 1\u0026#34; | sudo tee /etc/sysctl.d/99-ipforward.conf The first command enables forwarding immediately. The second persists it across reboots.\nMasquerade rule. With forwarding enabled, the kernel will route packets between interfaces, but it will not rewrite their source addresses. A masquerade rule on firewalld tells the kernel to do that rewriting:\nsudo firewall-cmd --permanent --add-masquerade sudo firewall-cmd --reload Verify:\nsudo firewall-cmd --list-all | grep masquerade Should show masquerade: yes.\nBring workers online and test. Power on a worker node. 
On the login node, check the leases file:\ncat /var/lib/dnsmasq/dnsmasq.leases Each line contains a timestamp, MAC address, IP, and hostname. SSH into the worker using the sudoer account you created during install:\nssh \u0026lt;user\u0026gt;@192.168.50.11 ping -c 3 1.1.1.1 If the ping works, DHCP, routing, and NAT are all doing their job. Pinging an IP address does not exercise DNS, so repeat the test with a hostname (for example, ping -c 3 rockylinux.org) to confirm name resolution as well.\n6. Hostnames Instead of IPs # Typing IP addresses everywhere gets old fast. Worse, if you ever renumber the subnet, every script, config file, and commit history has the wrong addresses baked in. Hostnames are indirection, and indirection is cheap insurance.\nI use these names:\nHostname IP Role carrier 192.168.50.1 Login node arbiter 192.168.50.50 Management / NFS interceptor-01 192.168.50.15 Compute interceptor-02 192.168.50.32 Compute observer 192.168.50.19 Visualization corsair-01 192.168.50.11 GPU On each node, including the login node:\nsudo hostnamectl set-hostname \u0026lt;HOSTNAME\u0026gt; Then add every node to /etc/hosts on the login node:\n192.168.50.1 carrier.cluster.local carrier 192.168.50.15 interceptor-01.cluster.local interceptor-01 192.168.50.32 interceptor-02.cluster.local interceptor-02 192.168.50.11 corsair-01.cluster.local corsair-01 192.168.50.19 observer.cluster.local observer 192.168.50.50 arbiter.cluster.local arbiter From now on ssh arbiter works instead of ssh 192.168.50.50. This is a stopgap. FreeIPA in a later episode brings up a proper DNS server so hostnames resolve cluster-wide without touching /etc/hosts on each node.\n7. Hardening the Exposed Surface # Only the login node is reachable from the home WiFi. Workers sit behind NAT on their own subnet, so nothing on the home network can reach them directly. Hardening effort goes into carrier.\nThree things matter here: the SSH config itself, brute-force protection, and a small systemd fix specific to laptops.\nSSH drop-in config. 
Rocky 10\u0026rsquo;s default /etc/ssh/sshd_config includes files from /etc/ssh/sshd_config.d/*.conf, and the first value wins when the same setting appears in multiple files. This is a drop-in config system: you do not edit the main config, you add a new file with only the things you want to change.\nThe only real change I make is disabling direct root SSH login:\nsudo tee /etc/ssh/sshd_config.d/99-custom.conf \u0026gt; /dev/null \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; PermitRootLogin no EOF sudo sshd -t # validate syntax sudo systemctl reload sshd A few settings stay at their upstream defaults on purpose:\nPublic key authentication is enabled by default. ssh-copy-id works without any config change. Password authentication is also enabled by default, and I keep it. HPC users coming from university clusters are used to password login, and FreeIPA in a later episode will route that through centralized auth anyway. The combination of fail2ban and a decent password policy is a reasonable defense. Host keys (the server\u0026rsquo;s identity, not user keys) auto-load when no explicit HostKey directive is set. Rocky 10 generates RSA, ECDSA, and Ed25519 host keys at first boot. No config needed. Make sure the SSH port is reachable through firewalld:\nsudo firewall-cmd --permanent --add-service=ssh sudo firewall-cmd --reload fail2ban. Port 22 attracts brute-force attempts even on home networks. A compromised IoT device on the same WiFi is enough to start one. 
fail2ban watches auth logs, and when it sees too many failures from the same IP in a short window, it adds a temporary firewall rule to drop traffic from that IP.\nFollowing upstream fail2ban guidance, the configuration is a short jail.local that overrides only what I want to change:\nsudo dnf install -y fail2ban sudo tee /etc/fail2ban/jail.local \u0026gt; /dev/null \u0026lt;\u0026lt;\u0026#39;EOF\u0026#39; [DEFAULT] bantime = 10m maxretry = 3 [sshd] enabled = true mode = aggressive EOF sudo systemctl enable --now fail2ban Three failed auth attempts from an IP and it is banned for ten minutes at the firewall level. mode = aggressive combines the normal, ddos, and extra SSH filters. normal catches standard auth failures, ddos catches connections that close before authentication completes (a signature of some scanners), and extra adds a few less common patterns. Check what is currently banned with sudo fail2ban-client status sshd.\nCluster-wide passwordless SSH. With key auth enabled and the SSH jail in place, the last piece is distributing keys. On the login node, as your regular sudoer user (not root, which cannot SSH anyway):\nssh-keygen -t ed25519 ssh-copy-id \u0026lt;user\u0026gt;@arbiter ssh-keygen creates a private/public keypair in ~/.ssh/ (id_ed25519 and id_ed25519.pub). ssh-copy-id logs into the target machine with password auth once, then appends your public key to ~/.ssh/authorized_keys on that machine. On subsequent SSH attempts, the server sees your public key, verifies you have the matching private key, and lets you in without asking for a password. Repeat for each worker. After that, ssh arbiter from the login node should not prompt for a password.\nOptional: sshd startup override for laptop login nodes. On this laptop-based login node I ran into occasional boot-time issues where sshd failed to start before the Ethernet interface was fully configured. I did not capture the exact error at the time, so I cannot confirm the root cause with certainty. 
The standard fix is a systemd override that makes sshd wait for network-online.target and retry on failure. If your sshd is in a failed state after reboot, check journalctl -u sshd -b. If you are building this on desktop or server hardware, you likely do not need it.\nApply it with sudo systemctl edit sshd.service and paste:\n[Unit] Wants=network-online.target After=network-online.target StartLimitIntervalSec=0 [Service] Restart=on-failure RestartSec=5s systemctl edit creates the drop-in file at /etc/systemd/system/sshd.service.d/override.conf and runs daemon-reload automatically.\nInternal Nodes: Firewall Off # Everything in this section so far has been about carrier. The other five nodes are a different story.\nThey sit on 192.168.50.0/24 behind NAT. Nothing on the home WiFi can reach them directly, and the only inbound path is through carrier. firewalld on these nodes adds no real defense, but it does block things that need to work: NFS callbacks, FreeIPA enrollment over Kerberos and LDAP, and the long list of dynamic ports Slurm uses for srun and step launches. Maintaining accurate firewall rules across all of that is tedious and easy to get wrong.\nThe simpler approach, and the standard practice on isolated HPC fabrics, is to turn it off on every node that is not the login node:\nsudo systemctl disable --now firewalld The security boundary is carrier. Inside the boundary, full trust between nodes.\n8. Why WiFi Is Not the Bottleneck # The most common question about this topology is whether WiFi on the login node bottlenecks the cluster. It does not, because traffic paths are asymmetric.\nEditing code over SSH, pulling packages from dnf, running git pull, monitoring the system from a browser: all of this goes out over WiFi. None of it is bandwidth-sensitive. WiFi throughput at a few hundred Mbps is more than enough.\nThe heavy lifting happens entirely on the gigabit switch. 
When interceptor-01 reads a dataset from arbiter\u0026rsquo;s NFS share, that traffic goes node-to-switch-to-node without ever touching the login node. When an MPI job on two workers exchanges messages, same thing. Full gigabit, predictable latency, no WiFi involvement.\nThe compute fabric is purely wired. The WiFi side is only for management and internet access. Someone streaming 4K video on the home network has zero impact on cluster performance.\n9. What is Next # The cluster is now networked, addressable, and reachable. Every node has an OS. Hostnames resolve. The login node handles DHCP, NAT, and SSH for the internal subnet. From any machine on the home network, I can SSH into carrier and from there into any worker, password-free.\nIn Episode 4, we mount the Samsung 990 Pro on arbiter as shared NFS storage and bring up FreeIPA for centralized user management. After that, a single user account created once will work across every node in the cluster, and all nodes will share a home directory tree.\nAll configuration files and the full command reference for this episode are on GitHub.\nHappy Computing!\n","date":"21 April 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-03/","section":"Posts","summary":"Building a home HPC cluster does not require a perfect network on day one. Episode 3 covers setting up the login node on WiFi, planning a clean IP scheme for the cluster subnet, and getting SSH working reliably across all nodes while keeping the setup simple enough to upgrade later.","title":"[HPC From Scratch] Episode 3: Rocky Linux Setup, DHCP, NAT, and SSH Across All Nodes","type":"posts"},{"content":"","date":"21 April 2026","externalUrl":null,"permalink":"/tags/networking/","section":"Tags","summary":"","title":"Networking","type":"tags"},{"content":"Four nodes. 16GB each. One hidden BIOS setting that can crash your Slurm jobs.\nIn Episode 1, we covered the full cluster architecture, cost breakdown, and network layout. 
This episode focuses on the compute backbone: upgrading the four Lenovo ThinkCentre M715q nodes with dual-channel RAM and NVMe storage, and fixing a BIOS setting that silently eats your memory.\n*(Click the image to watch the tutorial on YouTube)* 1. What We Are Working With # Each M715q is a tiny Micro Form Factor PC. Here is what they shipped with from eBay:\nSpec Stock Configuration CPU AMD Ryzen 5 Pro 2400GE (4C/8T, 35W TDP) RAM 8GB DDR4 SO-DIMM (single stick, single-channel) Boot Drive 240GB 2.5\u0026quot; SATA SSD M.2 Slot Empty (NVMe capable) iGPU AMD Radeon RX Vega 11 The Ryzen 5 Pro 2400GE is a 35-watt part. Quiet and low-power, which matters when you have four of them sitting on your desk. 4 cores and 8 threads per node gives us 32 threads total across all the M715q nodes.\n8GB of single-channel RAM is often not enough for most HPC workloads. And the boot drive being a 2.5\u0026quot; SATA SSD turned out to work in our favor for the storage upgrade.\n2. The Storage Upgrade Path # When I opened the first M715q, I found the 240GB SATA SSD sitting in the 2.5\u0026quot; bay and an empty M.2 NVMe slot on the motherboard.\nBecause the OS boots from the SATA drive, the high-speed M.2 slot is free. I realized that I could install a 1TB Samsung 990 Pro into that slot on the management node. This drive serves as the NFS storage for the entire cluster.\nIf the boot drive had been an M.2 SSD instead (which I believe is a default option for these units), the upgrade path would have been different. I would have bought a standard SATA SSD for NFS instead. You work with the hardware you get.\nA PCIe Gen 4 NVMe drive is probably overkill for a Gigabit Ethernet network. Even more so because the M715q\u0026rsquo;s M.2 slot is PCIe 3.0, so the 990 Pro runs at Gen 3 speeds anyway. The network will bottleneck long before the drive does. We will benchmark the throughput we actually get in a later episode.\nNote: Only the management node gets the NVMe drive. 
The other three M715q nodes keep their stock 240GB SATA SSDs as boot drives. There is no need for local fast storage on compute nodes when jobs read data from NFS.\n3. RAM Upgrade: 8GB to 16GB Dual-Channel # 8GB is not enough for most HPC workloads. Instead of replacing the existing stick with a single 16GB module, I added a second 8GB stick.\nEach M715q came with one 8GB DDR4 SO-DIMM in one slot, which leaves the second slot empty. I bought matching 8GB sticks and installed them in the empty slots. This gives us two benefits:\nDouble the capacity (8GB to 16GB) Dual-channel memory bandwidth Dual-channel matters for compute. With a single stick, the CPU accesses memory through one channel. With two sticks in both slots, it can read and write through two channels simultaneously. This roughly doubles the theoretical memory bandwidth, which directly affects performance in memory-bound workloads like MPI and numerical computation.\nRAM Compatibility\nThe M715q uses DDR4 SO-DIMM (laptop-sized) memory. When buying used RAM, match the specifications as closely as possible to the existing stick:\nSpec What to Match Form Factor DDR4 SO-DIMM Capacity 8GB (to match existing stick) Speed DDR4-2666 or higher (the 2400GE supports up to 2933) Voltage 1.2V (standard DDR4) (The M715q units I purchased came with DDR4-2666.)\nI bought my RAM sticks on eBay. The four sticks cost a total of $78, averaging about $20 per node for the upgrade. If you buy 16GB kits (2x8GB) new, expect to pay more, but the two sticks in a kit are guaranteed to match each other.\nTip: If you are unsure about compatibility, read the part number off the label of the installed stick and search for that exact part online.\nInstallation\nOpening the M715q is straightforward. Remove one screw on the back panel, slide the top cover off, and the internals are fully exposed. Once you remove the 2.5\u0026quot; SATA bay (one screw, then slide forward), the two SO-DIMM slots are clearly visible. Push the new stick into the empty slot until the clips snap into place.\nI upgraded all four nodes. 
The management node took a bit longer because it also got the Samsung 990 Pro NVMe drive. The other three were just RAM, so I ran through them quickly.\n4. The iGPU Memory Trap # This is the part that might save you hours of debugging later.\nAfter installing the RAM, I booted the management node into a Linux Live USB using the NanoKVM (no monitor or keyboard needed). I opened a terminal and ran:\n$ free -m total used free shared buff/cache available Mem: 15661 1656 10369 73 3989 14005 15,661 MiB. We installed 16GiB (16,384 MiB). Where did the other ~700 MiB go?\nThe answer: the integrated Vega GPU. Ryzen APUs share system RAM with the iGPU (integrated GPU). The GPU reserves a portion of your physical memory as video memory (VRAM), and the operating system never sees it.\nI confirmed this with the following command:\n$ dmesg | grep VRAM The output showed 256MB allocated to VRAM. The rest of the gap is memory reserved by the firmware and the kernel itself, which free -m also leaves out of the total.\nBIOS Setting: UMA Frame Buffer Size\nThe amount of RAM reserved for the iGPU is controlled by a BIOS setting called UMA Frame Buffer Size. On my units, the default was set to Auto, which allocated 256MB.\nI explicitly set it to 256MB (the lowest available option). Why bother changing it if Auto was already picking 256MB? Because Auto lets the firmware decide, and that decision could change after a BIOS update or a hardware configuration change. If the iGPU suddenly grabs 512MB instead of 256MB, your Slurm jobs could start failing and the error messages will not point you to the BIOS.\nPinning it to a fixed value removes the guesswork.\nWhy This Matters for Slurm\nWhen you configure Slurm later in this series, each node\u0026rsquo;s memory must be declared in slurm.conf using the RealMemory parameter. If you set RealMemory=16000 because you installed 16GB, Slurm will try to allocate memory that does not exist. 
Jobs will crash with out-of-memory errors.\nThe correct approach:\nBoot the node Run free -m and note the total value Use that number (or slightly below it) as RealMemory in your Slurm configuration # Example slurm.conf entry NodeName=interceptor-01 CPUs=8 RealMemory=15600 State=UNKNOWN Every megabyte counts. Document it now, and save yourself the debugging later.\n5. Upgrade Cost Summary # Here is what this episode\u0026rsquo;s upgrades cost:\nItem Count Unit Price (USD) Total (USD) DDR4 8GB SO-DIMM (Micron) 2 15.00 30.00 DDR4 8GB SO-DIMM (Hynix) 2 24.00 48.00 Samsung 990 Pro 1TB NVMe 1 109.90 109.90 Episode Total $187.90 Combined with the four M715q units from Episode 1 ($343.60), the total compute backbone cost so far is $531.50. That covers four Ryzen nodes with 16GB dual-channel RAM each and 1TB of NVMe storage for NFS.\nPer node (excluding the shared NFS drive): roughly $105 for a fully upgraded Ryzen 4C/8T compute node with 16GB of RAM.\n6. What is Next # The compute backbone is assembled and verified. But hardware without a network is just a pile of metal.\nIn Episode 3, we will look at the rest of the cluster: the HP Envy TE01 GPU node with its Intel i7-10700F, the Gigabit network switch that connects everything, and why the login node uses WiFi to bridge the cluster to the internet.\nHappy Computing!\n","date":"25 March 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-02/","section":"Posts","summary":"Buying more RAM does not always mean your jobs get more of it. 
Episode 2 of HPC From Scratch covers the hardware upgrades made to the cluster nodes and explains how iGPU shared memory quietly reduces available system memory on compute nodes, with no error message to warn you.","title":"[HPC From Scratch] Episode 2: RAM, NVMe, and the iGPU Memory Trap","type":"posts"},{"content":"","date":"25 March 2026","externalUrl":null,"permalink":"/tags/hardware/","section":"Tags","summary":"","title":"Hardware","type":"tags"},{"content":"A 6-node cluster for $1,264. No server rack, no enterprise budget.\nThe HPC 101 and Special Topics series covered how to use an HPC cluster. This series covers how to build one.\nOver the next several episodes, I will walk through the full process of building a functional HPC cluster from consumer hardware: sourcing parts, installing the OS, configuring Slurm, setting up identity management with FreeIPA, benchmarking, and upgrading. Every configuration file will be available on my GitHub.\nThis first episode covers what is in the cluster, where I got each part, how the network is laid out, and how this compares to running cloud instances.\n*(Click the image to watch the tutorial on YouTube)* 1. Why Build a Cluster? # There are two common alternatives, and both have trade-offs.\nCloud (AWS, GCP, Azure): Running multi-node compute instances 24/7 gets expensive. Even with a 3-year savings plan, two modest EC2 instances cost over $2,300 per year (see Section 5). That is fine for burst workloads, but it is not practical for always-on experimentation and learning.\nSingle workstation: A high-end desktop gives you raw compute power, but it does not teach you distributed systems. You will never hit a network bottleneck, debug a Slurm scheduling conflict, or troubleshoot MPI on a single machine. You need multiple nodes for that.\nI wanted a miniature version of a real supercomputer architecture that I could test, break, and fix on my desk. 
It runs the same software stack as a university research cluster: Slurm for job scheduling, FreeIPA for identity management, NFS for shared storage, and MPI for parallel workloads.\n2. Bill of Materials # All prices are what I actually paid between late 2024 and late 2025. Due to recent price increases in the PC parts market, your total may be higher if you replicate this build today.\nItem Count Unit Price (USD) Total (USD) Condition Lenovo IdeaPad 1 1 161.00 161.00 Refurbished Lenovo ThinkCentre M715q 4 85.90 343.60 Used HP Envy TE01 1 400.00 400.00 Used DDR4 SODIMM (Micron) 2 15.00 30.00 Used DDR4 SODIMM (Hynix) 2 24.00 48.00 Used Netgear GS308E 1 21.50 21.50 New Samsung 990 Pro 1TB 1 109.90 109.90 New Sabrent USB-C Hub 1 59.90 59.90 New 10Gbps Cat 6 Ethernet Cable (x5) 1 9.90 9.90 New NanoKVM 1 69.90 69.90 New Rubber Feet 1 9.90 9.90 New Total Cost 1,263.60 Where I sourced these:\nThe four ThinkCentre M715q units and the RAM came from eBay. The HP Envy TE01 was a Craigslist cash deal (no receipt for that one). The Samsung 990 Pro, Netgear switch, USB-C hub, cables, and rubber feet came from Amazon. The NanoKVM was ordered directly from the manufacturer. The IdeaPad 1 was a refurbished unit from Lenovo.\nThe key was patience. I did not buy everything at once. I watched eBay listings for weeks, picked up the Craigslist deal when it appeared, and bought new components during sales. The M715q units averaged under $86 each. At that price, four of them cost less than a single mid-range GPU.\nNote on future upgrades: An RTX 5060 Ti and a new power supply are planned for the GPU node. These are not included in the cost above because they are optional upgrades, not part of the initial build. The GPU upgrade will be covered in a dedicated episode.\n3. 
Cluster Architecture # Hostname Role Hardware CPU Notes carrier Login Node Lenovo IdeaPad 1 AMD Ryzen 3 7920U (8vCPU, 8GB RAM) WiFi to internet, Ethernet to cluster switch arbiter Management Node Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, 16GB RAM) Slurm controller, FreeIPA server interceptor-01 CPU Compute Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, 16GB RAM) Slurm compute interceptor-02 CPU Compute Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, 16GB RAM) Slurm compute corsair-01 GPU Compute HP Envy TE01 Intel i7-10700F (16 vCPU, ~32GB RAM) GTX 1660 Super (upgrade planned) observer Visualization Lenovo ThinkCentre M715q Ryzen 5 Pro 2400GE (8 vCPU, 16GB RAM) Visual/monitoring tasks Mixing AMD Ryzen and Intel across nodes looks messy at first glance. But in production HPC, heterogeneous architectures are standard.\nTake El Capitan, the world\u0026rsquo;s fastest supercomputer as of the November 2024 TOP500 list. It uses AMD MI300A APUs that pack CPU and GPU cores into a single package. My cluster splits those roles across separate nodes instead. The core idea is the same: different processors handling different parts of a workload. This cluster does that at desk scale.\nAll nodes run Rocky Linux. The software stack includes Slurm 25.11 for job scheduling, FreeIPA for centralized identity and authentication, NFS for shared storage (served from the Samsung 990 Pro), and OpenMPI for parallel workloads. Monitoring runs on Prometheus and Grafana. All configuration is managed through Ansible playbooks.\n4. Network Layout # The network topology is simple.\nAll cluster nodes connect to a Netgear GS308E Gigabit managed switch on a 192.168.50.x subnet. The switch is unmanaged in practice: no VLANs, no trunking. Internal cluster traffic stays physically isolated on this switch.\nThe login node (carrier) has two network interfaces. Its WiFi connects to the home router for internet access. Its Ethernet connects to the cluster switch. 
This makes the login node a bridge between the outside world and the internal cluster network.\nThis is the same pattern used in production HPC: the login node sits at the boundary between the external network and the internal fabric. The difference is scale and bandwidth. Gigabit Ethernet instead of InfiniBand or Slingshot. A consumer switch instead of a spine-leaf topology.\n5. AWS Cost Comparison # To put the build cost in perspective, here is what a roughly comparable cloud setup would cost on AWS.\nThe comparison uses two c6g.2xlarge instances, which match the CPU compute nodes (interceptor-01 and interceptor-02) in vCPU count and memory. This does not include the management node, visualization node, login node, or GPU node. The actual cluster has more capacity than what is represented by two EC2 instances.\nHome Cluster (2 CPU nodes) AWS EC2 (2x c6g.2xlarge) vCPUs per node 8 8 Memory per node 16 GB 16 GB Architecture x86 (AMD Ryzen 5 Pro) ARM (AWS Graviton2) Network 1 Gbps (managed switch) Up to 10 Gbps Total one-time cost $1,264 N/A Annual cost Electricity only $2,300 (3-yr Savings Plan, N. Virginia) Break-even ~7 months vs. cloud N/A Caveat: This comparison matches node count and memory, not raw performance. The c6g.2xlarge instances use newer ARM (Graviton2) cores and have significantly faster networking. The point is not that the home cluster outperforms EC2; it does not. But for learning distributed systems, job scheduling, and cluster administration, building your own hardware pays for itself fast and gives you experience that cloud instances cannot.\nThe AWS estimate was generated using the AWS Pricing Calculator with the following configuration: 2x c6g.2xlarge, US East (N. Virginia), Linux, Compute Savings Plans (3-year, no upfront), 24/7 consistent workload.\n6. What is Next # In Episode 2, we will open up the Lenovo ThinkCentre M715q and go through the hardware in detail. 
I will show you how to install the RAM upgrades and fix a critical BIOS setting where the integrated Vega GPU reserves a chunk of system memory by default.\nAfter that, the series will cover:\nOperating system installation and initial configuration Slurm installation and multi-node job scheduling FreeIPA setup for centralized authentication NFS shared storage configuration GPU upgrade (RTX 5060 Ti swap and power supply replacement) Benchmarking and performance tuning Cable management (yes, eventually) All configuration files and Ansible playbooks will be published on my GitHub as we go.\nHappy Computing!\n","date":"13 March 2026","externalUrl":null,"permalink":"/posts/hpc-from-scratch-01/","section":"Posts","summary":"University clusters have waitlists and cloud HPC gets expensive fast. This series documents building a real 6-node HPC cluster from consumer hardware for $1,264. Episode 1 covers hardware selection, how to assign node roles, and the decisions that actually matter before you buy anything.","title":"[HPC From Scratch] Episode 1: Building a 6-Node HPC Cluster for $1,264","type":"posts"},{"content":"Date: February 24, 2026 Venue: Northeastern University, Boston, MA\nOverview # A hands-on workshop designed for university researchers and faculty who are new to Linux and high-performance computing (HPC) environments. 
This session covers essential command-line skills needed to navigate and work efficiently on HPC clusters.\nTopics Covered # Navigating the Linux filesystem and managing files Working with text editors and file permissions Environment variables and shell configuration Essential command-line utilities for research workflows Tips for transitioning from GUI-based workflows to the terminal Materials # Workshop Slides \u0026amp; Materials (GitHub) Recording # Workshop Recordings (Spring 2026) ","date":"24 February 2026","externalUrl":null,"permalink":"/talks/neu-talk-01/","section":"Talks \u0026 Workshops","summary":"","title":"Linux Essentials for HPC Researchers","type":"talks"},{"content":"Stop using your laptop as a middleman.\nWelcome to the first HPC Special Topics post. This is a standalone deep dive that builds on concepts from the HPC 101 series.\nIn the Data Transfer post, we learned how to move files using scp and rsync. Those tools work great for laptop-to-cluster transfers. But what about cloud storage?\nImagine this: a professor shares a 200GB dataset on Google Drive. Without the right tool, you would download it to your laptop (2 hours on a good day), then scp it to the cluster (another 2 hours). That is 4 hours of babysitting file transfers.\nWhat if you could skip the laptop entirely and pull data straight from Google Drive to your /scratch directory with a single command?\nThat is exactly what Rclone does.\nWe will also go beyond the basics and explore an optimization technique often discussed by experienced HPC engineers: how parallel threading can drastically change your transfer speeds and when it does not.\n1. Why Rclone? # Rclone is a command-line program to manage files on cloud storage. Think of it as rsync, but for the cloud. 
It supports over 70 cloud storage providers, including Google Drive, Dropbox, OneDrive, Box, AWS S3, and even SFTP.\n*(Click the image to watch the tutorial on YouTube)* Why does this matter on HPC?\nDirect Transfer: Move data from Google Drive to your cluster\u0026rsquo;s /scratch space without touching your laptop. No more download-upload-download cycles.\nParallelization: Unlike scp which sends one file at a time through a single stream, Rclone can transfer multiple files simultaneously. This is where things get interesting (more on this in Section 4).\nReliability: Rclone handles retries, checksums, and interrupted transfers automatically. If your connection drops at 99%, it picks up where it left off just like rsync -P but for cloud storage.\nVersatility: One tool, 70+ backends. Whether your collaborator shares data on Google Drive, your institution uses Box, or your pipeline stores results on S3, Rclone handles them all with the same interface.\n2. Setting Up Rclone on HPC # Important: Many HPC clusters prohibit running large data transfers on the login node. Check your cluster\u0026rsquo;s policy first. If large transfers are restricted, run Rclone inside a compute job:\n$ srun --pty bash $ module load rclone $ rclone copy ... Step 1: Loading Rclone # On many HPC clusters, Rclone is already available as a module.\n$ module avail rclone $ module load rclone If Rclone is not available as a module, you can install it locally in your home directory:\n# Download and unzip $ curl -O https://downloads.rclone.org/rclone-current-linux-amd64.zip $ unzip rclone-current-linux-amd64.zip # Move the binary to your local bin $ mkdir -p ~/bin $ cp rclone-*/rclone ~/bin/ # Verify $ ~/bin/rclone version Note: If you install locally, make sure ~/bin is in your $PATH, or use the full path ~/bin/rclone when running commands.\nStep 2: Connecting to Google Drive (The Headless Challenge) # This is the step that trips up most beginners. 
Since your HPC cluster does not have a web browser, you must use the headless setup to authenticate.\nRun rclone config and choose n for a new remote. Name it something memorable (e.g., gdrive). Select the provider number for Google Drive. For all other prompts (Client ID, Secret, Scope, Root Folder ID, Service Account, Advanced config), just press Enter to accept the defaults. When asked \u0026ldquo;Use auto config?\u0026rdquo;, choose n. This is crucial for remote servers without a browser. Rclone will provide a URL. Copy and paste this URL into your local laptop\u0026rsquo;s browser. Log in to your Google account, authorize Rclone, and copy the verification code back into your HPC terminal. When asked about Team Drive, choose n (unless you use one). Confirm with y to save. (Check the Rclone overview of cloud storage systems for detailed steps to connect to other cloud providers.)\n$ rclone config # Follow the prompts above # ... # Verify the connection $ rclone lsd gdrive: # You should see your Google Drive folders listed If you see your folders, you are connected.\nTip: The same process works for Dropbox, OneDrive, and Box. Just choose a different provider number in step 3. Each provider has slightly different authentication steps, but Rclone walks you through them interactively.\n3. Essential Commands # Before we dive into optimization, let\u0026rsquo;s cover the commands you will use daily.\nListing and Browsing\n# List top-level directories in your cloud $ rclone lsd gdrive: # List files in a specific folder $ rclone ls gdrive:my_project/data # Show directory tree (great for exploring) $ rclone tree gdrive:my_project --max-depth 2 # Check storage usage $ rclone about gdrive: Copying Data\n# Cloud -\u0026gt; Cluster (the most common use case) $ rclone copy gdrive:my_data ~/scratch/my_data -P # -P: Shows real-time progress, speed, and ETA # Cluster -\u0026gt; Cloud (backing up results) $ rclone copy ~/scratch/results gdrive:results -P Copy vs. 
Sync: Know the Difference\n# copy: Only adds new files. Never deletes anything at the destination. $ rclone copy gdrive:data ~/scratch/data -P # sync: Makes destination identical to source. DELETES files at # destination that don\u0026#39;t exist at source. Use with caution! $ rclone sync gdrive:data ~/scratch/data -P Warning: rclone sync will delete files at the destination that are not present at the source. Always double-check your command before running sync. When in doubt, use copy.\nAt this point, you have everything you need to use Rclone as a daily tool. The next sections explore how to make it faster.\n4. The Optimization Challenge: Threads vs. Bandwidth # *(Click the image to watch the tutorial on YouTube)* A common insight among experienced HPC engineers is that in many real-world WAN scenarios, a single TCP stream cannot fully utilize available bandwidth due to latency, TCP window limits, and provider-side throttling. The solution? Open more streams.\nRclone has a key flag for this:\n--transfers=N # Number of files to transfer in parallel (default: 4) This raised a few questions worth testing:\nDoes increasing threads always make things faster? Is there a point of diminishing returns? Does uploading (send) behave the same as downloading (receive)? The Experiment\nEnvironment: 4-core HPC compute node, 1Gbps network, Rclone with default Google Drive API (shared client ID). Scenario A: A single large 5GB file (generated with /dev/urandom to prevent compression shortcuts). Scenario B: 1,000 small files (1MB each, also random data). Variable: --transfers set to 1, 4, 8, 16, and 32. Repetitions: 3 runs per condition to ensure consistency. 5. Benchmark Results # Scenario A: The Single Giant (5GB) # (Transfer time for a single 5GB file across different thread counts.)\nThe line is flat. Whether you set --transfers to 1 or 32, the transfer time barely changes.\nWhy? Because --transfers controls file-level parallelism. 
It determines how many files are transferred simultaneously. If you only have one file, there is nothing to parallelize. One file, one stream, regardless of the thread count.\nThis is a common misconception: --transfers=16 does not split a single file into 16 chunks. It opens 16 slots for 16 separate files.\nAdvanced Note: Rclone does provide --multi-thread-streams for chunk-level parallel downloads of single large files on supported backends. However, this works only for downloads and its effectiveness varies by provider. For most use cases, the --transfers flag covered here is what you want.\nTakeaway: For large single files, increasing --transfers has no effect. The transfer speed is determined by your network bandwidth and the cloud provider\u0026rsquo;s per-stream throughput.\nScenario B: The Small File Storm (1,000 × 1MB) # This is where threading shines.\n(Transfer time for 1,000 small files (1MB each) across different thread counts.)\nWith a single thread, uploading 1,000 files took 1,293 seconds (over 21 minutes). At 8 threads, it dropped to 199 seconds (about 3 minutes). That is a 6.5x speedup just by changing one flag.\nDownloads tell a slightly different story: 1 thread took 307 seconds, while 4 threads brought it down to 93 seconds (a 3.3x improvement). But beyond 4 threads, download speed barely changed.\nWhy are small files so sensitive to threading? Each file transfer involves API calls, metadata verification, checksum validation, and connection overhead. With a single thread, you wait for all of this to complete before starting the next file. Multiple threads hide this per-file latency by overlapping transfers, which is why the speedup is so dramatic.\n6. Finding the Sweet Spot # (Speedup factor relative to single-thread baseline.)\nThe Plateau Effect # Performance gains essentially stop after 8 threads. Why?\nAPI Rate Limits. Google Drive (and most cloud providers) limit the number of API requests per second. 
Adding more threads beyond the provider\u0026rsquo;s limit just leads to throttling and retries. This is especially strict when using the default shared API client ID that all Rclone users share.\nTip for Power Users: Creating your own Google API client ID can significantly increase your API quota and may shift the optimal thread count higher. See the Rclone Google Drive documentation for details.\nOverhead. Managing 32 concurrent transfers creates its own overhead: connection setup, checksum verification, and retry logic all compete for resources.\nSend (Upload) vs. Receive (Download) # Notice that downloading is significantly faster and saturates earlier than uploading across all conditions.\nWhen you upload, the cloud provider must verify, index, and store each file as it arrives. When you download, the provider serves files from optimized CDN infrastructure with less per-file processing overhead. This asymmetry means your optimal --transfers value may differ depending on the direction of your transfer.\nEfficiency: Why 8 Is the Magic Number # We can measure how efficiently each thread contributes to speedup:\n$$ \\text{Efficiency} = \\frac{\\text{Speedup}}{\\text{Number of Threads}} \\times 100\\% $$ Threads Send Speedup Efficiency 1 1.0x 100% 4 3.9x 98% 8 6.5x 81% 16 6.6x 41% 32 6.6x 21% At 8 threads, you get 81% efficiency, and each thread is pulling its weight. At 32 threads, efficiency drops to 21%. You are using 4x the resources for essentially zero additional speedup.\nFor this specific setup (1Gbps network, default Google Drive API client), 8 threads was the sweet spot. Your optimal number may differ depending on your network speed, cloud provider, and API configuration, but the methodology for finding it is the same: test, measure, compare.\nNote: These numbers are specific to Google Drive with the default shared API client ID. Your results may vary depending on the cloud provider, network speed, and API configuration. 
The methodology, however, applies universally.\n7. Summary \u0026amp; Recommendations # Rclone is more than a convenience tool. It is a direct pipeline between your cloud storage and your cluster.\nKey Takeaways:\nSkip the laptop. Use Rclone to transfer data directly between cloud and cluster. Threads matter for small files. Threads hide per-file latency overhead. Thousands of files? Use --transfers 8 or --transfers 16. Threads do not help single large files. --transfers is file-level parallelism, not file-splitting. Uploads and downloads behave differently. Downloads saturate earlier. Plan accordingly. Don\u0026rsquo;t overdo it. Setting threads to 64 will likely trigger API throttling and slow you down. Pack when possible. Even with Rclone, 100,000 tiny files will be slow. Consider using tar to bundle them first (as we covered in the Data Transfer post). Scenario Recommended Command Many small files rclone copy remote:path local:path --transfers 8 -P Few large files rclone copy remote:path local:path -P Directory sync rclone sync remote:path local:path -P (use with caution) Check before transfer rclone lsd remote: and rclone about remote: What is Next?\nWe have added another essential tool to our HPC toolkit. In the next series, we will shift gears completely from using the cluster to building one. We will talk about hardware, networking, and how to turn a pile of parts into a working HPC system.\nSee you in the next series!\nHappy Computing!\n","date":"16 February 2026","externalUrl":null,"permalink":"/posts/hpc-special-topics-01/","section":"Posts","summary":"Running rclone copy with default settings works, but it is usually much slower than it needs to be. 
This post covers how to configure Rclone for HPC workflows, benchmark real transfer speeds across cloud storage providers, and tune the parameters that actually affect throughput on a cluster network.","title":"[HPC Special Topics] Rclone for HPC: Benchmarking and Tuning Cloud Storage Transfers","type":"posts"},{"content":"","date":"16 February 2026","externalUrl":null,"permalink":"/tags/cloud-storage/","section":"Tags","summary":"","title":"Cloud Storage","type":"tags"},{"content":"","date":"16 February 2026","externalUrl":null,"permalink":"/tags/performance-tuning/","section":"Tags","summary":"","title":"Performance Tuning","type":"tags"},{"content":"","date":"16 February 2026","externalUrl":null,"permalink":"/tags/rclone/","section":"Tags","summary":"","title":"Rclone","type":"tags"},{"content":"In the real world, hitting \u0026ldquo;Submit\u0026rdquo; is just the beginning.\nSo far, we have covered the essentials: Logging in, Moving Data, and Managing Environments. Finally, you submitted your job.\nBut sometimes, things go wrong.\nYour job stays \u0026ldquo;Pending\u0026rdquo; forever. It crashes 2 seconds after starting. It runs for 3 days but produces empty files. Today, we will learn the \u0026ldquo;Survival Skills\u0026rdquo; for HPC. We will cover how to debug failed jobs, how to check your resource efficiency, and why you are stuck in the queue.\n*(Click the image to watch the tutorial on YouTube)* 1. In-depth Monitoring (scontrol) # You submitted a job. You type squeue --me. It says PD (Pending). Ok, but after 10 minutes, it\u0026rsquo;s still pending. Or maybe it\u0026rsquo;s running, but you don\u0026rsquo;t know where.\n$ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12345 cpu bash user123 PD 0:00 1 (Priority) 12346 gpu bash user123 PD 0:00 1 (Resources) squeue gives you a quick summary, but sometimes you need the Full Report. 
Use the command scontrol show job \u0026lt;JOBID\u0026gt;.\n$ scontrol show job 12345 JobId=12345 JobName=bash UserId=user123(123456) GroupId=users(1000) ... JobState=PENDING Reason=Priority ... StartTime=2026-01-25T21:00:00 EndTime=Unknown NodeList=(null) WorkDir=/home/user123/my_project Command=/bin/bash ... Key fields to look for:\nJobState \u0026amp; Reason: Tells you exactly why it is waiting (e.g., Resources, Priority). StartTime: The scheduler\u0026rsquo;s estimated start time. (Note: This can change if higher priority jobs enter the queue). NodeList: If running, this shows which specific compute node you are using (e.g., compute-node-01). WorkDir: Confirms where your script is running and where output files will be saved. Linux Tip: What is grep? The output of scontrol is very long. We can filter it using a pipe | and grep.\n| (Pipe): Takes the output of the left command and passes it to the right command. grep: Think of it as \u0026ldquo;Ctrl + F\u0026rdquo; for the terminal. It prints only the lines containing your keyword. # Show me ONLY the StartTime line $ scontrol show job 12345 | grep StartTime StartTime=2026-01-25T22:00:00 EndTime=2026-01-25T23:00:00 2. The Emergency Button (scancel) # Oops! You just realized you requested 100 nodes instead of 1 node. Or maybe your code is stuck in an infinite loop.\nDon\u0026rsquo;t just let it fail. Kill it immediately.\n# Slurm: cancel a specific job $ scancel 12345 # Slurm: cancel ALL jobs by user $ scancel -u user123 # PBS/Torque: cancel a specific job $ qdel 12345 # PBS/Torque: cancel ALL jobs (depends on system, usually manual loop or specific command) $ qselect -u user123 | xargs qdel 3. The Detective Work (sacct \u0026amp; Logs) # You came back from coffee, and your job is gone from the queue. Did it finish? Or did it fail? Since it is not in the queue (squeue), we need to check the History.\nStep 1: Check the State (sacct) # The command is sacct (Slurm Accounting). 
By default, the output is messy, so we use format options.\n$ sacct -j 12345 --format=JobID,State,AllocCPUS,ReqMem,MaxRSS,Elapsed,ExitCode JobID State AllocCPUS ReqMem MaxRSS Elapsed ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- 12345 FAILED 1 2G 00:10:15 137:0 12345.batch FAILED 1 00:10:15 137:0 Common States:\nCOMPLETED: Success! (Exit Code 0:0) CANCELLED: The job was killed. TIMEOUT: The job ran longer than the requested --time. FAILED: The code crashed (Non-zero exit code). Step 2: Read the Logs # sacct tells you what happened, but not why. To find the \u0026ldquo;why\u0026rdquo;, look at the output file you defined in your script (e.g., #SBATCH -o result.out).\n# Look at the END of the file first $ tail -n 20 result.out Common Error Messages:\ncommand not found: Did you module load? ModuleNotFoundError: Did you conda activate or install the package? killed / oom-kill: You ran out of memory. Step 3: Get Notified (Pro Tip) # Jobs often fail when you are not watching. Let Slurm email you. Add this to your job script:\n#SBATCH --mail-type=FAIL,END #SBATCH --mail-user=you@email.com FAIL: Notify only when it crashes. END: Notify when it finishes (success or failure). 4. Resource Efficiency (seff) # This is the most important part for becoming a \u0026ldquo;Power User\u0026rdquo;.\nImagine you reserved a banquet table for 40 people, but you ate dinner alone. The restaurant manager (Scheduler) would be angry. In HPC, this happens when you request --cpus-per-task=40 but your python script only uses 1 core.\nHow do you check your efficiency? Use seff.\n$ seff 12345 Job ID: 12345 Cluster: cluster User/Group: user123/users State: COMPLETED (exit code 0) Cores: 8 CPU Utilized: 00:01:25 CPU Efficiency: 10.23% of 00:01:30 core-walltime Job Wall-clock time: 00:01:30 Memory Utilized: 12.09 MB Memory Efficiency: 0.15% of 8.00 GB (8.00 GB/node) Note: Some clusters may not have seff enabled. 
In that case, use sacct with AveCPU and MaxRSS.\n$ sacct -j 12345 --format=JobID,State,AveCPU,MaxRSS JobID State AveCPU MaxRSS ------------ ---------- ---------- ---------- 12345 COMPLETED 12345.batch COMPLETED 00:01:30 12384K How to interpret the output:\nCPU Efficiency:\nBad (\u0026lt; 50%): You requested too many cores. If your code is not parallelized, request only 1 core.\nGood (~ 90%): You are utilizing resources well.\nMemory Efficiency:\nBad (\u0026lt; 10%): You requested too much RAM. Reduce --mem next time.\nDangerous (\u0026gt; 95%): You are on the edge of crashing (OOM). Increase --mem slightly (e.g., by 20%).\nWhy does this matter? Smaller jobs fit into \u0026ldquo;gaps\u0026rdquo; in the cluster more easily. By requesting only what you need, your jobs will start faster!\n5. Why is my job pending? (Fairshare) # Sometimes, your job stays in PD (Pending) state with reason Priority or Resources, even though there seem to be empty nodes.\nThis is likely due to Fairshare. Think of it as a \u0026ldquo;Karma System\u0026rdquo;.\nThe cluster is a shared resource. If you ran thousands of heavy jobs last week, your \u0026ldquo;Karma\u0026rdquo; goes down. You wait in line. If you haven\u0026rsquo;t used the cluster for a while, your \u0026ldquo;Karma\u0026rdquo; is high. You jump the queue. Checking the Reason Explicitly\nInstead of guessing, you can ask Slurm exactly why you are waiting:\n$ squeue -j 12345 -o \u0026#34;%.18i %.9T %.30R\u0026#34; JOBID STATE NODELIST(REASON) 12345 PENDING (Priority) This reveals the specific REASON code:\nPriority: Just wait. It\u0026rsquo;s Fairshare logic. Resources: The cluster is busy, or you requested a specific node that is busy. QOSMaxJobsLimit: You hit the limit of allowed running jobs. Dependency: It\u0026rsquo;s waiting for another job to finish. Don\u0026rsquo;t panic. Usually, you just need to wait.\n6. 
Summary \u0026amp; Cheatsheet # Debugging Mindset (Read This Once) # If a job fails, always ask these questions in order:\nDid it start? (squeue, scontrol) -\u0026gt; If not, check your script syntax. Did it finish or crash? (sacct) -\u0026gt; Check the State. Why did it crash? (logs) -\u0026gt; Read the .err file. Did I request the right resources? (seff) -\u0026gt; Check memory usage. Can I make it smaller? -\u0026gt; Smaller jobs run faster. Congratulations! You have officially graduated from HPC 101. You are no longer just a guest; you are a resident of the cluster.\nGoal Command Check Details scontrol show job \u0026lt;JOBID\u0026gt; Kill Job scancel \u0026lt;JOBID\u0026gt; Check History sacct -j \u0026lt;JOBID\u0026gt; Check Efficiency seff \u0026lt;JOBID\u0026gt; What\u0026rsquo;s Next? In the next series, we will change gears completely. We will stop being a \u0026ldquo;User\u0026rdquo; and start thinking like an \u0026ldquo;Engineer\u0026rdquo;. I will start a new series on How to Build an HPC Cluster from scratch.\nSee you in the next series!\n","date":"28 January 2026","externalUrl":null,"permalink":"/posts/hpc101-04/","section":"Posts","summary":"Job failures on HPC clusters are frustrating, especially without clear feedback. 
This post breaks down how to use sacct, seff, and Slurm log files to figure out exactly what went wrong and how to prevent it next time.","title":"[HPC 101] Job Debugging: Why Did My Job Fail?","type":"posts"},{"content":"","date":"28 January 2026","externalUrl":null,"permalink":"/tags/debugging/","section":"Tags","summary":"","title":"Debugging","type":"tags"},{"content":"","date":"28 January 2026","externalUrl":null,"permalink":"/series/hpc-101/","section":"Series","summary":"","title":"HPC 101","type":"series"},{"content":"","date":"28 January 2026","externalUrl":null,"permalink":"/tags/seff/","section":"Tags","summary":"","title":"Seff","type":"tags"},{"content":"","date":"28 January 2026","externalUrl":null,"permalink":"/tags/troubleshooting/","section":"Tags","summary":"","title":"Troubleshooting","type":"tags"},{"content":"It looks like it worked. But you just created a hidden mess.\nWelcome back to the HPC 101 series.\nIn the previous post, we learned how to transfer data. Now, you are ready to run your Python code. You log in, type pip install numpy, and hit Enter.\nUnlike the old days, you might not see a \u0026ldquo;Permission Denied\u0026rdquo; error. Instead, you see this:\n[user123@compute-node-01 ~]$ pip install numpy Defaulting to user installation because normal site-packages is not writeable Collecting numpy Using cached numpy-2.3.5... Installing collected packages: numpy Successfully installed numpy-2.3.5 It says \u0026ldquo;Successfully installed\u0026rdquo;. So everything is fine, right?\nNo. You just fell into the \u0026ldquo;User Install\u0026rdquo; trap.\nToday, we will learn why this \u0026ldquo;automatic\u0026rdquo; installation is dangerous in HPC and how to build a proper \u0026ldquo;Private Laboratory\u0026rdquo; using Virtual Environments.\n*(Click the image to watch the tutorial on YouTube)* 1. 
The Trap: The \u0026ldquo;Backpack\u0026rdquo; Problem # When the system notices you cannot write to the global library, it quietly installs packages into your hidden home folder (usually ~/.local/lib/python3.x/site-packages).\nLet\u0026rsquo;s use an analogy.\nSystem Python: This is the Restaurant Kitchen Pantry. It has standard ingredients. You are not allowed to touch it. User Install (pip install): This is your Backpack. Since you can\u0026rsquo;t use the pantry, you stuff ingredients into your backpack. Virtual Environment: This is a Separate Lunchbox. Why is the Backpack (User Install) bad?\nNo Isolation (Dependency Hell): Project A needs NumPy 1.20 and Project B needs NumPy 2.0. If you put them both in your backpack, they get squashed together. You broke Project A to fix Project B.\n2. The Solution: Your Private Lunchbox # Instead of stuffing everything into one backpack, you should use Virtual Environments.\nThink of a Virtual Environment as your own private lunchbox.\nIsolation: You create a \u0026ldquo;Project A Box\u0026rdquo; and a \u0026ldquo;Project B Box\u0026rdquo;. They never touch each other. Safety: Even if you mess up the installation in one box, you just throw that box away. Your other projects are safe. In HPC, using environments is not just \u0026ldquo;good practice\u0026rdquo;. It is the only way to survive.\n3. Choose Your Tool: Conda vs. Venv # There are two main tools for creating environments: Conda and Venv. Which one should you use?\nWhat is it?\nA cross-platform package manager that installs Python packages and external libraries (C, C++, CUDA).\nPros: Manages Python Versions: You can create an environment with Python 3.8 today and Python 3.12 tomorrow. Binary Dependencies: Handles complex libraries with GPU support (CUDA/cuDNN) automatically.\nCons: Heavy: It takes up more disk space than venv. Slow: Sometimes the \u0026ldquo;solver\u0026rdquo; takes a long time to resolve dependencies. 
Shell Pollution: Improper use of conda init can break your terminal (See Step 2).\nNote: Due to recent Anaconda licensing changes, many HPC centers are transitioning from Anaconda/Miniconda to Miniforge, which uses the free conda-forge channel by default. The commands and workflow are identical. See the Anaconda Terms of Service for details.\nRecommendation: Best choice for Science, Engineering, and AI/ML projects.\nWhat is it?\nA built-in Python module that creates lightweight virtual environments.\nPros: Lightweight \u0026amp; Fast: Built into Python, creates environments instantly. Clean: Doesn\u0026rsquo;t touch your shell configuration files.\nCons: Limited: Cannot install non-Python tools (like CUDA drivers) easily. Dependent: You are tied to the system\u0026rsquo;s Python version (if system has Python 3.6, your venv is 3.6).\nRecommendation: Good for simple Python scripts or pure software development.\n4. Let’s Build Your Environment # Let\u0026rsquo;s get practical. Here is how you set up your environment.\nStep 1: Create the Environment # # 1. Load the module $ module load miniconda3 # 2. Create environment $ conda create --name myenv python=3.13 # To store in your Lab\u0026#39;s group directory to save space in Home $ conda create --prefix /projects/myLAB/myCondaEnv python=3.13 # 1. Load the python module $ module load python/3.13.10 # 2. Create environment $ python3 -m venv /projects/myLAB/myenv # To store in your Lab\u0026#39;s group directory to save space in Home $ python3 -m venv /projects/myLAB/myVenv Step 2: Activate (The Safe Way) # WARNING: Do NOT run conda init\nMany tutorials tell you to run conda init. In HPC, this is dangerous. It modifies your .bashrc file to automatically activate the (base) environment every time you log in. This causes:\nConflict with system modules (OpenMPI, GCC). Open OnDemand Failure: It may prevent Jupyter or RStudio sessions from starting. 
If you already ran it, disable auto-activation:\n$ conda config --set auto_activate_base false Instead, use source activate \u0026lt;ENV\u0026gt; or the full path.\n# The \u0026#34;HPC Safe\u0026#34; way (Recommended) $ source activate /projects/myLAB/myCondaEnv # OR if you are using module system properly: $ conda activate /projects/myLAB/myCondaEnv # Source the activate script $ source /projects/myLAB/myVenv/bin/activate Step 3: Install Packages # # Handles binary deps (CUDA, etc.) better $ module load cuda/12.8 # Make sure to match the cuda version (myCondaEnv) $ conda install -c conda-forge cupy cuda-version=12.8 (myCondaEnv) $ conda install numpy pandas # Works in both Conda and Venv (myVenv) $ pip install matplotlib huggingface-hub Rule of Thumb:\nNever run pip install unless you are inside an activated environment.\nStep 4: Deactivate # # For conda $ conda deactivate # For venv $ deactivate 5. Maintenance: Cleaning the Trash (Cache) # One day, you might see this error: Disk quota exceeded.\nYou check your folder, and it seems small. Where did the space go? Both Conda and Pip store downloaded files in a hidden Cache folder (~/.conda/pkgs or ~/.cache/pip). These can grow to 10GB+ easily.\nThe Clean Way:\n# Remove unused packages and caches $ conda clean --all # Remove pip cache $ pip cache purge The \u0026ldquo;Nuclear\u0026rdquo; Option: If your disk is 100% full, the commands above might fail (because they can\u0026rsquo;t create a lock file). In that case, you have to delete them manually.\n# WARNING: Be careful with rm -rf # For conda $ rm -rf ~/.conda/pkgs/* # For pip $ rm -rf ~/.cache/pip/* Don\u0026rsquo;t worry, deleting cache won\u0026rsquo;t break your installed environments. It just deletes the downloaded installers.\nNote: By default, Conda stores downloaded packages in ~/.conda/pkgs. On most HPC clusters, your home directory has a strict storage quota, so this can fill up fast. 
You can redirect the cache to a scratch or project directory:\n[user@login ~]$ conda config --add pkgs_dirs /scratch/user123/pkgDir Or edit ~/.condarc directly:\npkgs_dirs: - /scratch/user123/pkgDir 6. Summary \u0026amp; Cheatsheet # Using environments on HPC is about keeping your workspace clean and avoiding the \u0026ldquo;Disk Quota Exceeded\u0026rdquo; error.\nAction Conda Command Venv Command Create conda create --prefix \u0026lt;path\u0026gt; python -m venv \u0026lt;path\u0026gt; Activate source activate \u0026lt;path\u0026gt; source \u0026lt;path\u0026gt;/bin/activate Install conda install / pip install pip install Clean conda clean --all pip cache purge My Advice: For new installations, Miniforge is the safest choice for AI/HPC projects. It gives you the same Conda workflow with conda-forge packages and no licensing concerns. If your cluster already provides Miniconda or Anaconda, those work just fine too. And please, avoid conda init to keep your login clean.\nHappy Computing!\n","date":"18 January 2026","externalUrl":null,"permalink":"/posts/hpc101-03/","section":"Posts","summary":"Shared HPC clusters do not give you root access, and the system Python is not yours to modify. 
This post shows how to set up isolated Python environments using Conda and venv so you can install whatever you need without permission errors or breaking someone else’s workflow.","title":"[HPC 101] Python on HPC: Conda and venv Without Root Access","type":"posts"},{"content":"","date":"18 January 2026","externalUrl":null,"permalink":"/tags/conda/","section":"Tags","summary":"","title":"Conda","type":"tags"},{"content":"","date":"18 January 2026","externalUrl":null,"permalink":"/tags/miniconda/","section":"Tags","summary":"","title":"Miniconda","type":"tags"},{"content":"","date":"18 January 2026","externalUrl":null,"permalink":"/tags/venv/","section":"Tags","summary":"","title":"Venv","type":"tags"},{"content":"Let\u0026rsquo;s turn that scary black screen into a hacker\u0026rsquo;s playground.\nLinux beginners usually consider the black terminal screen as a scary tool that might explode if they touch a wrong key. You don\u0026rsquo;t want to accidentally press a button that blows up your \u0026ldquo;home\u0026rdquo; directory. But once you get used to it, this screen makes you look like a cool hacker.\nBy the end of this post, you’ll be able to move around the Linux terminal, manage files, and edit text without panic.\n*(Click the image to watch the tutorial on YouTube)* 1. The Dark House # Navigating the CLI (Command Line Interface) is like waking up in the middle of the night.\nIf someone takes your pretty looking GUI (Graphical User Interface) away and throws you into a CLI screen, it might feel like a power outage.\nImagine you took a nap on a couch and woke up at 3 AM. It is pitch black and you know how to get to your bed, but you can\u0026rsquo;t see anything on the way.\nUsing a terminal is exactly the same. You need to verify where you are and what is around you before you take a step. If you know exactly where to go, you can walk straight to your room. But you should always be careful not to trip.\n2. 
Navigating Your Home # Okay, it\u0026rsquo;s 3 AM, the lights are out, and you don\u0026rsquo;t have a flashlight. You want to go to your bed.\nFirst, you need to locate yourself. This is what pwd (Print Working Directory) does. It tells you exactly where you are standing.\n$ pwd /home/my_family/first_floor Before you move, you want to know what\u0026rsquo;s around you so you don\u0026rsquo;t kick the table. The ls (List) command is your hands feeling the surroundings. You can add options to see hidden items or extra details.\n$ ls couch kitchen lamp restroom trash_bin TV user123_room $ ls -a couch kitchen lamp .phone .remote restroom trash_bin TV user123_room # Now you found a phone and a remote hidden under the couch! # (Files starting with \u0026#39;.\u0026#39; are hidden in Linux) $ ls -l total 32 -rwxr-xr-x. 1 family family 4096 Dec 27 20:22 couch drwxr--r--. 1 family family 4096 Dec 6 20:02 kitchen -rwxr--r--. 1 family family 10517 Dec 26 18:51 lamp drwxr-xr-x. 1 family family 4096 Dec 26 17:49 restroom -rwxr-xr-x. 1 parents family 840 Dec 26 18:03 TV -rwxr-xr-x. 1 family family 840 Dec 26 18:03 trash_bin drwxr-xr-x. 1 user123 family 4096 Dec 26 18:03 user123_room # You can see full details about each item or room # You won\u0026#39;t see hidden items here Note: You can combine -a and -l as -la to see full details of all items.\nThe detailed view (-l) shows some cryptic codes. Don\u0026rsquo;t worry: they look like secret codes, but we can decode them.\n-rwxrw-r-- 1 user group 46 Feb 14 16:37 File.txt ^ ^ ^ ^ ^ ^ ^ ^ | | | | | | | | 1 2 3 4 5 6 7 8 File Type: - (File) or d (Directory) Permissions: rwxrw-r-- (Who can do what) Owner \u0026amp; Group: Who owns this item Size \u0026amp; Time: How big and when it was last touched Let\u0026rsquo;s take a closer look at the first part. 
It defines who is allowed to enter the room or touch the item.\nType Owner Group Others [-] [rwx] [rw-] [r--] | | | | File Read Read Read Write Write Exec If you see rwx, it means the owner has full power (Read, Write, and Execute).\nNow that we have learned how to read the map, let\u0026rsquo;s start moving.\nLet\u0026rsquo;s go to the restroom first. Since you can see it in your list, you can walk straight in. Use the cd (Change Directory) command.\n$ cd restroom $ pwd /home/my_family/first_floor/restroom $ ls bath_tub body_wash hand_soap shampoo shower sink toilet towel You finished your business and want to go to your bedroom (user123_room). But wait, ls shows no room here! You have two options:\nStep out to the hall, check locations, and then enter your room. $ cd .. $ cd user123_room Go out and immediately enter your room in one go. $ cd ../user123_room What is \u0026ldquo;..\u0026rdquo;? In Linux, a single dot . represents Here (Current location), and double dots .. represent Parent location (One level up). (Unfortunately, it stops at 2 dots. There is no \u0026ldquo;...\u0026rdquo; or \u0026ldquo;....\u0026rdquo;)\nRelative vs. Absolute Path Using dots (. or ..) works within your house or close to your current position. But what if you are at a friend\u0026rsquo;s house or far away from your room? You can get out of your friend\u0026rsquo;s restroom (..), but you won\u0026rsquo;t find your room (user123_room) there. In that case, you need an Absolute Path (Full address).\n# Relative path (Works only if you are in the hallway) $ cd user123_room # Absolute path (Works from anywhere in the universe) $ cd /home/my_family/first_floor/user123_room 3. Magic Spells: File Operations # Now, we need some more imagination. You are not just a person in the dark; you are a Wizard. Your magic wand can create rooms and items, or make trash disappear.\nCreation (mkdir, touch)\nFirst, let\u0026rsquo;s create an empty room. 
The spell is mkdir (Make Directory).\n$ mkdir new_room If you want to create an item (an empty file), use touch.\n$ touch magic_scroll.txt Teleportation (mv)\nYou want to move a TV from the living room to your new room. Cast mv (Move).\n$ mv /home/my_family/first_floor/TV /home/my_family/first_floor/new_room/ Note: In Linux, renaming is just moving a file to the same place with a new name.\n$ mv old_name.txt new_name.txt Cloning (cp)\nIf you take the TV, your dad will be sad. Let\u0026rsquo;s create a clone of it using cp (Copy).\n# Copy TV to the parent directory $ cp ./TV ../ Now everyone is happy!\nDestruction (rm)\nYour mom asked you to take out the trash. With your magic power, you can simply incinerate it. Use rm (Remove).\n$ rm ./trash_bin Tip: When using rm, prefer relative paths so you clearly see what you\u0026rsquo;re deleting.\nWarning: Unlike deleting a file on Windows or macOS, rm does not move files to a Recycle Bin. When you rm a file, it\u0026rsquo;s gone forever. It is incinerated. So, please be careful when you cast this spell.\n4. X-Ray Vision (Checking Files) # While you were out, your parents left a note on the table.\nHey, we are leaving to pick up your cousin from the airport.\nPlease clean up the kitchen.\nDon\u0026rsquo;t watch TV all evening.\n\u0026hellip; (100 lines more) \u0026hellip;\nMake sure to finish your homework before we are back.\nCall if you want us to pick up anything for dinner.\nHow do you read this?\ncat: Prints the entire note at once less: Opens content in a text viewer and lets you scroll up and down head: Peeks at the top few lines tail: Peeks at the bottom few lines My Suggestion:\nCommand Use Case cat Short file (Fits in one screen) less Long file (Log files or code) head Just checking the beginning tail Checking the latest update (End of logs) Note: less is a contents viewer. You can press q to close it. 
ESC won\u0026rsquo;t close the viewer.\n# Read the first 2 lines $ head -2 note.txt Hey, we are leaving to pick up your cousin from the airport. Please clean up the kitchen. # Read the last 2 lines $ tail -2 note.txt Make sure to finish your homework before we are back. Call if you want us to pick up anything for dinner. 5. Write It Down (Editors) # You want to write a reply. On the terminal, you can\u0026rsquo;t open Microsoft Word. You need terminal editors like nano or vim.\nAnd\u0026hellip; Let\u0026rsquo;s not talk about emacs now. I\u0026rsquo;m sorry if you are an emacs fan.\nOption 1: Nano (The Notepad) If you want a simple sticky note and a pen, use nano. It\u0026rsquo;s very beginner friendly.\n$ nano reply.txt You can simply type whatever you want. The shortcuts are at the bottom.\n^ means the Ctrl key. To Save: Press Ctrl + O (Write Out), then Enter. To Exit: Press Ctrl + X. Option 2: Vim (The Pro Tool) It is a powerful tool but a bit trickier. The most important concept is Modes.\nNormal Mode: You cannot type text. You can view contents or give commands. Insert Mode: You can actually type and edit. How to survive inside Vim:\nType vim reply.txt. Press i to start typing (Insert Mode). When done, press Esc (to exit Insert Mode). Type :wq and Enter (Write and Quit). Stuck and panicking? Press Esc and type :q! (Force Quit without saving). 6. Secret Tips # Let\u0026rsquo;s keep these tips between us.\n1. Tab Autocomplete (Magic Key) Don\u0026rsquo;t type long filenames manually. Just type the first few letters and hit the TAB key.\n$ cd /home/my_family/first_f [TAB] # Becomes: $ cd /home/my_family/first_floor/ 2. History (Arrow Keys) Have you used the command before? Don\u0026rsquo;t retype the whole command. Just browse previously used commands with the Up/Down Arrow keys, then press Enter to run the one you want.\n3. The Abort Button (Ctrl+C) Stuck in a running program? Or typed a wrong command that you want to cancel? 
Press Ctrl + C.\n[user@linux]$ i_wrote_a_very_long_random_command [Ctrl+C] [user@linux]$ i_wrote_a_very_long_random_command^C [user@linux]$ (Canceled!) 4. The Clean Slate (clear) Is your screen too messy? Type clear. It wipes out the screen.\nThe Forbidden Spell One last warning. Don\u0026rsquo;t ever run this:\n$ rm -rf / This is the Nuke Button for your Linux world. It tries to delete everything from the root directory. Once launched, there is no going back.\nSummary\nNavigate: pwd (Where am I?), ls (What\u0026rsquo;s around here?), cd (Enter/Change location). Manage: mkdir (Create directory), touch (Create file), cp (Copy), mv (Move/Rename), rm (Destroy). View: cat (Short), less (Long), head/tail (Top/Bottom). Edit: nano (Simple), vim (Advanced). Survival: TAB to autocomplete, Ctrl+C to abort. Great job! You can now move comfortably in the darkness.\nHappy Computing!\n","date":"9 January 2026","externalUrl":null,"permalink":"/posts/linux101-01/","section":"Posts","summary":"Most people’s first experience with the Linux terminal is copying a command from somewhere and hoping it works. This post takes a different approach: explain what is actually happening, cover the commands you will use every single day, and make the terminal feel like a tool instead of a trap.","title":"[Linux 101] Linux Terminal for Beginners: Commands You Will Actually Use","type":"posts"},{"content":"","date":"9 January 2026","externalUrl":null,"permalink":"/series/linux-101/","section":"Series","summary":"","title":"Linux 101","type":"series"},{"content":"","date":"9 January 2026","externalUrl":null,"permalink":"/tags/terminal/","section":"Tags","summary":"","title":"Terminal","type":"tags"},{"content":"Your 50GB dataset is on your laptop. Your cluster is waiting.\nMoving files between your local machine (laptop/workstation) and the HPC cluster is a daily routine for researchers. You have your code, input data, and eventually, the results. 
This guide covers the basics of file transfer, from \u0026ldquo;packing\u0026rdquo; your files to handling massive datasets.\n*(Click the image to watch the tutorial on YouTube)* 1. The Golden Rule: Pack Before You Move # Think of this process like moving into a new house.\nIn the previous post, we compared the HPC cluster to a Hotel. Let\u0026rsquo;s assume your laptop is your old house. Now, you need to move your belongings (data) to the new place (HPC cluster).\nImagine you have 10,000 pairs of socks (small data files). Would you carry them one by one to the moving truck? No, since it\u0026rsquo;ll take forever, you would put them in a box first.\nIn HPC, transferring thousands of small files individually kills network performance due to per-file overhead. So, you should always archive your files or folders first.\nChoose Your Box: Tar vs. Zip # # Packing with tar (create archive) $ tar -czf my_data.tar.gz my_folder # -c: Create # -z: Gzip compression # -f: File name # Unpacking with tar (extract archive) $ tar -xf my_data.tar.gz # -x: Extract # -f: File name # (On most modern systems, tar detects compression automatically) # Packing with zip (create archive) $ zip -r my_data.zip my_folder # -r: Recursive (includes all subdirectories) # Unpacking with zip (extract archive) $ unzip my_data.zip 2. Direct Download (Web to HPC) # Scenario: Your data is hosted on a website.\nDo not download it to your laptop just to upload it again to the cluster. That is unnecessary double work. Just order your \u0026ldquo;delivery\u0026rdquo; directly to your new house (Cluster)!\nUse wget or curl on a cluster\u0026rsquo;s compute node (or a designated data transfer node, if your cluster provides one). Using a login node for a file transfer is usually not recommended.\n# Option 1: Using wget # wget \u0026lt;File Address\u0026gt; $ wget https://example.com/dataset.tar.gz # Option 2: Using curl # curl -o \u0026lt;File Name\u0026gt; \u0026lt;File Address\u0026gt; $ curl -o dataset.tar.gz https://example.com/dataset.tar.gz 3. 
Transfer Tools: SCP vs. Rsync # Scenario: The files are on your laptop. (Note: Run the following commands in your local terminal, not inside the cluster.)\nSCP (The \u0026ldquo;Simple Throw\u0026rdquo;) # If you have a small file or a single packed archive, use scp (Secure Copy). It is simple and quick.\n# Upload: Laptop -\u0026gt; Cluster $ scp my_data.tar.gz \u0026lt;USER\u0026gt;@\u0026lt;HOST_NAME\u0026gt;:~/ # Example: scp data.tar.gz user123@data.university.edu:~/ # Download: Cluster -\u0026gt; Laptop $ scp \u0026lt;USER\u0026gt;@\u0026lt;HOST_NAME\u0026gt;:~/results.tar.gz ./ # Example: scp user123@data.university.edu:~/results.tar.gz ./ Rsync (The \u0026ldquo;Smart Mover\u0026rdquo;) # What if your file is huge (e.g., 100GB) and your WiFi disconnects at 99%? scp will fail, and you have to start over again from 0%. That is going to be a nightmare.\nTry rsync instead. It checks the difference between source and destination. If the connection drops, rerun the same command and it resumes from where it left off.\n$ rsync -azP my_big_data \u0026lt;USER\u0026gt;@\u0026lt;CLUSTER\u0026gt;:~/ # Example: rsync -azP data.tar.gz user123@data.university.edu:~/ Understanding the flags (-azP):\n-a: Archive mode. Copies directories recursively and preserves permissions, timestamps, and symbolic links. -z: Compresses file data during the transfer, which can help on slow links. -P: Shows a Progress bar and keeps Partially transferred files (so you can resume). Rule of Thumb:\nSmall file or Simple transfer? Use SCP Big file or Unstable network? Use Rsync 4. GUI Clients (WinSCP \u0026amp; FileZilla) # \u0026ldquo;I hate the terminal. Can I just drag and drop?\u0026rdquo;\nYes, you can! 
If you are not comfortable with command-line tools yet, or if you just want to browse files visually, use an SFTP Client.\nRecommended Tools # Windows only: WinSCP (Most popular) Windows/Mac/Linux: FileZilla Windows/Mac: Cyberduck How to Connect # The settings are exactly the same as your SSH connection.\nFile Protocol: SFTP Host name: Your cluster address (e.g., data.university.edu) Port number: 22 (Default SSH port) User/Password: Your credentials Once connected, you will see your laptop\u0026rsquo;s files on the left and the cluster\u0026rsquo;s files on the right. Just drag and drop to transfer!\nNote for Globus Users: If you need to transfer massive datasets (Terabytes/Petabytes) between institutions or clusters, ask your system administrator about Globus. It is a high-performance transfer service often supported by research centers. It is typically much faster and more reliable than SCP/SFTP for large data.\n5. Code Management with Git # Scenario: Moving your Python/C++ scripts.\nShould you use rsync for your code? You can, but there is a better method. Treat your code like books in a library. You can keep old books while adding new editions and check them out. Use Git.\nLaptop: Commit and push your code to GitHub/GitLab.\n# Commit your changes $ git commit -a -m \u0026#34;Commit Message\u0026#34; # Push your changes to GitHub $ git push Cluster: Clone or Pull the repository.\n# On the Cluster $ git clone https://github.com/username/my-project.git # Pull changes $ git pull This keeps your version history safe and makes collaboration much easier.\n6. Storage Quota # Warning: Remember the \u0026ldquo;Hotel Room\u0026rdquo; analogy? Your room has an occupancy limit. We call it Quota.\nIf you fill up your disk space, your jobs will crash, and you might not be able to save files or even log in.\nHow to check? Commands vary by institution. 
Common examples include:\n$ quota -s $ lfs quota -u user123 /home/user123 $ check_usage Please check your user documentation or ask your support team for the specific command. Always check your available space before transferring a massive dataset.\nSummary\nPack your small files (tar or zip). Use wget for web data. Use scp for quick, small transfers. Use rsync -azP for large, robust transfers. Use git for code. Nice job! You have learned how to prepare your data. In the next post, we will learn how to manage software environments using Conda.\nHappy Computing!\n","date":"2 January 2026","externalUrl":null,"permalink":"/posts/hpc101-02/","section":"Posts","summary":"Copying files to a remote cluster is easy to get wrong. A basic scp works fine until your dataset is large and the connection drops halfway through. This post covers scp for quick transfers, rsync for large or resumable jobs, and git for code that belongs in version control anyway.","title":"[HPC 101] File Transfer on HPC: SCP, Rsync, and Git Explained","type":"posts"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/file-transfer/","section":"Tags","summary":"","title":"File Transfer","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/git/","section":"Tags","summary":"","title":"Git","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/rsync/","section":"Tags","summary":"","title":"Rsync","type":"tags"},{"content":"","date":"2 January 2026","externalUrl":null,"permalink":"/tags/scp/","section":"Tags","summary":"","title":"SCP","type":"tags"},{"content":"You have cluster access. The terminal is open. Now what?\nYou just got cluster access. There is a hostname, a username, and a blank terminal. This post gets you to your first real job: SSH to get in, the module system to load software, and Slurm to run something on an actual compute node.\n1. What is HPC? 
# High-Performance Computing (HPC) utilizes supercomputers or computer clusters to solve complex computational problems. While a standard workstation can handle everyday tasks, HPC is designed for massive scale, widely used in fields ranging from engineering and science to finance and psychology. It is a rapidly growing technology, especially in the age of AI and Machine Learning.\nResearch institutes and companies around the world use HPC to develop new products or run intensive simulations. One of the world’s fastest HPC systems, El Capitan, is hosted by Lawrence Livermore National Laboratory. (Reference).\nWhy do we use HPC? # HPC is a powerful tool that allows researchers and engineers to solve problems demanding high computational performance which cannot be handled by normal desktop PCs. Here are some example cases,\nAI/ML: Training large models using multiple GPUs Pharmaceutics: Simulating molecular dynamics to develop new medicines Physics/Chemistry: Running quantum chemistry or simulating protein folding Meteorology: Processing large data for accurate weather forecasting 2. How to SSH into an HPC Cluster # Before we compute, we need to connect to the cluster. Watch the tutorial video below or follow the text guide below.\n*(Click the image to watch the tutorial on YouTube)* What is SSH? # SSH (Secure Shell) is a network protocol that enables secure connections between computers. It is used for remote access, command execution, and file transfers. Don\u0026rsquo;t worry if these terms sound too technical. Simply, think of it as a secure tunnel connecting your PC to the HPC cluster.\nLet\u0026rsquo;s connect! 
# Open a terminal window.\nLinux/Mac: Open the built-in Terminal app Windows: Use Command Prompt (CMD), PowerShell, or third party tools like PuTTY or MobaXterm Type the following command:\n$ ssh \u0026lt;YOUR_ID\u0026gt;@\u0026lt;CLUSTER_HOST_NAME\u0026gt; # Example: $ ssh user123@login.university.edu (Note: The $ sign indicates the command-line prompt. Do not type it.)\nSecurity Prompt: If this is your first time connecting, you will see a message asking: \u0026ldquo;Are you sure you want to continue connecting?\u0026rdquo; Type yes and press Enter.\nEnter Password:\nType your user password.\nNote: You will NOT see asterisks (****) or a cursor moving. This is a standard security feature in Linux. Just type your password and press Enter.\nSuccess:\nIf you see a screen similar to the one below, you have successfully logged in!\n[user123@login-01 ~]$ 3. How to use Modules # On HPC, you can’t simply install software with sudo apt-get or sudo dnf. Instead, we use the Module System.\n*(Click the image to watch the tutorial on YouTube)* What is the Module System? # Most HPC clusters manage software using a module system like Environment Modules or Lmod. 
Unlike your personal computer, where you can install software system-wide, HPC clusters use modules. The benefits:\nNo Conflicts: Different users can use different software versions simultaneously Reproducibility: You can keep your environment consistent for your research Auto-loading: When you load a module (e.g., OpenMPI), it automatically loads necessary dependencies (e.g., GCC compilers) Essential Commands # Here is a cheat sheet for module commands:\n# View list of ALL available modules on the system $ module avail # Load a specific module $ module load \u0026lt;NAME\u0026gt;/\u0026lt;VERSION\u0026gt; # Example: module load openmpi/4.1.8 # View list of CURRENTLY loaded modules $ module list # Unload a module $ module unload \u0026lt;NAME\u0026gt; # Unload ALL modules $ module purge Recommended Practices # Avoid .bashrc: Do not put module load commands in your .bashrc file. This could cause conflicts and login issues. Check availability first: Use module avail to see the exact name and version. Be specific: Always specify the version number (e.g., module load openmpi/4.1.8). If you omit it, the default version is loaded, and that default may change over time. 4. Submit Your First Job with Slurm # Now, you\u0026rsquo;re ready to submit a job.\n*(Click the image to watch the tutorial on YouTube)* What is a Job Scheduler? # In an HPC environment, you do not run heavy calculations directly on the Login Node. Instead, you submit a \u0026ldquo;job\u0026rdquo; to a Scheduler like Slurm, PBS, SGE, or LSF. The scheduler manages resources and assigns your job to available Compute Nodes.\nNote: This tutorial primarily focuses on Slurm, one of the most widely used schedulers in modern HPC systems. PBS/Torque examples are provided for reference, but commands and options may vary. Always check your cluster\u0026rsquo;s documentation for scheduler-specific syntax.\nInteractive Jobs: Useful for development, debugging, or tasks requiring a GUI. You get a shell on a compute node. 
Batch Jobs: Useful for long running tasks. You submit a script, and the system runs it when resources are available. The \u0026ldquo;Hotel\u0026rdquo; Analogy # Sometimes beginners make a mistake of running heavy tasks directly after logging in. Please don’t do that.\nThink of the HPC cluster as a Hotel.\nLogin Node = Hotel Lobby: This is where you check in. It’s a shared space. You wouldn’t set up a tent and sleep in the lobby, right? Compute Node = Guest Room: This is your private room where you can actually work (sleep). Scheduler = Receptionist: You ask the receptionist (Scheduler) for a room (Resources), and they assign you one. We use a job scheduler like Slurm to ask for resources.\nLet\u0026rsquo;s submit an Interactive Job # Use this when you need to test or debug code in real-time.\nRequest a session (get a room):\n[user123@login-01]$ srun --pty bash srun: job 12345 queued and waiting for resources srun: job 12345 has been allocated resources [user123@compute-01]$ # Note: your cluster may require specifying partition: # $ srun -p interactive --pty bash Your hostname will change from login-01 to compute-01. You are now in your “Guest Room”. When you are done, type exit to return to the login node (lobby):\n[user123@compute-01 ~]$ exit [user123@login-01 ~]$ Let\u0026rsquo;s submit a Batch Job # This is for long-running simulations. You write a \u0026ldquo;batch script\u0026rdquo; (reservation request) and submit it.\nCreate a script (e.g., job_script.sh) using a text editor like vim or nano. 
#!/bin/bash # Tells the system that this is a Bash script #SBATCH --account=myAcct # Account name #SBATCH --partition=myPart # Partition name #SBATCH --job-name=first_job # Job name #SBATCH --output=result.out # Standard output log #SBATCH --error=result.err # Standard error log #SBATCH --nodes=1 # Number of nodes #SBATCH --ntasks=1 # Number of tasks (processes) #SBATCH --time=00:10:00 # Time limit (HH:MM:SS) #SBATCH --mem-per-cpu=4G # Memory per cpu # Load necessary modules module load python/3.12.12 # Run your command echo \u0026#34;Hello, HPC World!\u0026#34; python3 --version #!/bin/bash # Tells the system that this is a Bash script #PBS -A myAcct # Account name #PBS -q myQueue # Queue name #PBS -N first_job # Job name #PBS -o result.out # Standard output log #PBS -e result.err # Standard error log #PBS -l nodes=1:ppn=1 # Number of nodes and processors per node #PBS -l walltime=00:10:00 # Time limit (HH:MM:SS) #PBS -l pmem=4gb # Memory per cpu # Load necessary modules module load python/3.12.12 # Change to submission directory cd $PBS_O_WORKDIR # Run your command echo \u0026#34;Hello, HPC World!\u0026#34; python3 --version Notes: Make sure to modify the script to meet your requirements\n(Important: Replace \u0026ldquo;myAcct\u0026rdquo; and \u0026ldquo;myPart\u0026rdquo; with your actual account and partition names provided by your system administrator.) 
#SBATCH: Slurm directives readable to Slurm scheduler\n(#SBATCH is one word not \u0026ldquo;# SBATCH\u0026rdquo;) Actual tasks located under Slurm directives Your job will get terminated once your tasks are done\n(in case you submitted a longer time than required) Submit the job: $ sbatch job_script.sh Submitted batch job 12345 $ qsub job_script.sh 12345.headnode (Remember this Job ID (12345) and reference this number in your ticket!)\nCheck the status: $ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 12345 myPart first_job user123 R 0:02 1 compute-01 $ qstat -u user123 Job ID Name User Time Use S Queue -------- -------- -------- -------- - ----- 12345 first_job user123 0:02 R myQueue Job Status Columns (Slurm):\nColumn Description JOBID Your Job\u0026rsquo;s assigned ID PARTITION Partition name NAME Job name USER User name ST Job status: R=Running, PD=Pending, F=Failed, S=Suspended, CG=Completing TIME Time elapsed since job started NODES Number of requested nodes In case you want to cancel the job, use scancel \u0026lt;JOBID\u0026gt; or qdel \u0026lt;JOBID\u0026gt;\n$ scancel 12345 $ qdel 12345 View results: Once the job finishes (or disappears from squeue), check the output file:\n# Success log $ cat result.out Hello, HPC World! Python 3.12.12 # Error log (If something went wrong) $ cat result.err Summary\nSSH: The secure tunnel to enter the cluster Modules: Load software Login Node (Lobby): Only for checking in Compute Node (Room): The actual place to run your work, assigned by the Scheduler Job Submission: Use sbatch for batch scripts and srun for interactive job Congratulations! You have successfully logged in, set up your environment, and run your first job. 
In the next post, we will move our bags (Data) to this new hotel room.\nNeed Help?\nCheck your cluster\u0026rsquo;s documentation for specific Slurm configurations Use man sbatch to see all available options Most clusters have a help channel or support email Happy Computing!\n","date":"27 December 2025","externalUrl":null,"permalink":"/posts/hpc101-01/","section":"Posts","summary":"Every HPC user starts somewhere, and it usually involves staring at a login prompt with no idea what to do next. This post covers the three things you need on day one: SSH access, loading software with the module system, and submitting your first Slurm batch job.","title":"[HPC 101] Getting Started on HPC: SSH Login, Module System, and First Slurm Job","type":"posts"},{"content":"","date":"27 December 2025","externalUrl":null,"permalink":"/tags/ssh/","section":"Tags","summary":"","title":"SSH","type":"tags"},{"content":"","date":"27 December 2025","externalUrl":null,"permalink":"/tags/tutorial/","section":"Tags","summary":"","title":"Tutorial","type":"tags"},{"content":"You can SSH into a cluster. But what happens between the login prompt and a job running across 100 GPUs?\nThat gap is what I want to document.\nI\u0026rsquo;m Will Paik. I work as an HPC Machine Learning Performance Engineer at Northeastern University, where my job is bridging the tension between sysadmin priorities (\u0026ldquo;keep it stable\u0026rdquo;) and researcher priorities (\u0026ldquo;run it faster, right now\u0026rdquo;). I\u0026rsquo;ve been doing some version of that for nine years, first at Penn State and now at Northeastern.\nI\u0026rsquo;m also building a 6-node HPC cluster at home from consumer hardware. Not because I need one. Because building a system from scratch teaches you things that years of using other people\u0026rsquo;s systems do not.\n1. 
Why Start This Blog # Most HPC documentation lives at one of two extremes: enterprise guides that assume a dedicated IT staff and a six-figure budget, or Stack Overflow threads that solve one specific problem without explaining why it happened. There is not much for someone who wants to understand how the whole system fits together.\nThe second gap is at the intersection of HPC and ML. Most ML practitioners know how to write a training script. Most HPC practitioners know how to manage a cluster. Very few resources explain the space in between: what you need to understand when you are running distributed training on real infrastructure and debugging problems that span the scheduler, the network, the filesystem, and the model at the same time.\n2. What I Am Thinking of Doing # For now, I am thinking about practical guides for researchers who are new to shared HPC clusters. Not \u0026ldquo;here is the sbatch man page.\u0026rdquo; More like: here is what you need to know to actually get work done without accidentally wiping your home directory or getting your account suspended.\nAfter that, I plan to document the home cluster build end to end. Every hardware decision, every config file, every mistake. Networking, storage, job scheduling, Ansible automation, and eventually GPU workloads. I want the posts to read like a real build log, not a tutorial written after everything already worked.\nWhether that leads into ML infrastructure, distributed training, cluster benchmarking, or something else, I will figure that out as I go.\n3. How I Am Approaching It # Real hardware. Real output. Not just the final working version, but the failures and the fixes that got there. The cluster I am building has real constraints: consumer CPUs, consumer networking, a gaming PC as a GPU node. The solutions have to work in that context, which means they should be useful for researchers working with similarly constrained infrastructure.\nEverything will be bilingual where possible. 
English and Korean posts will cover the same content.\n4. Where to Follow Along # Videos for the cluster build episodes will go on The Login Node YouTube channel. Code and config files will go on GitHub as the projects develop.\nHappy Computing!\n","date":"16 December 2025","externalUrl":null,"permalink":"/posts/hello-world/","section":"Posts","summary":"The first post on The Login Node. Will Paik explains why he started the blog, what direction he is planning to take it, and how he is approaching it.","title":"Hello World: What This Blog Is and Why It Exists","type":"posts"},{"content":"","date":"16 December 2025","externalUrl":null,"permalink":"/tags/introduction/","section":"Tags","summary":"","title":"Introduction","type":"tags"},{"content":"Hi, I\u0026rsquo;m Will Paik. Welcome to The Login Node.\nI\u0026rsquo;m an HPC Performance Engineer specializing in optimizing large-scale GPU clusters for AI/ML workloads. In supercomputing, there\u0026rsquo;s always a natural tension between system administrators (\u0026ldquo;Keep it stable!\u0026rdquo;) and researchers (\u0026ldquo;Run it faster!\u0026rdquo;). My job is to find the technical sweet spot that makes both happy.\nDuring the day I work on production HPC infrastructure for AI research. Outside of work, I build a mini-supercomputer from consumer hardware and document every step of it here.\nCORE STACK: Slurm Linux Docker/Apptainer PyTorch Distributed Ansible\nWhat You\u0026rsquo;ll Find Here # The Login Node is an HPC and ML infrastructure engineering blog aimed at people who want to understand how the underlying systems actually work \u0026ndash; not just how to submit a job and wait.\nContent is organized into three series:\n🔧 HPC From Scratch \u0026ndash; Building a real 6-node cluster from consumer hardware under $1,300. Hardware selection, OS install, networking, Slurm, Ansible, and GPU workloads. Start here. 🎓 HPC 101 \u0026ndash; SSH, module systems, Slurm fundamentals, and job debugging. 
For researchers new to HPC. Start here. 🐧 Linux 101 \u0026ndash; Terminal basics for people who find the command line intimidating. Start here. My Home Cluster # Role Hardware Specs Login Node Lenovo IdeaPad 1 Ryzen 5 7520U, 8GB RAM Management Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM Visualization Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM Worker Nodes (x2) Lenovo ThinkCentre M715q Ryzen 5 2400GE, 16GB RAM GPU Node HP Envy TE01 Core i7-10700F, 32GB RAM, GTX 1660 Super Storage (via Management) 1TB NVMe SSD (NFS) Network Gigabit Managed Switch 8-port, VLAN support Software stack: Rocky Linux 10, Slurm 25, Ansible, Apptainer, Prometheus + Grafana (in progress)\nBackground # I hold a PhD in Aerospace Engineering from Penn State with a minor in Computational Science, and spent 8 years there supporting 500+ researchers before moving to Northeastern University. The astrodynamics background informs how I think about large-scale optimization problems which I just applied to GPU clusters instead of spacecraft trajectories.\nFor the full professional history, see the Career page.\nGet in Touch # GitHub LinkedIn YouTube ","date":"1 January 2025","externalUrl":null,"permalink":"/about/","section":"","summary":"","title":"About","type":"page"},{"content":"Date: 2015–2016 Institution: Pennsylvania State University, University Park, PA\nI served as a mentor and instructor for engineering undergraduates, focusing on computational methods and programming logic.\nAerospace Analysis: Assisted students with numerical methods and engineering analysis. Programming for Engineers: Mentored students on MATLAB programming logic and algorithm development. 
(This entry archives past academic teaching experience at Penn State University.)\n","date":"1 January 2015","externalUrl":null,"permalink":"/talks/teaching-psu/","section":"Talks \u0026 Workshops","summary":"","title":"Academic Teaching Experience (2015–2016)","type":"talks"},{"content":"","date":"1 January 2015","externalUrl":null,"permalink":"/tags/teaching/","section":"Tags","summary":"","title":"Teaching","type":"tags"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"HPC Engineer with 9+ years of experience in system architecture and performance optimization for scientific and AI/ML workloads. Focus on translating hardware capabilities into real-world application performance, particularly for large-scale AI/ML training. Experienced in supporting diverse research needs across university clusters, statewide research infrastructure, and custom-built systems.\nExperience # HPC Machine Learning Performance Engineer Research Computing, Northeastern University | Jan 2025 – Present\nOptimizing distributed ML workloads on production HPC clusters Performance analysis and benchmarking for AI/ML applications Supporting researchers with computational workflow optimization across multiple disciplines GPU Training Benchmarking, AICR Benchmarking Group Massachusetts AI Computing Resource (AICR) | 2026\nContributed to statewide AI research infrastructure benchmarking initiative (independent of Northeastern) Responsible for GPU training workload benchmarking within a broader cross-institutional effort Evaluation targets include B200 and RTX Pro 6000 clusters across multi-node configurations HPC Software Consultant Institute for Computational and Data Sciences, Penn State University | Jan 2017 – Dec 2024\n8 years supporting 500+ researchers across 
multiple disciplines Cluster performance optimization and user support Developed containerized environments for reproducible research (Singularity Hub contributor) System monitoring, resource allocation, and job optimization Parallel Computing Support Application Engineer (Internship) MathWorks | Summer 2021\nOptimized parallel computing toolboxes for MATLAB Developed performance benchmarks for distributed computing Created documentation for HPC integration with MATLAB Technical Skills # Category Tools Schedulers Slurm, PBS (job arrays, dependency chains, resource optimization) Parallel Computing MPI (OpenMPI, Intel MPI), OpenMP, CUDA Storage NFS, parallel filesystems, data management strategies Containerization Singularity/Apptainer, Docker, Podman Automation Ansible, Bash scripting, system provisioning Monitoring Prometheus, Grafana, performance metrics Languages Python, C/C++, Fortran, MATLAB, Shell scripting Version Control Git, GitLab CI/CD Performance Profiling, optimization, bottleneck analysis Projects # Project Description HPC From Scratch 6-node cluster from consumer hardware. Slurm, Ansible, NFS, FreeIPA, Lmod. PyTorch DDP Benchmark Multi-GPU/multi-node distributed training scaling benchmark for HPC clusters. [GitHub] pkg_audit RPM package consistency audit tool with Slurm partition sweep and Ansible remediation. [GitHub] 4D LiDAR SLAM Optimization Parallelized point cloud processing for real-time ROS 2 performance Side Projects (upcoming) Game of Life web app, browser poker, ESP32-P4 thermal camera Other Engineering Projects NIST First Responder UAS Indoor Challenge (2022) Award: 3rd place + First Responder\u0026rsquo;s Choice (Prize: $80,000) Custom quadcopter for GPS-denied indoor emergency scenarios. [nist.gov]\nVFS Design-Build-Vertical Flight Student Competition (2021 and 2022) Award: 3rd place (2022), 1st place in preliminary reports + Best Computational Simulation Award (2021). 
[engr.psu.edu]\n9th and 10th ESA Global Trajectory Optimization Competition (2017 and 2019) Developed parallel algorithms for complex trajectory optimization. [psu.edu]\nEducation # The Pennsylvania State University\nDegree Year Notes PhD, Aerospace Engineering 2024 Minor in Computational Science. Dissertation: Multiple Gravity-Assist Trajectory Design with Continuous-Thrust Synergetic Maneuvers MS, Aerospace Engineering 2015 Minor in Computational Science. Thesis: Optimal Orbit Raising Via Particle Swarm Optimization BS, Aerospace Engineering 2013 Talks \u0026amp; Workshops # See the Talks page for details.\nTalk Venue Year Introduction to Parallel Computing Northeastern University Spring 2026 Linux Essentials for HPC Researchers Northeastern University Spring 2026 Teaching Assistant \u0026ndash; Aerospace Analysis, Programming for Engineers Penn State University 2015–2016 ","externalUrl":null,"permalink":"/cv/","section":"","summary":"","title":"CV","type":"page"}]