Kamiak Cluster at WSU
Here we document our experience using the Kamiak HPC cluster at WSU.
Resources
Kamiak Specific
Kamiak Users Guide: Read this.
Service Requests: Request access to Kamiak here and use this for other service requests (software installation, issues with the cluster, etc.)
Queue List: List of queues.
General
SLURM: Main documentation for the current job scheduler.
Lmod: Environment module system.
Conda: Package manager for python and other software.
TL;DR
If you have read everything below, then you can use this job script.
Notes:
Make sure that you can clone everything (i.e. any pip-installable packages) without an SSH agent.
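The job script referenced above is not reproduced here; a minimal sketch of such a batch script looks like the following (queue name, resources, environment name, and entry point are all illustrative assumptions):

```bash
#!/bin/bash
#SBATCH --partition=kamiak        # backfill queue (assumed default)
#SBATCH --job-name=example
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=%x_%j.out

module load conda                 # shared conda module (see below)
conda activate my_env             # hypothetical environment name

srun python my_script.py          # hypothetical entry point
```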
Python on a Single Node
If you are running only on a single node, then it makes sense to create an environment that uses the `/local` scratch space, since this is the fastest sort of storage available. Here we create the environment in our SLURM script, storing the location in `my_workspace`.
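A hedged sketch of the relevant portion of such a SLURM script (the directory layout and package list are assumptions):

```bash
# Inside the SLURM script: build the environment on the node-local disk.
# /local is per-node storage, so this only works for single-node jobs.
my_workspace="/local/$USER/$SLURM_JOB_ID"   # assumed layout
mkdir -p "$my_workspace"
conda create -y -p "$my_workspace/env" python=3 numpy scipy
conda activate "$my_workspace/env"
```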
Overview
Using the cluster requires understanding the following components:
Obtaining Access
Request access by submitting a service request. Identify your advisor/supervisor.
Connecting
To connect to the cluster, use SSH. I recommend generating and installing an SSH key so you can connect without a password.
Jobs and Queues
All activity, including development and software installation, must be run on the compute nodes. You gain access to these by submitting a job to the appropriate job queue (scheduled with SLURM). There are two types of jobs:

Dedicated jobs: If you or your supervisor own nodes on the system, you can submit jobs to the appropriate queue (for example, the CAS queue `cas`) and gain full access to those nodes, kicking anyone else off. Once you have access to your nodes, you can do what you like.

Backfill jobs: The default is to submit a job to the backfill queue `kamiak`. These will run on whatever nodes are not occupied, but can be preempted by the owners of the nodes. For this reason, you must implement a checkpoint-restart mechanism in your code so you can pick up where you left off when you get preempted.
On top of these, you can choose either background jobs (for computation) or interactive jobs (for development and testing).
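The checkpoint-restart requirement can be sketched as follows (the state-file name and step count are arbitrary; SLURM sends SIGTERM shortly before killing a preempted job, which is what we trap here):

```shell
#!/bin/bash
# Checkpoint-restart sketch: resume from the last saved step.
STATE=checkpoint.txt
i=0
[ -f "$STATE" ] && i=$(cat "$STATE")   # restart: pick up where we left off

save() { echo "$i" > "$STATE"; }
trap 'save; exit 0' TERM               # on preemption, save state and exit

while [ "$i" -lt 10 ]; do
    echo "step $i"                     # stand-in for one unit of real work
    i=$((i + 1))
    save                               # checkpoint after every step
done
rm -f "$STATE"                         # finished: clear the checkpoint
```

Because progress is saved after every step, a preempted run restarts from the last completed step instead of from scratch.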
Resources
When you submit a job, you must know:
How many nodes you need.
How many processes you will run.
Roughly how much memory you will need.
How long your job will take.
Make sure that your actual usage matches your request. To do this you must profile your code. Understand the expected memory and time usage before you run, then actually test this to make sure your code is doing what you expect. If you exceed the requested resources, you may slow down the cluster for other users. E.g. launching more processes than there are threads on a node will cause thread contention, significantly impacting the performance of your program and that of others.
Nodes are a shared resource - request only what you need and do not use more than you request.
Software
Much of the software on the system is managed by the Lmod module system. Custom software can be installed by sending service requests, or built in your own account. I maintain an up-to-date conda installation and various environments.
Preliminary
SSH
To connect to the cluster, I recommend configuring your local SSH client with something like this. (Change `m.forbes` to your username!)

This will allow you to connect with `ssh kamiak` rather than typing your full username and host each time. Then use `ssh-keygen` to create a key and copy it to `kamiak:~/.ssh/authorized_keys`. The second entry allows you to connect directly to the compute nodes, forwarding ports so you can run Jupyter notebooks. Only do this for nodes for which you have been granted control through the scheduler.
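The configuration looked something like the following (the hostnames, node pattern, and port are assumptions based on the description above):

```
# ~/.ssh/config
Host kamiak
    HostName kamiak.wsu.edu
    User m.forbes            # change to your username!
    ForwardAgent yes

# Second entry: connect directly to a compute node via the login node,
# forwarding a port for Jupyter.  Only use nodes you have been granted
# through the scheduler.
Host cn*
    User m.forbes
    ProxyJump kamiak
    LocalForward 8888 localhost:8888
```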
Interactive Queue
Before doing any work, be sure to start an interactive session on one of the nodes. (Do not do work on the login nodes; this is a violation of the Kamiak user policy.) Once you have tested and profiled your code, run it with a non-interactive job in the batch queue.
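For example (Kamiak documents an `idev` wrapper for interactive sessions; the plain `srun` form should behave similarly, and the resource values here are illustrative):

```bash
# One-hour interactive session with 4 CPUs on the backfill queue:
idev --partition=kamiak --cpus-per-task=4 --time=1:00:00

# Roughly equivalent plain-SLURM form:
srun --partition=kamiak --cpus-per-task=4 --time=1:00:00 --pty bash -l
```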
Home Setup
I have included the following setup. This will cause your `~/.bashrc` file to load some environment variables and create links to the data directory.

If you do not have a `.bashrc` file, then you can copy mine and similar related files. If you do have one, then you can append these commands using `cat`:
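The append step can be sketched as follows (the snippet lines are hypothetical; here we write to a temporary copy so you can inspect the result before touching your real `~/.bashrc`):

```shell
# Append setup lines with a quoted heredoc (no variable expansion).
bashrc=$(mktemp)                    # stand-in for ~/.bashrc
cat >> "$bashrc" <<'EOF'
# Kamiak setup (hypothetical example lines)
export DATA=/data/forbes
module load conda 2>/dev/null || true
EOF
```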
In addition to this, you want to make sure that your `.bashrc` file loads any required modules that might be needed by default. For example, if you want to be able to `hg push` code to Kamiak, you will need to ensure that an appropriate module providing mercurial is loaded. This can be done with the `conda` module below, which is what I do above.

Make sure you add your username to the `.hgrc` file; to create it:
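A minimal `.hgrc` might look like this (the username line is the part you must change; everything else is illustrative):

```
# ~/.hgrc
[ui]
username = Your Name <your.name@wsu.edu>

[extensions]
# Enable only the extensions you need; this is just an example.
color =
```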
Conda
I do not have a good solution yet for working with Conda on Kamiak. Here are some goals and issues:
Goals
Allow users to work with custom environments, ensuring reproducible computing.
Allow users to install software using `conda`. (The other option is to use `pip`, but I am migrating to make sure all of my packages are available on my `mforbes` anaconda channel.)
Issues
Working with conda in the user's home directory (the default) or on `/scratch` is very slow. For some timings, we installed a minimal python3 environment two times in succession (so that the second time needs no downloads). We also compare the time required to copy the environment to the home directory, and the time it takes to run `rm -r pkgs envs`:
| Location | Fresh Install | Second Install | Copy to Home | Removal |
|---|---|---|---|---|
| Home | 3m32s | 1m00s | N/A | 1m03s |
| Scratch | 2m16s | 0m35s | 2m53s | 0m45s |
| Local | 0m46s | 0m11s | 1m05s | 0m00s |
Recommendation
If you need a custom environment, use the local drive `/local` and build it at the start of your job. A full anaconda installation takes about 5m24s on `/local`.

If you need a persistent environment, build it in your home directory, but keep the `pkgs` directory on Scratch or Local to avoid exceeding your quota. (Note: conda environments are not relocatable, so you can't just copy the one you built on Local to your home directory. Given the copy speeds, it is faster to just build the environment again.)
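Redirecting the `pkgs` directory can be done with a `.condarc` setting, e.g. (the scratch path is an assumption; use whatever scratch directory you actually have):

```bash
# Store downloaded packages on scratch instead of under ~/.conda:
conda config --add pkgs_dirs /scratch/$USER/conda_pkgs
```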
Playing with Folders
We will need to manage our own environment so we can install appropriate versions of the python software stack. In principle this should be possible with Anaconda 4.4 (see this issue – Better support for conda envs accessed by multiple users – for example), but Kamiak does not yet have this version of Conda. Until then, we maintain our own stack.
Conda Root Installation
We do this under our lab partition /data/forbes/apps/conda
so that others in our group can share these environments. To use these do the following:
`module load conda`: This will allow you to use our conda installation.
`conda activate`: This activates the base environment with `hg` and `git-annex`.
`conda env list`: This will show you which environments are available. Choose the appropriate one and then:
`conda activate --stack <env>`: This will activate the specified environment, stacking it on top of the base environment so that you can continue to use `hg` and `git-annex`.
`conda deactivate`: Do this a couple of times when you are done to deactivate your environments.
`module unload conda`: Optionally, unload the conda module.

Note: you do not need to use the undocumented `--stack` feature for just running code: `conda activate <env>` will be fine.
Primary Conda Environments (OLD)
Once these base environments are installed, we lock the directories so that they cannot be changed accidentally.
To use python, first load the module of your choice:
Now you can create an environment in which to update everything.
Now you can activate `work3` and update anaconda etc.
Some files are installed, but most are linked so this does not create much of a burden.
Issues
The currently recommended approach for setting up conda is to source the file .../conda/etc/profile.d/conda.sh
. This does not work well with the module system, so I had to write a custom module file that does what this file does. This may get better in the future if the following issues are dealt with:
#6820: Consider shell-agnostic activate.d/deactivate.d mechanism: This one even suggests using Lmod for activation.
#7407: Some conda environment variables are not being unset when you deactivate the virtual environment: Closed, but references issue #7609.
#7609: add conda deactivate --all flag: Might not help.
References
Conda Docs: Multi-User support: It seems like the Kamiak installations do not use a top-level `.condarc` file.
Issue 1329: Better support for conda envs accessed by multiple users.
Constructor Issue 145: `conda --clone` surprised me by downloading a stack of files.
Inspecting the Cluster
Sometimes you might want to see what is happening with the cluster and various jobs.
Queue
To see what jobs have been submitted use the squeue
command.
Nodes
Suppose you are running on a node and performance seems to be poor. It might be that you are overusing the resources you have requested. To see this, you can log into the node and use the top
command. For example:
This tells us that I have 1 job running on node `cn94`, which requested 5 CPUs, while user `l...` is running 4 jobs having requested a total of 20 CPUs, and user `e...` is running 2 jobs, having requested 1 CPU each. (Note: to see the number of CPUs, I needed to manually adjust the format string as described in the manual.)
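The adjusted invocation was along these lines (the exact format string is an example; `%C` is the documented `squeue` field for the number of allocated CPUs):

```bash
# Show jobs on node cn94 with a CPUs column (%C):
squeue -w cn94 -o "%.10i %.9P %.8j %.8u %.2t %.10M %.4C"
```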
Node Capabilities
To see what the compute capabilities of the node are, you can use the lscpu
command:
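For example, to pull out just the topology lines:

```bash
lscpu | grep -E '^(CPU\(s\)|Socket|Core|Thread)'
```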
This tells us some information about the node, including that there are 14 cores per socket and 2 sockets, for a total of 28 cores on the node, so the 27 requested CPUs above should run fine.
Node Usage
To see what is actually happening on the node, we can log in and run top:
Here I am just looking with `top`, but the other users are running 13 processes that are each using a full CPU on the node. The 3.6% ≈ 1/28, since the node has 28 CPUs. (To see this view, you might have to press "Shift-I" while running top to disable Irix mode. If you want to save this as the default, press "Shift-W", which will write the defaults to your `~/.toprc` file.)
Note: there are several key-stroke commands you can use while running `top` to adjust the display. When two options are available, the lower-case version affects the per-process listing below, while the upper-case version affects the top summary line:

`e`/`E`: Changes the memory units.
`I`: Irix mode: toggles between CPU usage as a % of node capability vs as a % of a single CPU's capability.
Software
Modules
To find out which modules exist, run module avail
:
You can also use `module spider` for searching. For example, to find all the modules related to conda you could run:
To inspect the actual module file (for example, if you would like to make your own based on this) you can use the module show
command:
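The three commands above, sketched together (these require the Lmod module system on the cluster):

```bash
module avail            # list all available modules
module spider conda     # search for conda-related modules
module show conda       # print the module file itself
```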
Running Jobs
Before you consider running a job, you need to profile your code to determine the following:
How many nodes and how many cores-per-node do you need?
How much memory do you need per node?
How long will your program run?
What modules do you need to load to run your code?
What packages need to be installed to run your code?
Once you have this information, make sure that your code is committed to a repository, then clone this repository to Kamiak. Whenever you perform a serious calculation, you should make sure you are running from a clean checkout of a repository with a well-defined set of libraries installed so that your runs are reproducible. This information should be stored along side your data so that you know exactly what version of your code produced the data.
Here are my recommended steps.
Run an interactive session.
Log in directly to the node so the SSH agent gets forwarded.
Check out your code into a repository.
Link your run folder to `~/now`.
Make a SLURM file in `~/runs`.
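A hypothetical transcript of these steps (the node name, repository URL, and paths are all placeholders):

```bash
# 1-2. interactive session, logging in directly so the agent is forwarded
ssh cn94                                   # a node you have been allocated

# 3. check out the code
hg clone ssh://hg@example.org/my_project ~/work/my_project

# 4-5. link the run folder and submit from ~/runs
ln -s ~/work/my_project ~/now
sbatch ~/runs/my_run.slurm
```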
Issues
Interactive Jobs do not ForwardAgent
Jupyter Notebook: Tunnel not working
For some reason, trying to tunnel to compute nodes is failing. It might be that administrative settings disallow TCP forwarding through tunnels, or it might be something with the multi-hop.
Mercurial and Conda
I tried the usual approach of putting mercurial in the conda `base` environment, but when running conda, mercurial cannot be found. Instead, one needs to load the mercurial module. I need to see if this will work with `mmfhg`.
Permissions
Building and Installing Software
The following describes how I have built and installed various pieces of software. You should not do this - just use the software as described above. However, this information may be useful if you need to install your own software.
Conda
Our base conda environment is based on the mforbes/base environment and includes:
Mercurial, with topics and the hg-git bridge.
Black
Anaconda Project
Poetry
mmf-setup
nox and nox-poetry
To create and update environments:
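The create/update commands were something like the following (the channel/environment names follow the `mforbes/base` specification mentioned above, but the exact invocations are assumptions):

```bash
# Create the base environment from the shared specification:
conda env create mforbes/base

# Update an existing environment in place:
conda env update -n base mforbes/base
conda update -n base --all
```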
conda.lua
MyRepos
mmfhg
To Do
Get these working: `mmfhg`, `mmfutils`, `mmf_setup`, `hgrc`, `mr`, `gitannex`.
Questions
Kamiak
How to forward an SSH port to a compute node?
How to use a SLURM script to configure the environment, and for interactive sessions?
Conda: best way to setup environments?
Some options:
Install environments on the local scratch directory (only good for single-node jobs).
Install into `~` but redirect the conda package dir to local or scratch. (Makes sure we can use current packages.)
Install in global scratch, which is good for 2 weeks.
Cloning the base environment? In principle this should allow one to reuse much of the installed material, but in practice it seems like everything gets downloaded again.
First remove my conda stuff from my `.bashrc` file.
Initial attempt: install a package that is not in the installed anaconda distribution:
Try creating a clone environment with `conda create -n mmf --clone base`. This is not a good option as it downloads a ton of stuff into `~/.conda/envs/mmf` and `~/.conda/pkgs`.
Stack on top of another environment? This is an undocumented feature that allows you to stack environments. After playing with it a bit, however, it seems like it would only be useful for different applications, not for augmenting a python library.
This fails because it does not install python. The previous python is used and it cannot see the new uncertainties package.
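The failed experiment, sketched (the environment name is arbitrary; the package follows the text above):

```bash
conda create -y -n extras uncertainties   # env without its own python
conda activate base
conda activate --stack extras             # stack on top of base
python -c "import uncertainties"          # fails: base's python does not
                                          # see the stacked env's packages
```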
Presumably people can update software.
Currently it seems I need to use my own conda (until anaconda 4.4.0)
Programming
How to profile simple GPU code?
Investigations
Here we include some experiments run on Kamiak to see how long various things take. These results may change as the system undergoes transformations, so this information may be out of date.
Conda
Here we investigate the timing of creating some conda environments using the user's home directory vs `/scratch` vs `/local`:
Home
From this we see that there is some space saving from the use of hard-links. Note that the packages also take up quite a bit of space.