
Kamiak Cluster at WSU

Here we document our experience using the Kamiak HPC cluster at WSU.

Resources

Kamiak Specific

General

  • SLURM: Main documentation for the current job scheduler.

  • Lmod: Environment module system.

  • Conda: Package manager for python and other software.


TL;DR

If you have read everything below, then you can use this job script.

Notes:

  • Make sure that you can clone everything (i.e., any pip-installable packages) without an SSH agent.

Python on a Single Node

If you are running only on a single node, then it makes sense to create an environment that uses the /local scratch space, since this is the fastest sort of storage available. Here we create the environment in our SLURM script, storing its location in my_workspace.

#!/bin/bash
#SBATCH -n 1          # Number of cores
#SBATCH -t 0-00:10    # Runtime in D-HH:MM

# Local workspace for install environments.
# This will be removed at the end of the job.
my_workspace="$(mkworkspace --backend=/local --quiet)"

function clean_up {
    # Clean up.  Remove temporary workspaces and the like.
    rm -rf "${my_workspace}"
    exit
}

trap 'clean_up' EXIT

# TODO: Why does hg-conda not work here?
module load conda mercurial
conda activate base

# TODO: Make this in /scratch for long-term use
export CONDA_PKGS_DIRS="${my_workspace}/.conda"
conda_prefix="${my_workspace}/current_conda_env"
#conda env create -f environment.yml --prefix "${conda_prefix}"
mamba env create -q -f environment.yml --prefix "${conda_prefix}"
conda activate "${conda_prefix}"

...  # Do your work.
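To submit and manage this script (the file name here is hypothetical):

sbatch single_node.sbatch   # Submit; prints "Submitted batch job <jobid>".
squeue -u $USER             # Check the job's status.
scancel <jobid>             # Cancel it if something goes wrong.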

Overview

Using the cluster requires understanding the following components:

Obtaining Access

Request access by submitting a service request. Identify your advisor/supervisor.

Connecting

To connect to the cluster, use SSH. I recommend generating and installing an SSH key so you can connect without a password.

Jobs and Queues

All activity – including development, software installation, etc. – must be run on the compute nodes. You gain access to these by submitting a job to the appropriate job queue (scheduled with SLURM). There are two types of jobs:

  • Dedicated jobs: If you or your supervisor own nodes on the system, you can submit jobs to the appropriate queue and gain full access to these, kicking anyone else off. Once you have access to your nodes, you can do what you like. An example would be the CAS queue cas.

  • Backfill jobs: The default is to submit a job to the Backfill queue kamiak. These will run on whatever nodes are not occupied, but can be preempted by the owners of the nodes. For this reason, you must implement a checkpoint-restart mechanism in your code so you can pick up where you left off when you get preempted (see the sketch at the end of this section).

On top of these, you can choose either background jobs (for computation) or interactive jobs (for development and testing).
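A minimal sketch of what checkpoint-restart support might look like in a backfill job script (the script and file names here are hypothetical):

# Resume from the latest checkpoint if one exists; otherwise start fresh.
# After preemption and requeue, the job picks up where it left off.
if [ -f checkpoint.dat ]; then
    python run.py --restart checkpoint.dat
else
    python run.py
fi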

Resources

When you submit a job, you must know:

  • How many nodes you need.

  • How many processes you will run.

  • Roughly how much memory you will need.

  • How long your job will take.

Make sure that your actual usage matches your request. To do this you must profile your code. Understand the expected memory and time usage before you run, then actually test this to make sure your code is doing what you expect. If you exceed the requested resources, you may slow down the cluster for other users. E.g. launching more processes than there are threads on a node will cause thread contention, significantly impacting the performance of your program and that of others.

Nodes are a shared resource - request only what you need and do not use more than you request.
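For example, the request portion of a batch script might look like this (the numbers are placeholders; set them from your profiling):

#SBATCH --nodes=1        # Number of nodes.
#SBATCH --ntasks=4       # Number of processes.
#SBATCH --mem=8G         # Memory per node.
#SBATCH --time=0-02:00   # Runtime in D-HH:MM.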

Software

Much of the software on the system is managed by the Lmod module system. Custom software can be installed by sending service requests, or built in your own account. I maintain an up-to-date conda installation and various environments.

Preliminary

SSH

To connect to the cluster, I recommend configuring your local SSH client with something like this. (Change m.forbes to your username!)

# ~/.ssh/config
Host kamiak
  HostName kamiak.wsu.edu
  User m.forbes
  ForwardAgent yes

Host cn*
  ProxyCommand ssh kamiak nc %h %p
  User m.forbes
  ForwardAgent yes
  # The following are for jupyter notebooks.  Run with:
  #   jupyter notebook --port 18888
  # and connect with
  #   https://localhost:18888
  ######## PORT FORWARDING TO NODES DOES NOT WORK.
  #LocalForward 10001 localhost:10001
  #LocalForward 10002 localhost:10002
  #LocalForward 10003 localhost:10003
  #LocalForward 18888 localhost:18888
  # The following is for snakeviz
  #LocalForward 8080 localhost:8080

This will allow you to connect with ssh kamiak rather than ssh [email protected]. Then use ssh-keygen to create a key and copy it to kamiak:~/.ssh/authorized_keys. The second entry allows you to directly connect to the compute nodes, forwarding ports so you can run Jupyter notebooks. Only do this for nodes for which you have been granted control through the scheduler.
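For example, with OpenSSH (assuming ssh-copy-id is available on your machine):

ssh-keygen -t ed25519   # Generate a key pair; accept the defaults.
ssh-copy-id kamiak      # Append the public key to kamiak:~/.ssh/authorized_keys.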

Interactive Queue

Before doing any work, be sure to start an interactive session on one of the nodes. (Do not do work on the login nodes; this is a violation of the Kamiak user policy.) Once you have tested and profiled your code, run it with a non-interactive job in the batch queue.

$ idev --partition=kamiak -t 60
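If idev is unavailable, an equivalent interactive session should be obtainable with plain SLURM:

srun --partition=kamiak --time=60 --pty bash -i   # 60-minute interactive shell on a compute node.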

Home Setup

I have included the following setup. It causes your ~/.bashrc file to load some environment variables and creates links to the data directory.

ln -s /data/lab/forbes ~/data
ln -s ~/data/bashrc.d/inputrc ~/.inputrc          # Up-arrow history for commands
ln -s ~/data/bashrc.d/bash_alias ~/.bash_alias    # Sets up environment

If you do not have a .bashrc file, then you can copy mine and the related files:

cp ~/data/bashrc.d/bashrc ~/.bashrc
cp ~/data/bashrc.d/bash_profile ~/.bash_profile
cp ~/data/bashrc.d/hgrc ~/.hgrc
cp ~/data/bashrc.d/hgignore ~/.hgignore

If you do have one, then you can append these commands using cat:

cat >> ~/.bashrc <<EOF
# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=

# Source global definitions
if [ -f ~/.bash_alias ]; then
    . ~/.bash_alias
fi

# Load the conda module which has mercurial and mr
module load conda mr
conda activate
EOF

In addition to this, you want to make sure that your .bashrc file loads any modules that might be needed by default. For example, if you want to be able to hg push code to Kamiak, you will need to ensure that a module providing mercurial is loaded. This can be done with the conda module, which is what I do above.

Make sure you add your username to the .hgrc file. Create it:

# Mercurial (hg) Init File; -*-Shell-script-*-
# dest = ~/.hgrc    # Keep this as the 2nd line for mmf_init_setup
#
# Place site-specific customizations in the appropriate .hg_site file.

[ui]
######## Be sure to add a name here, or to your ~/.hgrc_personal file.
#username = Your Full Name <[email protected]>

# Common global ignores
ignore.common = ~/.hgignore

[extensions]
graphlog =
extdiff =
rebase =
record =
histedit =

Conda

I do not have a good solution yet for working with Conda on Kamiak. Here are some goals and issues:

Goals

  • Allow users to work with custom environments ensuring reproducible computing.

  • Allow users to install software using conda. (The other option is to use pip, but I am migrating to make sure all of my packages are available on my mforbes anaconda channel.)

Issues

  • Working with conda in the user's home directory (the default) or on /scratch is very slow. For the timings below, we install a minimal python3 environment twice in succession (so that the second install needs no downloads). We also compare the time required to copy the environment to the Home directory, and the time it takes to run rm -r pkgs envs:

Location   Fresh Install   Second Install   Copy Home   Removal
Home       3m32s           1m00s            N/A         1m03s
Scratch    2m16s           0m35s            2m53s       0m45s
Local      0m46s           0m11s            1m05s       0m00s

Recommendation

  • If you need a custom environment, use the Local drive /local and build it at the start of your job. A full anaconda installation takes about 5m24s on /local.

  • If you need a persistent environment, build it in your Home directory, but keep the pkgs directory on Scratch or Local to avoid exceeding your quota. (Note: conda environments are not relocatable, so you can't just copy the one you built on Local to your home directory. With the copy speeds, it is faster just to build the environment again.)
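A sketch of the persistent-environment setup (paths are illustrative):

# Keep the conda package cache off your home quota.
export CONDA_PKGS_DIRS="/scratch/${USER}/conda/pkgs"
mkdir -p "${CONDA_PKGS_DIRS}"
# The environment itself still lives in ~/.conda/envs.
conda create -y -n work3 python=3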

Playing with Folders

We will need to manage our own environment so we can install appropriate versions of the python software stack. In principle this should be possible with Anaconda 4.4 (see this issue – Better support for conda envs accessed by multiple users – for example), but Kamiak does not yet have this version of Conda. Until then, we maintain our own stack.

Conda Root Installation

We do this under our lab partition /data/lab/forbes/apps/conda so that others in our group can share these environments. To use these environments, do the following:

  1. module load conda: This will allow you to use our conda installation.

  2. conda activate: This activates the base environment with hg and git-annex.

  3. conda env list: This will show you which environments are available. Choose the appropriate one and then:

  4. conda activate --stack <env>: This will activate the specified environment, stacking this on top of the base environment so that you can continue to use hg and git-annex.

  5. conda deactivate: Do this a couple of times when you are done to deactivate your environments.

  6. module unload conda: Optionally, unload the conda module.

Note: you do not need to use the undocumented --stack feature for just running code: conda activate <env> will be fine.
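Putting these steps together, a typical session might look like the following (the environment name work3 is just an example):

module load conda
conda activate                 # Base environment with hg and git-annex.
conda env list                 # See which environments are available.
conda activate --stack work3   # Stack your environment on top of base.
...                            # Do your work.
conda deactivate
conda deactivate               # Once per activated environment.
module unload conda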

Primary Conda Environments (OLD)

conda create -y -n work2 python=2
conda install -y -n work2 anaconda
conda update -y -n work2 --all
conda install -y -n work2 accelerate

conda create -y -n work3 python=3
conda install -y -n work3 anaconda
conda update -y -n work3 --all
conda install -y -n work3 accelerate

for _e in work2 work3; do
    . activate $_e
    pip install ipdb \
                line_profiler \
                memory_profiler \
                snakeviz \
                uncertainties \
                xxhash \
                mmf_setup
done

module load cuda/8.0.44   # See below - install cuda and the module files first
for _e in work2 work3; do
    . activate $_e
    pip install pycuda \
                scikit-cuda
done

Once these base environments are installed, we lock the directories so that they cannot be changed accidentally.
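One way to do this locking (a sketch; I am assuming the environments live under the shared conda prefix):

# Remove write permission so the environments cannot be changed accidentally.
chmod -R a-w /data/lab/forbes/apps/conda/envs/work2 \
             /data/lab/forbes/apps/conda/envs/work3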

To use python, first load the module of your choice:

[cn14] $ module av
...
   anaconda2/2.4.0    anaconda2/4.2.0 (D)    anaconda3/2.4.0
   anaconda3/4.2.0    anaconda3/5.1.0 (D)
[cn14] $ module load anaconda3

Now you can create an environment in which to update everything.

[cn14] $ conda create -n work3 python=3
Solving environment: done

## Package Plan ##

  environment location: /home/m.forbes/.conda/envs/work3

  added / updated specs:
    - python=3

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2018.11.29         |           py37_0         146 KB
    wheel-0.33.1               |           py37_0          39 KB
    pip-19.0.3                 |           py37_0         1.8 MB
    python-3.7.2               |       h0371630_0        36.4 MB
    setuptools-40.8.0          |           py37_0         643 KB
    ------------------------------------------------------------
                                           Total:        39.0 MB

The following NEW packages will be INSTALLED:

    ca-certificates: 2019.1.23-0
    certifi:         2018.11.29-py37_0
    libedit:         3.1.20181209-hc058e9b_0
    libffi:          3.2.1-hd88cf55_4
    libgcc-ng:       8.2.0-hdf63c60_1
    libstdcxx-ng:    8.2.0-hdf63c60_1
    ncurses:         6.1-he6710b0_1
    openssl:         1.1.1b-h7b6447c_0
    pip:             19.0.3-py37_0
    python:          3.7.2-h0371630_0
    readline:        7.0-h7b6447c_5
    setuptools:      40.8.0-py37_0
    sqlite:          3.26.0-h7b6447c_0
    tk:              8.6.8-hbc83047_0
    wheel:           0.33.1-py37_0
    xz:              5.2.4-h14c3975_4
    zlib:            1.2.11-h7b6447c_3

Proceed ([y]/n)? y

Downloading and Extracting Packages
certifi-2018.11.29   | 146 KB  | ########## | 100%
wheel-0.33.1         | 39 KB   | ########## | 100%
pip-19.0.3           | 1.8 MB  | ########## | 100%
python-3.7.2         | 36.4 MB | ########## | 100%
setuptools-40.8.0    | 643 KB  | ########## | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate work3
#
# To deactivate an active environment, use:
# > source deactivate
#

Now you can activate work3 and update anaconda etc.

[cn14] $ . activate work3
(work3) [cn14] $ conda install anaconda
Solving environment: done

## Package Plan ##

  environment location: /home/m.forbes/.conda/envs/work3

  added / updated specs:
    - anaconda

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    anaconda-2018.12           |           py37_0          11 KB
    keyring-17.0.0             |           py37_0          49 KB
    dask-core-1.0.0            |           py37_0         1.2 MB
    ...
    ------------------------------------------------------------
                                           Total:       559.3 MB

The following NEW packages will be INSTALLED:

    alabaster:       0.7.12-py37_0
    anaconda:        2018.12-py37_0
    anaconda-client: 1.7.2-py37_0
    ...

The following packages will be DOWNGRADED:

    ca-certificates: 2019.1.23-0             --> 2018.03.07-0
    libedit:         3.1.20181209-hc058e9b_0 --> 3.1.20170329-h6b74fdf_2
    openssl:         1.1.1b-h7b6447c_0       --> 1.1.1a-h7b6447c_0
    pip:             19.0.3-py37_0           --> 18.1-py37_0
    python:          3.7.2-h0371630_0        --> 3.7.1-h0371630_7
    setuptools:      40.8.0-py37_0           --> 40.6.3-py37_0
    wheel:           0.33.1-py37_0           --> 0.32.3-py37_0

Proceed ([y]/n)? y

Downloading and Extracting Packages
anaconda-2018.12     | 11 KB   | ########## | 100%
...

(work3) $ du -sh .conda/envs/*
36M   .conda
(work2) $ du -sh /opt/apps/anaconda2/4.2.0/
2.2G  /opt/apps/anaconda2/4.2.0/

Some files are installed, but most are linked so this does not create much of a burden.

Issues

The currently recommended approach for setting up conda is to source the file .../conda/etc/profile.d/conda.sh. This does not work well with the module system, so I had to write a custom module file that does what this file does. This may get better in the future as the relevant upstream conda issues are resolved.


Inspecting the Cluster

Sometimes you might want to see what is happening with the cluster and various jobs.

Queue

To see what jobs have been submitted use the squeue command.

squeue

Nodes

Suppose you are running on a node and performance seems to be poor. It might be that you are overusing the resources you have requested. To see this, you can log into the node and use the top command. For example:

$ squeue -u m.forbes
  JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
 661259    kamiak  idv4807  m.forbes  R       2:41      1 cn94
$ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %C %R" -w cn94
  JOBID PARTITION     NAME     USER ST       TIME  NODES CPUS NODELIST(REASON)
 653445    kamiak     SCR5     l...  R 4-11:28:12      1    4 cn94
 653448    kamiak    SCR18     l...  R 3-12:59:43      1    8 cn94
 654674    kamiak    SCR10     l...  R 2-06:26:03      1    4 cn94
 654675    kamiak    SCR12     l...  R 2-06:26:03      1    4 cn94
 659459    kamiak    meme1     e...  R 2-06:26:03      1    1 cn94
 660544    kamiak    meme2     e...  R 3-08:20:33      1    1 cn94
 661259    kamiak  idv4807     m...  R       7:17      1    5 cn94

This tells us that I have 1 job running on node cn94 which requested 5 CPUs, while user l... is running 4 jobs having requested a total of 20 CPUs, and user e... is running 2 jobs, having requested 1 CPU each. (Note: to see the number of CPUs, I needed to manually adjust the format string as described in the manual.)

Node Capabilities

To see what the compute capabilities of the node are, you can use the lscpu command:

[cn94] $ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                28
On-line CPU(s) list:   0-27
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping:              1
CPU MHz:               2404.687
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              3990.80
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

This tells us some information about the node, including that there are 14 cores per socket and 2 sockets, for a total of 28 cores on the node, so the 27 requested CPUs above should run fine.

Node Usage

To see what is actually happening on the node, we can log in and run top:

$ ssh cn94
$ top -n 1
Tasks: 772 total,  14 running, 758 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.5 us,  0.1 sy,  0.0 ni, 53.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13172199+total, 10478241+free, 23872636 used,  3066944 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 10730780+avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  20936 e...      20   0 1335244 960616   1144 R   3.6  0.7   4769:39 meme
  30839 l...      20   0 1350228 0.995g   7952 R   3.6  0.8 236:38.54 R
  30853 l...      20   0 1350228 0.993g   7952 R   3.6  0.8 236:38.75 R
  30862 l...      20   0 1350228 0.995g   7952 R   3.6  0.8 236:37.37 R
 122856 l...      20   0 1989708 1.586g   7988 R   3.6  1.3   1452:29 R
 122865 l...      20   0 1989704 1.585g   7988 R   3.6  1.3   1452:25 R
 124397 l...      20   0 1885432 1.514g   7988 R   3.6  1.2   1434:18 R
 124410 l...      20   0 1885428 1.514g   7988 R   3.6  1.2   1434:17 R
 124419 l...      20   0 1885428 1.514g   7988 R   3.6  1.2   1434:17 R
  26811 l...      20   0 2710944 2.259g   7988 R   3.6  1.8   2595:41 R
  26833 l...      20   0 2710940 2.262g   7988 R   3.6  1.8   2595:51 R
 122847 l...      20   0 1989700 1.585g   7988 R   3.6  1.3   1452:29 R
 170160 e...      20   0 1150992 776276   1140 R   3.6  0.6   3216:06 meme
  50214 m.forbes  20   0  168700   3032   1612 S   0.0  0.0   0:02.60 top

Here I am just looking with top, but the other users are running 13 processes that are each using a full CPU on the node. The 3.6% = 1/28, since the node has 28 CPUs. (To see this view, you might have to press "Shift-I" while running top to disable Irix mode. If you want to save this as the default, press "Shift-W" which will write the defaults to your ~/.toprc file.)

Note: there are several key-stroke commands you can use while running top to adjust the display. When two options are available, the lower-case version affects the listing below for each process, while the upper-case version affects the top summary line:

  • e/E: Changes the memory units.

  • I: Irix mode - toggles between CPU usage as a % of node capability vs as a % of CPU capability.

Software

Modules

To find out which modules exist, run module avail:

[cn112] $ module avail

----------------------------------------- Compilers ------------------------------------------
   StdEnv    (L)    gcc/6.1.0                intel/xe_2016_update3 (L,D)
   gcc/4.9.3        gcc/7.3.0         (D)    intel/16.2
   gcc/5.2.0        intel/xe_2016_update2    intel/16.3

------------------------------- intel/xe_2016_update3 Software --------------------------------
   bazel/0.4.2        espresso/5.3.0       (D)    hdf5/1.10.2         nwchem/6.8   (D)
   cmake/3.7.2        espresso/6.3.0              lammps/16feb16      octave/4.0.1
   corset/1.06        fftw/3.3.4                  mvapich2/2.2        siesta/4.0_mpi
   dmtcp/2.5.2        gromacs/2016.2_mdrun        netcdf/4     (D)    stacks/1.44
   eems/8ee979b       gromacs/2016.2_mpi   (D)    netcdf/4.6.1        stacks/2.2   (D)
   elpa/2016.05.003   hdf5/1.8.16          (D)    nwchem/6.6

--------------------------------------- Other Software ----------------------------------------
   anaconda2/2.4.0          git/2.6.3                   python/2.7.10    (D)
   anaconda2/4.2.0   (D)    globus/6.0                  python/2.7.15
   anaconda3/2.4.0          google_sparsehash/4cb9240   python2/2.7.10   (D)
   anaconda3/4.2.0          graphicsmagick/1.3.10       python2/2.7.15
   anaconda3/5.1.0   (D)    grass/6.4.6                 python3/3.4.3
   angsd/9.21               grass/7.0.5                 python3/3.5.0
   armadillo/8.5.1          grass/7.6.0          (D)    python3/3.6.5    (D)
   arpack/3.6.0             gsl/2.1                     qgis/2.14.15
   bamaddrg/1.0             hisat2/2.1.0                qgis/3.4.4       (D)
   bamtools/2.4.1           htslib/1.8                  qscintilla/2.9.4
   bcftools/1.6             imagemagick/7.0.7-25        qscintilla/2.10  (D)
   beagle/3.0.2             interproscan/5.27.66        r/3.2.2
   beast/1.8.4              iperf/3.1.3                 r/3.3.0
   beast/1.10.0      (D)    java/oracle_1.8.0_92 (D)    r/3.4.0
   bedtools/2.27.1          java/11.0.1                 r/3.4.3
   binutils/2.25.1          jellyfish/2.2.10            r/3.5.1
   blast/2.2.26             jemalloc/3.6.0              r/3.5.2          (D)
   blast/2.7.1       (D)    jemalloc/4.4.0       (D)    rampart/0.12.2
   bonnie++/1.03e           laszip/2.2.0                repeatmasker/4.0.7
   boost/1.59.0             ldhot/1.0                   rmblast/2.2.28
   bowtie/1.1.2             libgeotiff/1.4.0            rmblast/2.6.0    (D)
   bowtie2/2.3.4            libint/1.1.4                rsem/1.3.1
   bowtie2/2.3.4.3   (D)    libkml/1.3.0                salmon/0.11.3
   bwa/0.7.17               liblas/1.8.0                samtools/1.3.1
   canu/1.3                 libspatialite/4.3.0a        samtools/1.6
   cast/dbf2ec2             libxsmm/1.4.4               samtools/1.9     (D)
   ccp4/7.0                 libzip/1.5.1                settarg/6.0.1
   cellranger/2.1.0         lmod/6.0.1                  shelx/2016.1
   cellranger/3.0.2  (D)    lobster/2.1.0               shore/0.9.3
   centrifuge/1.0.4         matlab/r2018a               shoremap/3.4
   cp2k/4.1_pre_openmp      matlab/r2018b        (D)    singularity/2.3.1
   cp2k/4.1_pre_serial      mercurial/3.7.3-1           singularity/2.4.2
   cp2k/4.1          (D)    mesa/17.0.0                 singularity/3.0.0 (D)
   cuda/7.5                 migrate/3.6.11              smbnetfs/0.6.0
   cuda/7.5.18              miniconda3/3.6              sqlite3/3.25.1
   cuda/8.0.44              mocat2/2.0                  sratoolkit/2.8.0
   cuda/9.0.176             mothur/1.40.5               stringtie/1.3.5
   cuda/9.1.85       (D)    music/4.0                   superlu/4.3_dist
   cudnn/4_cuda7.0+         mysql/8.0.11                superlu/5.2.1
   cudnn/5.1_cuda7.5        mzmine/2.23                 superlu/5.4_dist (D)
   cudnn/5.1_cuda8.0        namd/2.12_ib                svn/2.7.10
   cudnn/6.0_cuda8.0        namd/2.12_smp               swig/3.0.12
   cudnn/7.0_cuda9.1        namd/2.12            (D)    tassel/3.0
   cudnn/7.1.2_cuda9.0      netapp/5.4p1                tcl-tk/8.5.19
   cudnn/7.1.2_cuda9.1 (D)  netapp/5.5           (D)    texinfo/6.5
   cufflinks/2.2.1          octave/4.2.0                texlive/2018
   dislin/11.0              octave/4.4.0                tiff/3.9.4
   dropcache/master         octave/4.4.1         (D)    tophat/2.1.1
   eigan/3.3.2              openblas/0.2.18_barcelona   towhee/7.2.0
   emboss/6.6.0             openblas/0.2.18_haswell     trimmomatic/0.38
   exonerate/2.2            openblas/0.2.18             trinity/2.2.0
   exonerate/2.4     (D)    openblas/0.3.0       (D)    trinity/2.8.4    (D)
   fastqc/0.11.8            orangefs/2.9.6              underworld/1.0
   fastx_toolkit/0.0.14     parallel/3.22               underworld2/2.5.1
   freebayes/1.1.0          parallel/2018.10.22  (D)    underworld2/2.6.0dev (D)
   freebayes/1.2.0   (D)    parflow/3.2.0               valgrind/3.11.0
   freetype/2.7.1           parmetis/4.0.3              vcflib/1.0.0-rc2
   freexl/1.0.2             paxutils/2.3                vcftools/0.1.16
   gatk/3.8.0               perl/5.24.1          (D)    vmd/1.9.3
   gdal/2.0.0               perl/5.28.0                 workspace_maker/master (L,D)
   gdal/2.1.0               pexsi/0.9.2                 workspace_maker/1.1b
   gdal/2.3.1        (D)    phenix/1.13                 workspace_maker/1.1
   gdb/7.10.1               picard/2.18.6               workspace_maker/1.2
   geos/3.5.0               proj/4.9.2                  wrf/3.9.1
   geos/3.6.2        (D)    proj/5.1.0           (D)    zlib/1.2.11

-------------------------------------- Licensed Software --------------------------------------
   amber/16                       clc_genomics_workbench/8.5.1 (D)   green/1.0
   buster/17.1                    dl_polly/4.08                      stata/14
   clc_genomics_workbench/6.0.1   gaussian/09.d.01                   vasp/5.4.4

  Where:
   L:  Module is loaded
   D:  Default Module

Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".

You can also use module spider for searching. For example, to find all the modules related to conda you could run:

[cn112] $ module -r spider ".*conda.*"

----------------------------------------------------------------------------
  anaconda2:
----------------------------------------------------------------------------
    Description:
      Anaconda is a freemium distribution of the Python programming language
      for large-scale data processing, predictive analytics, and scientific
      computing.

     Versions:
        anaconda2/2.4.0
        anaconda2/4.2.0

----------------------------------------------------------------------------
  For detailed information about a specific "anaconda2" module (including
  how to load the modules) use the module's full name. For example:

     $ module spider anaconda2/4.2.0
----------------------------------------------------------------------------

----------------------------------------------------------------------------
  anaconda3:
----------------------------------------------------------------------------
    Description:
      Anaconda is a distribution of the Python programming language that
      includes the Python interpeter, as well as Conda which is a package
      and virtual environment manager, and a large collection of Python
      scientific packages. Anaconda3 uses python3, which it also calls
      python. Anaconda Navigator contains Jupyter Notebook and the Spyder
      IDE.

     Versions:
        anaconda3/2.4.0
        anaconda3/4.2.0
        anaconda3/5.1.0

----------------------------------------------------------------------------
  For detailed information about a specific "anaconda3" module (including
  how to load the modules) use the module's full name. For example:

     $ module spider anaconda3/5.1.0
----------------------------------------------------------------------------

----------------------------------------------------------------------------
  conda: conda
----------------------------------------------------------------------------
    Description:
      Michael Forbes custom Conda environment.

    This module can be loaded directly: module load conda

----------------------------------------------------------------------------
  miniconda3: miniconda3/3.6
----------------------------------------------------------------------------
    Description:
      Miniconda is a distribution of the Python programming language that
      includes the Python interpeter, as well as Conda which is a package
      and virtual environment manager. Miniconda3 uses python3, which it
      also calls python.

    You will need to load all module(s) on any one of the lines below before
    the "miniconda3/3.6" module is available to load.

      gcc/4.9.3  gcc/5.2.0  gcc/6.1.0  gcc/7.3.0  intel/16.2  intel/16.3
      intel/xe_2016_update2  intel/xe_2016_update3

    Help:
      For further information, see: https://conda.io/miniconda.html

      To create a local environment using the conda package manager:
        conda create -n myenv
      To use the local environment:
        source activate myenv
      To install packages into your local environment:
        conda install somePackage
      To install packages via pip:
        conda install pip
        pip install somePackage
      When installing, the "Failed to create lock" message can be ignored.

      Miniconda3 uses python3, which it also calls python. To use a
      different version for the name python:
        conda install python=2

To inspect the actual module file (for example, if you would like to make your own based on this) you can use the module show command:

$ module show anaconda3
------------------------------------------------------
/opt/apps/modulefiles/Other/anaconda3/5.1.0.lua:
------------------------------------------------------
whatis("Description: Anaconda is a distribution of the Python programming language...")
help([[For further information...]])
family("conda")
family("python2")
family("python3")
prepend_path("PATH","/opt/apps/anaconda3/5.1.0/bin")
prepend_path("LD_LIBRARY_PATH","/opt/apps/anaconda3/5.1.0/lib")
prepend_path("LIBRARY_PATH","/opt/apps/anaconda3/5.1.0/lib")
prepend_path("CPATH","/opt/apps/anaconda3/5.1.0/include")
prepend_path("MANPATH","/opt/apps/anaconda3/5.1.0/share/man")

Running Jobs

Before you consider running a job, you need to profile your code to determine the following:

  • How many nodes and how many cores-per-node do you need?

  • How much memory do you need per node?

  • How long will your program run?

  • What modules do you need to load to run your code?

  • What packages need to be installed to run your code?

Once you have this information, make sure that your code is committed to a repository, then clone this repository to Kamiak. Whenever you perform a serious calculation, you should make sure you are running from a clean checkout of a repository with a well-defined set of libraries installed so that your runs are reproducible. This information should be stored alongside your data so that you know exactly what version of your code produced the data.
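For example, one way to store this provenance alongside your data (the run directory here is hypothetical):

run_dir=~/runs/2019-05-01_run0                        # Hypothetical run directory.
mkdir -p "${run_dir}"
hg id --id > "${run_dir}/HG_VERSION"                  # Exact revision of the code.
conda list --explicit > "${run_dir}/environment.txt"  # Exact packages installed.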

Here are my recommended steps.

  1. Run an interactive session.

  2. Log in directly to the node so that your SSH agent gets forwarded.

  3. Checkout your code into a repository.

    mkdir ~/repositories
    cd ~/repositories
    hg clone ...
  4. Link your run folder to ~/now.

  5. Make a SLURM file in ~/runs.

#!/bin/bash
#SBATCH --partition=kamiak      ### Partition (like a queue in PBS)
#SBATCH --job-name=HiWorld      ### Job Name
#SBATCH --output=Hi.out         ### File in which to store job output
#SBATCH --error=Hi.err          ### File in which to store job error messages
#SBATCH --time=0-00:01:00       ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1               ### Node count required for the job
#SBATCH --ntasks-per-node=1     ### Number of tasks to be launched per Node

./hello
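Submit it with sbatch and check the output once it has run (the file name hello.sbatch is hypothetical):

sbatch hello.sbatch   # Prints "Submitted batch job <jobid>".
cat Hi.out            # Job output, once the job has completed.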

Issues

Interactive Jobs do not ForwardAgent

Jupyter Notebook: Tunnel not working

For some reason, trying to tunnel to compute nodes fails. It might be that administrative settings disallow TCP forwarding through tunnels, or it might be something with the multi-hop connection.

Mercurial and Conda

I tried the usual approach of putting mercurial in the conda base environment, but when running conda, mercurial cannot be found. Instead, one needs to load the mercurial module. I need to see if this will work with mmfhg.

Permissions

Building and Installing Software

The following describes how I have built and installed various pieces of software. You should not do this - just use the software as described above. However, this information may be useful if you need to install your own software.

#mkdir -p /data/lab/forbes    # Provided by system.
ln -s /data/lab/forbes ~/data
mkdir -p ~/data/modules
ln -s ~/data/modules ~/.modules
mkdir -p ~/data/bashrc.d

cat > ~/data/bashrc.d/inputrc <<EOF
# Link to ~/.inputrc
"\M-[A": history-search-backward
"\M-[B": history-search-forward
"\e[A": history-search-backward
"\e[B": history-search-forward
EOF

cat > ~/data/bashrc.d/bash_alias <<EOF
# Link to ~/.bash_alias
# User specific aliases and functions
export INPUTRC=~/.inputrc

# Custom module files
export MODULEPATH="~/.modules:~/data/modules:/data/lab/forbes/modules:${MODULEPATH}"

# Load the conda module which has mercurial
module load conda
EOF

Conda

Our base conda environment is based on the mforbes/base environment and includes:

  • Mercurial, with topics and the hg-git bridge.

  • Black

  • Anaconda Project

  • Poetry

  • mmf-setup

  • nox and nox-poetry

chmod a+rx /data/lab/forbes/
module load intel/xe_2016_update3
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -f -p /data/lab/forbes/apps/conda
rm Miniconda2-latest-Linux-x86_64.sh

cat > /data/lab/forbes/apps/conda/.condarc <<EOF
# System configuration override.
channels:
  - mforbes
  - defaults
  #- conda-forge    # Don't do this by default -- too slow

#create_default_packages:
#  - ipykernel      # No point until forwarding works.
EOF

To create and update environments:

module load conda    # Requires conda.lua below
conda activate base
conda install anaconda-client
conda env update mforbes/base
conda deactivate
conda env update -n jupyter mforbes/jupyter
conda env update -n work mforbes/work
conda env create mforbes/_gpe

conda.lua

cat > ~/.modules/conda.lua <<EOF
-- -*- lua -*-
whatis("Description: Michael Forbes custom Conda environment.")
setenv("_CONDA_EXE", "/data/lab/forbes/apps/conda/bin/conda")
setenv("_CONDA_ROOT", "/data/lab/forbes/apps/conda")
setenv("CONDA_SHLVL", "0")
set_shell_function("_conda_activate",
    [[ if [ -n "${CONDA_PS1_BACKUP:+x}" ]; then
           PS1="$CONDA_PS1_BACKUP";
           \unset CONDA_PS1_BACKUP;
       fi;
       \local ask_conda;
       ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix activate "$@")" || \return $?;
       \eval "$ask_conda";
       \hash -r]],
    "")
set_shell_function("_conda_deactivate",
    [[ \local ask_conda;
       ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix deactivate "$@")" || \return $?;
       \eval "$ask_conda";
       \hash -r]],
    "")
set_shell_function("_conda_reactivate",
    [[ \local ask_conda;
       ask_conda="$(PS1="$PS1" $_CONDA_EXE shell.posix reactivate)" || \return $?;
       \eval "$ask_conda";
       \hash -r]],
    "")
set_shell_function("conda",
    [[ if [ "$#" -lt 1 ]; then
           $_CONDA_EXE;
       else
           \local cmd="$1";
           shift;
           case "$cmd" in
               activate)
                   _conda_activate "$@";
                   ;;
               deactivate)
                   _conda_deactivate "$@";
                   ;;
               install|update|uninstall|remove)
                   $_CONDA_EXE "$cmd" "$@" && _conda_reactivate;
                   ;;
               *)
                   $_CONDA_EXE "$cmd" "$@";
                   ;;
           esac
       fi]],
    "echo Conda C")
-- prepend_path("PATH", "/data/lab/forbes/apps/conda/bin")
-- prepend_path("LD_LIBRARY_PATH", "~/data/apps/conda/lib")
always_load("intel/xe_2016_update3")
family("conda")
family("python2")
family("python3")
--[[ Build:
# mkdir -p /data/lab/forbes/
# module load intel/xe_2016_update3
# wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
# bash Miniconda2-latest-Linux-x86_64.sh -b -f -p /data/lab/forbes/apps/conda
# rm Miniconda2-latest-Linux-x86_64.sh
--]]
EOF

MyRepos

data="/data/lab/forbes" mkdir -p "${data}/repositories" git clone git://myrepos.branchable.com/ "${data}/repositories/myrepos" mkdir -p "${data}/apps/myrepos/bin" ln -s "${data}/repositories/myrepos/mr" "${data}/apps/myrepos/bin/"
cat > ~/.modules/mr.lua <<EOF
-- -*- lua -*-
whatis("Description: myrepos (mr) Multiple repository management: https://myrepos.branchable.com.")
prepend_path("PATH", "/data/lab/forbes/apps/myrepos/bin/")
--[[ Build:
data="/data/lab/forbes"
mkdir -p "${data}/repositories"
git clone git://myrepos.branchable.com/ "${data}/repositories/myrepos"
mkdir -p "${data}/apps/myrepos/bin"
ln -s "${data}/repositories/myrepos/mr" "${data}/apps/myrepos/bin/"
--]]
EOF

mmfhg

data="/data/lab/forbes" module load conda module load mr conda activate mkdir -p "${data}/repositories" hg clone ssh://hg@bitbucket.org/mforbes/mmfhg "${data}/repositories/mmfhg" cd "${data}/repositories/mmfhg" make install cat >> "${data}/bashrc.d/bash_alias" <<EOF export MMFHG=/data/lab/forbes/repositories/mmfhg export HGRCPATH="\${HGRCPATH}:\${MMFHG}/hgrc" . "\${MMFHG}/src/bash/completions.bash" EOF

To Do

Get these working: mmfhg, mmfutils, mmf_setup, hgrc, mr, git-annex.

Questions

Kamiak

How to forward an SSH port to a compute node?

How to use a SLURM script to configure the environment and run interactive sessions?

Conda: best way to setup environments?

Some options:

  • Install environments on local scratch directory (only good for single node jobs).

  • Install into ~ but redirect the conda package dir to local or scratch. (This makes sure we can still use the current package cache.)

  • Install in global scratch which is good for 2 weeks.

  1. Cloning the base environment? In principle this should allow one to reuse much of the installed material, but in practice it seems like everything gets downloaded again.

    • First remove my conda stuff from my .bashrc file.

    • Initial attempt. Install a package that is not in the installed anaconda distribution:

      module load anaconda3
      conda install -c conda-forge uncertainties   # Takes a long time...
    • Try creating a clone environment with conda create -n mmf --clone base. This is not a good option as it downloads a ton of stuff into ~/.conda/envs/mmf and ~/.conda/pkgs.

$ module load anaconda3
$ conda env list
# conda environments:
#
                         /data/lab/forbes/apps/conda
                         /data/lab/forbes/apps/conda/envs/_gpe
                         /data/lab/forbes/apps/conda/envs/jupyter
                         /data/lab/forbes/apps/conda/envs/work2
work3                    /home/m.forbes/.conda/envs/work3
base                  *  /opt/apps/anaconda3/5.1.0

$ conda create -n mmf --clone base
Source:      /opt/apps/anaconda3/5.1.0
Destination: /home/m.forbes/.conda/envs/mmf
The following packages cannot be cloned out of the root environment:
 - conda-env-2.6.0-h36134e3_1
 - conda-4.5.12-py36_1000
 - conda-build-3.4.1-py36_0
Packages: 270
Files: 4448
  2. Stack on top of another environment? This is an undocumented feature that allows you to stack environments. After playing with it a bit, however, it seems like it would only be useful for different applications, not for augmenting a python library.

$ conda install -c conda-forge -n mmf_stack uncertainties --no-deps

This fails because it does not install python. The previous python is used and it cannot see the new uncertainties package.

$ conda config --set max_shlvl 6   # Allows stacking
$ time conda create -n mmf_stack   # Create environment for stacking.
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.5.12
  latest version: 4.6.14

Please update conda by running

    $ conda update -n base conda

## Package Plan ##

  environment location: /home/m.forbes/.conda/envs/mmf_stack

Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use:
# > source activate mmf_stack
#
# To deactivate an active environment, use:
# > source deactivate
#

$ . /opt/apps/anaconda3/5.1.0/etc/profile.d/conda.sh   # Source since module does not install anaconda properly.
$ conda activate mmf_stack
(mmf_stack) $ conda install -c conda-forge uncertainties
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.5.12
  latest version: 4.6.14

Please update conda by running

    $ conda update -n base conda

## Package Plan ##

  environment location: /home/m.forbes/.conda/envs/mmf_stack

  added / updated specs:
    - uncertainties

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libblas-3.8.0              |       8_openblas           6 KB  conda-forge
    tk-8.6.9                   |    h84994c4_1001         3.2 MB  conda-forge
    wheel-0.33.1               |           py37_0          34 KB  conda-forge
    liblapack-3.8.0            |       8_openblas           6 KB  conda-forge
    setuptools-41.0.1          |           py37_0         616 KB  conda-forge
    uncertainties-3.0.3        |        py37_1000         116 KB  conda-forge
    libffi-3.2.1               |    he1b5a44_1006          46 KB  conda-forge
    bzip2-1.0.6                |    h14c3975_1002         415 KB  conda-forge
    numpy-1.16.3               |   py37he5ce36f_0         4.3 MB  conda-forge
    zlib-1.2.11                |    h14c3975_1004         101 KB  conda-forge
    pip-19.1                   |           py37_0         1.8 MB  conda-forge
    openblas-0.3.6             |       h6e990d7_1        15.8 MB  conda-forge
    xz-5.2.4                   |    h14c3975_1001         366 KB  conda-forge
    sqlite-3.26.0              |    h67949de_1001         1.9 MB  conda-forge
    openssl-1.1.1b             |       h14c3975_1         4.0 MB  conda-forge
    certifi-2019.3.9           |           py37_0         149 KB  conda-forge
    libcblas-3.8.0             |       8_openblas           6 KB  conda-forge
    readline-7.0               |    hf8c457e_1001         391 KB  conda-forge
    ncurses-6.1                |    hf484d3e_1002         1.3 MB  conda-forge
    python-3.7.3               |       h5b0a415_0        35.7 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        70.2 MB

The following NEW packages will be INSTALLED:

    bzip2:           1.0.6-h14c3975_1002   conda-forge
    ca-certificates: 2019.3.9-hecc5488_0   conda-forge
    certifi:         2019.3.9-py37_0       conda-forge
    libblas:         3.8.0-8_openblas      conda-forge
    libcblas:        3.8.0-8_openblas      conda-forge
    libffi:          3.2.1-he1b5a44_1006   conda-forge
    libgcc-ng:       8.2.0-hdf63c60_1
    libgfortran-ng:  7.3.0-hdf63c60_0
    liblapack:       3.8.0-8_openblas      conda-forge
    libstdcxx-ng:    8.2.0-hdf63c60_1
    ncurses:         6.1-hf484d3e_1002     conda-forge
    numpy:           1.16.3-py37he5ce36f_0 conda-forge
    openblas:        0.3.6-h6e990d7_1      conda-forge
    openssl:         1.1.1b-h14c3975_1     conda-forge
    pip:             19.1-py37_0           conda-forge
    python:          3.7.3-h5b0a415_0      conda-forge
    readline:        7.0-hf8c457e_1001     conda-forge
    setuptools:      41.0.1-py37_0         conda-forge
    sqlite:          3.26.0-h67949de_1001  conda-forge
    tk:              8.6.9-h84994c4_1001   conda-forge
    uncertainties:   3.0.3-py37_1000       conda-forge
    wheel:           0.33.1-py37_0         conda-forge
    xz:              5.2.4-h14c3975_1001   conda-forge
    zlib:            1.2.11-h14c3975_1004  conda-forge

Proceed ([y]/n)?

Presumably people can update software.

  • Currently it seems I need to use my own conda (until anaconda 4.4.0)

Programming

How to profile simple GPU code?

$ module load conda
$ hg clone ssh://hg@bitbucket.org/mforbes/cugpe ~/work/mmfbb/cugpe
$ cd current
$ ln -s ~/work/mmfbb/cugpe cugpe
$ cd cugpe
$ module load cuda
$ conda env update -f environment.cugpe.yml -p /data/lab/forbes/apps/conda/envs/cugpe

Investigations

Here we include some experiments run on Kamiak to see how long various things take. These results may change as the system undergoes transformations, so this information may be out of date.

Conda

Here we investigate the timing of creating some conda environments using the user's home directory vs /scratch, vs /local:

Home

$ time conda create -y -n mmf0 python=3    # Includes downloading packages
real    3m32.787s
$ time conda create -y -n mmf1 python=3    # Using downloaded packages
real    1m0.429s
$ time conda create -y -n mmf1c --clone mmf0
real    0m56.507s

$ du -sh ~/.conda/envs/*
182M  /home/m.forbes/.conda/mmf0
59M   /home/m.forbes/.conda/mmf1
59M   /home/m.forbes/.conda/mmf1c
$ du -shl ~/.conda/envs/*
182M  /home/m.forbes/.conda/mmf0
182M  /home/m.forbes/.conda/mmf1
182M  /home/m.forbes/.conda/mmf1c
$ du -sh ~/.conda/pkgs/
341M  /home/m.forbes/.conda/pkgs/

From this we see that there is some space saving from the use of hard-links. Note that the packages also take up quite a bit of space.

$ time rm -r envs pkgs/
real    1m2.734s

Scratch

mkworkspace -n m.forbes_conda
mkdir /scratch/m.forbes_conda/envs
mkdir /scratch/m.forbes_conda/pkgs
ln -s /scratch/m.forbes_conda/envs ~/.conda/
ln -s /scratch/m.forbes_conda/pkgs ~/.conda/

$ time conda create -y -n mmf0 python=3    # Includes downloading packages
real    2m16.052s
$ time conda create -y -n mmf1 python=3    # Using downloaded packages
real    0m35.337s
$ time conda create -y -n mmf1c --clone mmf0
real    0m27.982s

$ time rm -r /scratch/m.forbes_conda/envs /scratch/m.forbes_conda/pkgs/
real    0m45.193s

Local

mkworkspace -n m.forbes_conda --backend=/local
mkdir /local/m.forbes_conda/envs
mkdir /local/m.forbes_conda/pkgs
ln -s /local/m.forbes_conda/envs ~/.conda/
ln -s /local/m.forbes_conda/pkgs ~/.conda/

$ time conda create -y -n mmf0 python=3    # Includes downloading packages
real    0m45.948s
$ time conda create -y -n mmf1 python=3    # Using downloaded packages
real    0m10.670s
$ time conda create -y -n mmf1c --clone mmf0
real    1m42.742s

$ time rm -r /local/m.forbes_conda/envs/ /local/m.forbes_conda/pkgs/
real    0m0.387s

Home/Local

mkworkspace -n m.forbes_conda --backend=/local
mkdir /local/scratch/m.forbes_conda/pkgs
ln -s /local/scratch/m.forbes_conda/pkgs ~/.conda/

$ time conda create -y -n mmf0 python=3    # Includes downloading packages
real    1m58.410s
$ time conda create -y -n mmf1 python=3    # Using downloaded packages
real    1m41.889s
real    1m39.003s
$ time conda create -y -n mmf1c --clone mmf0
real    1m42.742s

$ time rm -r /local/m.forbes_conda/envs/ /local/m.forbes_conda/pkgs/
real    0m0.387s

Local -> Home

$ my_workspace="$(mkworkspace -n m.forbes_conda --backend=/local --quiet)"
$ export CONDA_PKGS_DIRS="${my_workspace}/pkgs"
$ conda_prefix="${my_workspace}/current_conda_env"
$ time conda create -y --prefix "${conda_prefix}" python=3
real    0m16.295s
$ time conda create -y --prefix ~/clone_env --clone "${conda_prefix}"
real    0m49.573s
$ time conda create -y --prefix ~/clone_env2 python=3
real    0m44.628s

$ my_workspace="$(mkworkspace -n m.forbes_conda --backend=/local --quiet)"
$ export CONDA_PKGS_DIRS="${my_workspace}/pkgs"
$ conda_prefix="${my_workspace}/current_conda_env"
$ time conda env create --prefix "${conda_prefix}" mforbes/work
real    0m16.295s
$ time conda create -y --prefix ~/clone_env_work --clone "${conda_prefix}"
$ time conda env create --prefix ~/clone_env_work2 mforbes/work
real    14m21.985s
$ time conda create -y --prefix ~/clone_env --clone "${conda_prefix}"