CITerra/Fram New User Guide

What it is

Fram is a high-performance computing cluster built for the Geological and Planetary Sciences (GPS) division at Caltech. It is the latest of many clusters used by the division for analysis and simulation in the earth sciences.

Specifications

314 HP SL390 Compute Nodes, of which 60 are GPU enabled
12 cores available on each node (hyper-threading disabled)
Processors: Dual Westmere X5650 processors running at 2.67 GHz

GPU Nodes contain 3 Nvidia M2090s, each with 6GB of memory
180 M2090 GPUs total

InfiniBand interconnect
Voltaire 324-port IB switch connected with fiber-based IB cables
QDR InfiniBand (HCAs are built into the SL390 nodes)

DDN Storage
Raw space: 720TB
Usable Space: ~500TB
Filesystem: Lustre
2 Metadata Servers
4 Object Store Servers
2 Exascaler head nodes
240 7200RPM 3 TB Drives
Peak IO Client Bandwidth:  9.5 GB/s

Why it is called Fram

Fram was the name of a ship commissioned by Fridtjof Nansen for exploration of the Arctic. Nansen was a Norwegian explorer, scientist, diplomat, humanitarian, and Nobel Peace Prize laureate. Nansen believed he could reach the North Pole by starting in the east and using the natural currents to take him there. The Fram was specifically built so that it could be frozen into the ice without damage and could carry enough fuel and provisions for 12 men for up to 5 years.

Fram was used by Otto Sverdrup (related to the famous oceanographer Harald Sverdrup) from 1898 to 1902 to chart the Arctic islands and to sample the rocks, animals, and plants. Then from 1910 to 1912 it was used by Roald Amundsen on the expedition that was the first to reach the South Pole.

Getting an account

To get an account on fram, ask your PI to send an email to help-hpc@caltech.edu requesting one for you. If your group is already using the cluster, this is a very straightforward process. You will also need a GPS account first; if you do not have one, you can email help@gps.caltech.edu to request one.

Connecting to the cluster

You will need to ssh to fram from a GPS machine to get access.

If using a Linux computer, simply open a terminal and type "ssh -l username fram.gps.caltech.edu".

If using Mac OS X, you can open the Terminal application in /Applications/Utilities. From there you can also type "ssh -l username fram.gps.caltech.edu".

If running Windows, you will need to get an ssh client. One commonly used ssh application is PuTTY. You can get a copy here.

When you ssh into fram, you will actually be connected to one of three login nodes, not the headnode itself. From here you can do any interactive work that you need. Most notably, you will be compiling your applications and submitting jobs from these nodes. Make sure that you do not run jobs on the headnode; this will slow down interactive use for everyone.

Your first connection to the cluster

You should be connecting to fram from a computer on the GPS network or via VPN. Access is blocked from elsewhere. If you require VPN access for your Caltech account, please see this link. If you are not in GPS and need to access the cluster from another location, email your IP address to help-hpc@caltech.edu and we may be able to open it up if appropriate.

When you initially connect to the cluster, you will likely be asked to generate an ssh key pair. You can do this simply by hitting Enter a few times until you are back at a prompt. This will create a passphraseless key for use within the cluster. All jobs are launched using ssh, so this is an important step; without it, your jobs will not be able to spawn themselves across nodes without a password. While it is important to use a passphraseless key within the cluster to ease running jobs across nodes, it is extremely important not to use this key elsewhere, as passphraseless keys are generally a bad idea outside of a secured environment. Do not create passphraseless keys on remote systems to connect to fram.
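
If you prefer to generate the key manually, the following is a minimal sketch using the standard OpenSSH tools and default file locations:

# generate a passphraseless RSA key pair (equivalent to pressing Enter at the prompts)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# allow logins between cluster nodes with this key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys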

Changing your Password or shell

By default, your initial password is set to your GPS password. The two are in no way kept in sync, so if you change one, the other will not change. The default shell on the cluster is bash.

If you would like to change either of these it needs to be done on the fram headnode. To get there, either ssh to fram-master.gps.caltech.edu from a GPS computer or ssh to fram-master.local from one of the login nodes.

To change the password run the “passwd” command on the headnode. To change your shell, run “chsh” on the headnode.
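
For example, starting from one of the login nodes (the shell path below is just an illustration):

ssh fram-master.local    # hop from a login node to the headnode
passwd                   # change your cluster password
chsh -s /bin/bash        # change your login shell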

These changes will not be propagated to all nodes immediately. They will be propagated to the login nodes approximately every 15 minutes and to the compute nodes between 12:00 and 12:30 every evening. The times may vary a bit, as the sync is splayed so as not to cause too much server load at any given time.

Managing Disk Usage

Quotas are set up on the Lustre file system. They are currently set at 3 TB, but may change depending on filesystem usage. This storage is not meant for keeping all of your data; it is there to support your job runs, so please keep it clean. It is in no way backed up, so make sure that you have your data somewhere else that is backed up. To check your quota, use the following command:

lfs quota /global
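
To see the usage and limits for your own account specifically, you can also pass your username to lfs, for example:

lfs quota -u $USER /global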

Using the module environment variable management system

There are a lot of software packages compiled on the cluster. Some of these are the same application compiled in different ways; some are different versions of the same application. To be able to deal with all the packages, we have implemented the module system. This allows you to load specific packages, which adds them to your path and sets up library directories. To run MPI applications, you will likely need to use this system to choose which implementation you want to use. Some modules require that other modules be loaded; the error message that shows up will tell you which other modules you need to load. Some modules also conflict with others. Once again, the message will tell you which ones.

To see what modules are available, use:

module avail

To load a specific module:

module load openmpi/gcc

module load cuda

To unload a module previously loaded:

module unload openmpi/gcc

To see what modules you currently have loaded:

module list

If you use the same modules all the time, you can add these commands to the bottom of your ~/.bash_profile file and they will be loaded every time you log in.
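
For example, if you always use the OpenMPI and CUDA modules shown above, the end of your ~/.bash_profile might contain:

# modules to load at every login (adjust to the modules you actually use)
module load openmpi/gcc
module load cuda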

Compiling your programs

To compile your MPI programs you will need to use an MPI implementation built for InfiniBand. You can choose to use the GNU or Intel compilers. These can be found using the module environment variable management system. Here are some examples of loading MPI into your path:

module load openmpi/gcc

module load mvapich2/gcc

module load mvapich/gcc

module load intel/impi

To see a full list, use “module avail”

Use mpicc for compiling C applications, mpif90 for Fortran, and mpic++ for your C++ applications. If using Intel MPI, use mpiifort instead of mpif90 if you want to compile with the Intel compiler.
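
As a sketch, compiling a simple MPI program might look like the following; the source and executable names are placeholders:

module load openmpi/gcc
mpicc  -O2 -o my_mpi_app my_mpi_app.c    # C
mpif90 -O2 -o my_mpi_app my_mpi_app.f90  # Fortran
mpic++ -O2 -o my_mpi_app my_mpi_app.cpp  # C++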

Some applications will be more complicated to build. Typically, these will have a Makefile, which you will need to edit in order to build the software. If you run into trouble, it is usually best to ask a colleague who is already using the application. If you still have trouble, email us at help-hpc@caltech.edu and we will help as much as we can.

Creating your submission script

You will generally launch your jobs via a submission script. This will give information to the scheduler on what resources your job will require.  Let’s step through a typical script.

All scripts start with the shell they will be running in. In this case, bash:

#!/bin/bash

You will generally need to give your job a name. This will make it easier to identify in the scheduler queue.

#PBS -N  My_Job_Name

Then we want to select a queue to run in. Generally you will use the default queue. There is also a debug queue for your use; this queue has some dedicated nodes, but a shorter allowable run time and a job limit. If using GPUs in your job, you can use the gpu queue.

#PBS -q default

Then you will want to decide on the total number of cores you would like your job to run on.

#PBS -l nodes=12

This is your wall clock time. After this time has passed, your job will be terminated. The wall clock setting in every submitted job helps the scheduler predict openings in which to queue new jobs and also prevents rogue jobs from continuing to run after their allotted time slot. Using an accurate wall clock is essential in avoiding wasted compute resources due to inefficient scheduling of jobs.

#PBS -l walltime=0:03:00

You will want your job to have the same environment variables as the shell from which you are submitting it. The following option will do that.

#PBS -V

You can have the scheduler email you when your job begins, ends, or aborts. To do this, you can use the following.

#PBS -m bae -M myemail@gps.caltech.edu

We generally add the following lines to our script as it gives useful information on the output and makes setting variables easier:

echo "MPI Used:" `which mpirun`

#change the working directory (default is home directory)
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR

# Write out some information on the job
echo Running on host `hostname`
echo Time is `date`

### Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS cpus

# Tell me which nodes it is run on
echo " "
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
echo " "

Then we launch the job using mpirun with the correct options. If you added the above lines to your script, you should only have to change the executable name and its arguments.

mpirun -machinefile $PBS_NODEFILE -np $NPROCS application_name
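
Putting the pieces above together, a complete submission script might look like the following sketch. The job name, queue, core count, wall clock time, email address, and executable name are all placeholders that you will need to adapt:

#!/bin/bash
#PBS -N My_Job_Name
#PBS -q default
#PBS -l nodes=12
#PBS -l walltime=0:03:00
#PBS -V
#PBS -m bae -M myemail@gps.caltech.edu

echo "MPI Used:" `which mpirun`

# change the working directory (default is home directory)
cd $PBS_O_WORKDIR

# define the number of processors from the node file
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS cpus

# launch the MPI application
mpirun -machinefile $PBS_NODEFILE -np $NPROCS application_name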

Submitting your job

To submit your job, you can use the qsub command. If you have set up your submission script as above, you can simply run "qsub scriptname" to send your job to the queue. You can override various settings on the command line when submitting your job. Here are some switches you can add to qsub to override what is in your script:

-l walltime=hours:minutes:seconds
-N job_name
-q queue_name

There are many other options, but these are the most commonly used.  You can run “man qsub” to get a list of other options.
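
For example, to submit the script above under a different name, in the debug queue, with a 30-minute wall clock limit (the script name and values are placeholders):

qsub -N test_run -q debug -l walltime=0:30:00 scriptname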

Using the parallel system for bundling single processor jobs

It is often advantageous to bundle large numbers of single-processor jobs into one. It is also best practice to do this, as it ensures better utilization of the cluster. If you are doing this sort of work, make sure you use the parallel system.

We are now using the GNU parallel package for bundling this type of job. To do this, create a file with each command you need to run. Typically your environment variables are not copied to each subsequent node, so it is best if you set them on the command line or source a file that has them. (This includes such variables as PBS_O_WORKDIR, the working directory set in your submission script.) The file should have one line describing what one core should do. For example, the following as input for parallel would source your .bash_profile, then run hostname, then run uptime on each core:

source ~/.bash_profile; hostname; uptime

source ~/.bash_profile; hostname; uptime

source ~/.bash_profile; hostname; uptime

source ~/.bash_profile; hostname; uptime

You can have as many commands as you like.  The parallel system will dole out processes to each of its cores as it finishes a previous one, so you can submit a file that has thousands of lines to a much smaller number of cores, and they will each be processed in turn.

Using this in your script is almost identical to running an MPI job. Rather than specifying mpirun at the end of your file, specify parallel with its options. Here is an example of doing this:

parallel --sshloginfile $PBS_NODEFILE -a batch_command_file
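
For reference, a minimal submission script for a bundled job might look like the following sketch, assuming the command file above is named batch_command_file (the job name, node count, and wall clock time are placeholders):

#!/bin/bash
#PBS -N my_bundled_job
#PBS -q default
#PBS -l nodes=12
#PBS -l walltime=1:00:00
#PBS -V

cd $PBS_O_WORKDIR

# hand one line of batch_command_file to each allocated core in turn
parallel --sshloginfile $PBS_NODEFILE -a batch_command_file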

Job management and monitoring

There are many ways to get information about the state of the cluster, queues, and running jobs. To get an overview of the state of the cluster, you can run "cstat". This will give you a summary of the number of processors used, pending, and free, along with various other real-time statistics.

To check what is running in the queue, you can use “showq” or “qstat”. This will show you all currently running, pending, and recently finished jobs.

To get more in-depth information on a job, you can use the command "checkjob jobid". This will show you the submission commands, the nodes it is running on, PIDs, and various other information.

To get a view of a currently running job, use "qpeek jobid". This will show you what will be written to your output file when the job is done. This can be useful for checking on where your job is at any given time.
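
As a quick reference, a typical monitoring session might look like this (the job id 12345 is just a placeholder):

cstat            # cluster-wide summary of used, pending, and free processors
showq            # all running, pending, and recently finished jobs (or use qstat)
checkjob 12345   # in-depth information on job 12345
qpeek 12345      # current output of job 12345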

Need help?

If you need help, there are a number of ways you can get it.  If you are having trouble compiling code that your group uses, the best resource is your colleagues. Many of them will already have the code running and can answer questions that are code specific.

If you are having technical difficulties, have questions, or wish to report a problem, the best way to get help is to email help-hpc@caltech.edu. This will generate a ticket, and many people are watching this system. If you suspect a problem of any sort, we encourage you to send an email here; a problem that you think is affecting just you may be more systemic, and your report will help everyone.
