My password doesn't work but I know it is the correct one.
I am getting a "connection refused" error message when trying to connect to the cluster.
The cluster is only accessible from on campus or via VPN, and you are likely on neither. If you already have VPN access, connect to the VPN first and then to the cluster. If you have a machine on campus that you can reach remotely, connect to it first and then to the cluster from there.
I frequently get disconnected while my SSH terminal is idle for extended periods of time.
If your SSH session is frequently disconnected and you are sure the VPN, wireless, or local network is not at fault, try adding the following to your local SSH config (typically ~/.ssh/config):
ServerAliveInterval 60
or while connecting:
ssh -o "ServerAliveInterval 60" login.hpc.caltech.edu
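The ServerAliveInterval setting above can also be scoped to the cluster alone, so that other hosts keep their defaults. The following is a minimal sketch, not an official recipe: it assumes your per-user config lives at ~/.ssh/config, and `SSH_CONFIG_FILE` is a hypothetical override variable introduced here only for illustration.

```shell
# Path to the per-user SSH config. SSH_CONFIG_FILE is a hypothetical override
# used for illustration; by default the standard ~/.ssh/config is used.
config="${SSH_CONFIG_FILE:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$config")"

# Append a Host block for the cluster login address, but only once.
if ! grep -qs '^Host login.hpc.caltech.edu$' "$config"; then
  cat >> "$config" <<'EOF'

Host login.hpc.caltech.edu
    ServerAliveInterval 60
EOF
fi

# Keep the config readable and writable by the owner only.
chmod 600 "$config"
```

Running the snippet a second time leaves the file unchanged, since the Host line is already present.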
You may also want to try Mosh, which provides stateless SSH over UDP. Mosh is highly resilient to client-side network drops, IP address changes, etc.
In your cluster-side ~/.bashrc, add the following:
module load mosh/1.4.0-gcc-11.3.1-72uzmod
On the client side, install mosh and use it instead of ssh to connect to the cluster. Note that you still need to be connected through Caltech's VPN service. When using mosh, target an individual login node (login3 or login4) instead of the load balancer, because the load balancer prevents mosh from working correctly. We are looking into possible workarounds.
mosh username@login3.hpc.caltech.edu
mosh username@login4.hpc.caltech.edu
I requested a lot of cores on a computer, but it is only using one.
If it is not an MPI job, then your executable may not be multithreaded, or you may not have specified the number of threads. If you know your application is multithreaded with OpenMP but it isn't using the additional cores, you may need to set the appropriate environment variable. Often you need to set something like the following:
export OMP_NUM_THREADS=32
export MKL_NUM_THREADS=32
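Rather than hard-coding 32, the thread count can be taken from the job's allocation. This is a sketch under two assumptions: the job runs under Slurm, and the cores were requested with --cpus-per-task (or -c), which is when Slurm sets SLURM_CPUS_PER_TASK. The `threads` variable is just a local name used here.

```shell
# Match the thread count to the cores Slurm allocated to this task.
# SLURM_CPUS_PER_TASK is only set when the job requested --cpus-per-task (-c);
# fall back to a single thread otherwise.
threads="${SLURM_CPUS_PER_TASK:-1}"

export OMP_NUM_THREADS="$threads"
export MKL_NUM_THREADS="$threads"

# Report the choice on stderr so it shows up in the job's error log.
echo "Using $OMP_NUM_THREADS threads" 1>&2
```

Placed near the top of a batch script, before the application is launched, this keeps the thread count in step with whatever the job requested.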
Nested srun calls fail on GPU nodes.
If srun calls hang on GPU nodes after you start an interactive session, add this variable to your environment:
export SLURM_STEP_GRES=none
"Home directory not found" while connecting via Open OnDemand.
Out of Memory while running a process on the login or vislogin nodes.
Segfaults while running Python/Conda after connecting to cluster via Windows Subsystem for Linux (WSL).
echo "Setting LANG environment variable from $LANG to en_US.UTF-8" 1>&2 &&
export LANG="en_US.UTF-8"
libGL error: No matching fbConfigs or visuals found/error: failed to load driver: swrast while launching X11 forwarded applications.
with the following entries:
Section "ServerFlags"
Option "AllowIndirectGLX" "on"
Option "IndirectGLX" "on"
EndSection