Common Problems

Common problems and how to fix them.

My password doesn't work but I know it is the correct one.

First, try to log in to https://access.caltech.edu with the same username and password. If you can log in to Access Caltech, then you have the correct username and password for the cluster. If you still cannot log in to the cluster, you probably do not have the entitlement to do so. If your group has not yet been set up to access the cluster, that should be done first. The PI or admin of your group should be able to add you on the HPC admin console. Your PI can also contact us at [email protected] and we can do it for them.

I am getting a "connection refused" error message when trying to connect to the cluster.

The cluster is only accessible from on campus or via VPN, and you are likely not on either. If you already have VPN access, connect to the VPN first and then to the cluster. If you have a machine on campus that you can reach remotely, you can connect to that machine and then into the cluster.
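For the campus-machine route, one possible shortcut is OpenSSH's ProxyJump option (-J, available in OpenSSH 7.3 and later), which makes both hops in a single command. The sketch below is only an illustration: the function name hpc_via_campus is hypothetical, and it assumes your campus machine accepts SSH connections from wherever you are.

```shell
# Hypothetical helper: hop through a campus machine you can already reach,
# then continue on to the cluster login address.
#   $1 = user@host of your campus machine (assumed reachable from your location)
#   $2 = your cluster username
hpc_via_campus() {
    ssh -J "$1" "$2@login.hpc.caltech.edu"
}
```

After defining the function, you would call it with your own values, for example hpc_via_campus username@your-campus-machine username, where both arguments are placeholders.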

I frequently get disconnected while my SSH terminal is idle for extended periods of time.

If your SSH session is frequently disconnected and you are sure that the VPN, wireless, or local network is not at fault, you may want to try adding the following to your local SSH config (typically ~/.ssh/config).

ServerAliveInterval 60

or while connecting:

ssh -o "ServerAliveInterval 60" login.hpc.caltech.edu
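If you would rather not apply the keepalive to every host you connect to, the setting can be limited to the cluster with a Host block. The following is a minimal sketch: the function name print_hpc_keepalive is hypothetical, and ServerAliveCountMax 3 is simply OpenSSH's default written out, so the client gives up after three unanswered probes.

```shell
# Print a Host block that limits the keepalive setting to the cluster.
# Review the output, then append it to ~/.ssh/config yourself.
print_hpc_keepalive() {
    cat <<'EOF'
Host login.hpc.caltech.edu
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF
}

print_hpc_keepalive
```

Once you are happy with the output, print_hpc_keepalive >> ~/.ssh/config appends it to your local configuration.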

You may also want to try Mosh, which provides stateless SSH over UDP. Mosh is highly resilient to client-side network drops, IP changes, etc.

In your cluster-side ~/.bashrc, add the following.

module load mosh/1.4.0-gcc-11.3.1-72uzmod

On the client side, install mosh and then use the mosh command, rather than ssh, to connect to the cluster. Note that you will still need to be connected over Caltech's VPN service. When using mosh, you need to target an individual login node (login3/4) instead of the load balancer, as the load balancer prevents mosh from working correctly. We are looking into possible workarounds.

mosh [email protected]

I requested a lot of cores on a computer, but it is only using one.

If it is not an MPI job, then your executable may not be multithreaded, or you did not specify the number of threads. If you know your application is multithreaded using OpenMP but it is not using the additional cores, you may need to set the appropriate environment variable. Often you need to set something like the following:

export OMP_NUM_THREADS=32

export MKL_NUM_THREADS=32
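Rather than hard-coding 32, the thread count can be tied to the cores actually requested from Slurm. The sketch below assumes a batch job submitted with --cpus-per-task, in which case Slurm sets SLURM_CPUS_PER_TASK; the executable name in the final comment is hypothetical.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32

# Use the core count Slurm granted to this task. Fall back to 1 thread
# when the variable is unset (for example, when run outside of Slurm).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="$OMP_NUM_THREADS"

echo "Running with $OMP_NUM_THREADS threads"
# ./my_openmp_app    # hypothetical multithreaded executable
```

Keeping the two values linked means that changing --cpus-per-task is the only edit needed when you resize the job.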

Nested srun calls fail on GPU nodes.

If srun calls hang on GPU nodes after you start an interactive session, add this variable to your environment.

export SLURM_STEP_GRES=none

"Home directory not found" while connecting via Open OnDemand.

This usually occurs when a new user attempts their initial login via the graphical Open OnDemand front end rather than terminal-based SSH. Open OnDemand currently has no mechanism to create home directories, but SSH does. Once you log in via SSH for the first time, your home directory will be created and subsequent Open OnDemand sessions will function as intended.

Out of Memory while running a process on the login or vislogin nodes.

Cgroup memory limits are in effect on both the login and vislogin interactive nodes. To avoid this error, run processes that use over 8 GB of memory on compute nodes only. If the application requires that the job run on a login node, please contact HPC support.
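One way to move a memory-hungry interactive process off the login node is to request an interactive shell on a compute node with srun. The following is a rough sketch, not a site-specific recipe: the function name big_mem_shell and its default values are hypothetical, and your group may also need to supply a partition or account option.

```shell
# Hypothetical helper: open an interactive shell on a compute node with a
# memory request above the 8 GB login-node limit.
#   $1 = memory to request (default 16G)
#   $2 = time limit (default 1 hour)
big_mem_shell() {
    srun --ntasks=1 --mem="${1:-16G}" --time="${2:-01:00:00}" --pty bash -l
}
```

Calling big_mem_shell 32G 02:00:00 would request 32 GB for two hours; when the shell starts, you are on a compute node and can run the process there.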

Segfaults while running Python/Conda after connecting to cluster via Windows Subsystem for Linux (WSL).

You may notice that some remote applications on the cluster, such as Python, segfault when you are connected from WSL/Ubuntu terminals on Windows clients. In our experience, overriding the local LANG variable on the client side before connecting to the cluster resolves the issue.

For example:

export LANG=en_US.UTF-8

ssh login.hpc.caltech.edu

Another option is to set or correct it on the cluster side by updating ~/.bashrc or a similar startup file.

[ "$LANG" != "en_US.UTF-8" ] &&
echo "Setting LANG environment variable from $LANG to en_US.UTF-8" 1>&2 &&
export LANG="en_US.UTF-8"

libGL error: No matching fbConfigs or visuals found/error: failed to load driver: swrast while launching X11 forwarded applications.

Indirect GLX has been deprecated on the client side for many years. You can try re-enabling iglx with the following suggestions.

In Ubuntu 22.04 (and possibly other recent versions), edit or create /usr/share/X11/xorg.conf.d/50-iglx.conf with the following entries:

Section "ServerFlags"
    Option "AllowIndirectGLX" "on"
    Option "IndirectGLX" "on"
EndSection

Now reboot the client and try again.

Macs rely on XQuartz, so you can try setting the following default or downgrading to a version that has iglx enabled by default.

defaults write org.macosforge.xquartz.X11 enable_iglx -bool true

Restart XQuartz.