Deploying a Server for Bioinformatics Research

Recently, our lab acquired a server from Inspur with a standard Ubuntu operating system, and I was tasked with configuring it to meet the specific requirements of our bioinformatics research. Given that my expertise in Linux is limited, I dedicated several days to this endeavor and eventually completed the deployment. The purpose of this guide is to help researchers with similar needs understand how to configure their servers. It also addresses potential issues that may arise during the configuration process, along with their solutions.

Create a new user (with root privileges)

Typically, the server comes with a default user named after the vendor, in my case inspur. This is a regular user that can gain root privileges by prefixing commands with sudo and entering its password. To create a custom account with similar privileges, follow these steps:

sudo useradd -d "/home/<user_name>" -m -s "/bin/bash" <user_name>
  • -d "/volume1/home/<user_name>" will set /volume1/home/<user_name> as home directory of the new Ubuntu account.
  • -m will create the user’s home directory.
  • -s "/bin/bash"will set /bin/bash as login shell of the new account. This command will create a regular account <user_name>. If you want <user_name> to have root privileges, type:
sudo useradd -d "/home/<user_name>" -m -s "/bin/bash" -G sudo <user_name>
  • -G sudo ensures <user_name> to have admin access to the system.
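
To confirm that the new account really has admin access, you can check its group membership (a quick sanity check; the output should list sudo among the groups):

id <user_name>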

To set the password of the new account, run:

sudo passwd <user_name>

After running the command, you will be prompted to type the password for the new account. Please note that Ubuntu will not display the password as you type it, not even as dots. Just type the password you want to set and press Enter.

Change terminal prompt (optional)

A beautiful terminal prompt can bring a beautiful day. To change the terminal prompt, execute

cd ~
vim .bashrc

You will see a paragraph like this:

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
#force_color_prompt=yes

Uncomment the last line, #force_color_prompt=yes. Below this paragraph you will also see some code:

if [ "$color_prompt" = yes ]; then
    PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
    PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi

Modify the first PS1:

PS1='\[\033[35m\]\t\[\033[m\]-\[\033[36m\]\u\[\033[m\]@\[\033[32m\]\h:\[\033[33;1m\]\w\[\033[m\]\$ '

This is my PS1 value (\t prints the current time, \u the user name, \h the hostname, and \w the working directory; the \[\033[...m\] sequences set the colors). Save the .bashrc file, close your current terminal and open a new one. The terminal prompt will look like this:

23:02:02-tdeng@inspur-NP5570M5:~/data$

Enable remote access

If you wish to access the server from outside its physical location, you need to enable remote access. In my case, I connected the server to the campus network, allowing me to access it from any location within the campus. To enable remote access, you need to install the openssh-server:

sudo apt update
sudo apt install openssh-server

If the firewall UFW is enabled, make sure to open the SSH port:

sudo ufw allow ssh

To test from a Windows system whether the server's SSH port is reachable:

telnet <remote_ip> <remote_port>

This website might be useful.
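
Once the port is reachable, you can log in from another machine with an SSH client (the command below assumes the default port 22; add -p <remote_port> if you changed it):

ssh <user_name>@<remote_ip>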

When you log in to the server using the newly created user with bash, you might encounter an error like this:

/usr/bin/xauth: file /home/<user_name>/.Xauthority does not exist

Solution (the home directory is not owned by the new user, so fix the ownership recursively):

chown <user_name>:<user_name> -R /home/<user_name>

Connect to GitHub

I suppose you already have a GitHub account. Install git first:

sudo apt install git
git --version

Then configure certain information about your GitHub account:

git config --global user.name "<github_account_name>"
git config --global user.email "<github_account_email>"

Connect to GitHub:

ssh-keygen -C "<github_account_email>" -t rsa  # default: just press Enter 3 times
cd ~/.ssh
vim id_rsa.pub  # open the id_rsa.pub file

Finally, copy the text in id_rsa.pub, log in to GitHub, and create an SSH key at Settings → SSH and GPG keys → New SSH key.

Test the connection:

ssh -T git@github.com

Configure the Python environment

Install miniforge

Instead of Anaconda, I decided to use Miniforge to manage multiple Python environments. It has several advantages over Anaconda:

  • The conda-forge channel is set as the default channel. So you don’t need to type -c conda-forge.
  • It uses Mamba, a very fast package manager (Anaconda can also use the libmamba solver, but extra steps are required to set conda-libmamba-solver as the default solver).

You can consider Miniforge as an alternative to Anaconda. You can replace the conda command with mamba for a better interface, or you can simply keep using the conda command for a seamless replacement. Below I will only show the former approach.
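
As a quick illustration (test_env is just a throwaway name for demonstration), the following two commands are interchangeable in Miniforge:

mamba create --name test_env python=3.11  # Miniforge's recommended interface
conda create --name test_env python=3.11  # drop-in replacement for existing scripts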

To install Miniforge, just follow the installation guide in its README. Here I copy the core commands:

wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

I prefer to install Miniforge at /usr/local/miniforge3 so that the environments can be shared by all users (but only users with root privileges can modify them). You don’t need to create this folder in advance; during the installation you will have a chance to specify the installation directory.

To initialize mamba, run

/path/to/mamba init  # /usr/local/miniforge3/bin/mamba in my case

and reopen the terminal.
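
To verify that the initialization worked, the new terminal should show the (base) prefix in its prompt, and the following commands should run without errors:

mamba --version
mamba env list  # lists the base environment (and any others)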

Create/delete environments

I recommend creating new environments and installing site packages with root privileges (sudo su) to prevent regular users from modifying the environments. If a regular user wants to update an environment, they should contact the system administrator for assistance. If they instead quietly run a command like

mamba update --all

the update plan will proceed but eventually fail with an error like:

Confirm changes: [Y/n] y

frozendict                                          49.0kB @  60.0kB/s  0.8s
libzlib                                             61.6kB @  72.1kB/s  0.9s
lzo                                                171.4kB @ 168.4kB/s  1.0s
menuinst                                           137.7kB @ 131.9kB/s  1.0s
libsolv                                            470.7kB @ 324.7kB/s  1.4s
conda                                              961.2kB @ 558.1kB/s  0.9s

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: failed
The current user does not have write permissions to the target environment.
  environment location: /usr/local/miniforge3
  uid: 1000
  gid: 1000



EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /usr/local/miniforge3
  uid: 1000
  gid: 1000
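
A minimal sketch of the intended admin workflow, assuming the shared installation lives at /usr/local/miniforge3 as above:

sudo su  # switch to root
/usr/local/miniforge3/bin/mamba create --name <new_env_name> python=3.11 --no-default-packages
/usr/local/miniforge3/bin/mamba install --name <new_env_name> <package_name>
exit  # return to the regular user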

The commands for creating a new environment are:

# create with a specified name
mamba create --name <new_env_name> python=3.11 --no-default-packages
# create with a specified location; regular users can use this command to create an environment in their home directory
mamba create --prefix /path/to/directory python=3.11 --no-default-packages
  • --name <new_env_name> will set the name of the new environment.
  • --prefix /path/to/directory will set the path to the directory where you want to create the environment
  • python=3.11 means mamba will install Python 3.11 in the new environment.
  • --no-default-packages will only install Python. No other site packages will be included.

I did not modify the base environment and proceeded to create two new environments: jupyter and bio. jupyter only contains packages related to jupyterhub, while bio encompasses all the necessary packages for research purposes.

If you wish to delete an environment for any reason, utilize the following command:

# delete with a specified name
mamba remove --name <env_name> --all
# delete with a specified location
mamba remove --prefix /path/to/directory --all

Install Python packages

JupyterHub

You may want to install JupyterHub, which serves Jupyter notebook for multiple users.

mamba install jupyterhub jupyterlab notebook jupyter-lsp-python jupyterlab-lsp jupyterlab-git

I recommend installing jupyterlab-lsp, a powerful coding assistant for JupyterLab. Another useful plugin is jupyterlab-execute-time, which can display cell timings in JupyterLab. Use the following command to install it:

mamba install jupyterlab_execute_time

Refer to this website for the configuration of JupyterHub.

Refer to this website for how to run JupyterHub as a system service.

Refer to this website for how to start the service on boot. The key command is

sudo systemctl enable jupyterhub
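
For completeness, here is a minimal sketch of such a unit file. The paths are assumptions (a jupyter environment under /usr/local/miniforge3/envs/jupyter and a config file at /path/to/jupyterhub_config.py), so adapt them to your setup:

sudo tee /etc/systemd/system/jupyterhub.service > /dev/null << 'EOF'
[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
ExecStart=/usr/local/miniforge3/envs/jupyter/bin/jupyterhub -f /path/to/jupyterhub_config.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload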

Starting from version 5.0, you must modify the jupyterhub_config.py file to grant access to users who authenticate successfully. Check this official tutorial out.
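
For example (assuming your config file lives at /path/to/jupyterhub_config.py), the simplest change is to allow every user who authenticates successfully:

echo 'c.Authenticator.allow_all = True' | sudo tee -a /path/to/jupyterhub_config.py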

Add/delete an environment as a kernel

To add an environment as a kernel:

mamba activate <env_name>  # or /path/to/directory if you create the env with --prefix
mamba install ipykernel  # if the env doesn't contain this package
python -m ipykernel install --name <kernel_name>

These commands add the <env_name> environment as a kernel named <kernel_name>. If your Python version is 3.11, you may need to modify the last command:

python -Xfrozen_modules=off -m ipykernel install --name <kernel_name>

To delete a kernel:

jupyter kernelspec list
jupyter kernelspec uninstall <kernel_name>

Other packages

Our research involves deep learning, so I need to install pytorch along with other required packages. RAPIDS provides a series of packages that utilize GPUs. These packages are easier to install in a fresh environment, so I recommend installing them first, following the RAPIDS Installation Guide. pytorch can be installed at the same time as part of that guide.
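
The exact command is generated by the selector in the RAPIDS Installation Guide; it typically looks like the sketch below, where the channels are real but the version numbers are placeholders you must fill in from the guide:

mamba create --name <new_env_name> -c rapidsai -c conda-forge -c nvidia \
    rapids=<rapids_version> python=3.11 cuda-version=<cuda_version> pytorch

The remaining packages can then be installed into the same environment: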

mamba install ipykernel ipywidgets # for running in JupyterHub
mamba install lightning  # for deep learning tasks
mamba install pyro-ppl numpyro  # for probabilistic programming
mamba install scanpy squidpy omicverse biopython rpy2 opencv   # for biological analysis
mamba install anndata2ri -c bioconda  # for conversion between Python and R
mamba install xgboost lightgbm catboost hdbscan optuna  # for machine learning tasks (optional)

You can use mamba search <package_name> to look for a package with a specific build number. To install a specific version/build of a package, run:

mamba install <package_name>=<version>=<build_string>

Check pytorch/tensorflow

If you are also a user of pytorch or tensorflow and you have one or more available GPU(s), you can execute the following code to verify whether the GPU(s) can be recognized and utilized by the respective deep learning frameworks:

import torch
import tensorflow as tf

# check pytorch and cuda in use
print(torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

# check tensorflow
print(tf.config.list_physical_devices('GPU'))

Here I also provide a script to ensure that pytorch can use the GPU(s) to train and test neural networks:

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


device = torch.device('cuda')
num_epochs = 50
batch_size = 512
learning_rate = 0.01

# define image preprocessing
transform = transforms.Compose([
    transforms.Pad(4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32),
    transforms.ToTensor()])

# download the CIFAR-10 dataset
train_dataset = torchvision.datasets.CIFAR10(root='data/',
                                             train=True,
                                             transform=transform,
                                             download=True)

test_dataset = torchvision.datasets.CIFAR10(root='data/',
                                            train=False,
                                            transform=transforms.ToTensor())

# load data
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           num_workers=4,            # number of subprocesses to use for data loading
                                           pin_memory=True,          # the data loader will copy Tensors into CUDA pinned memory before returning them
                                           prefetch_factor=4,        # number of batches loaded in advance by each worker
                                           persistent_workers=True,  # the data loader will not shutdown the worker processes after a dataset has been consumed once
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          num_workers=4,
                                          pin_memory=True,
                                          prefetch_factor=4,
                                          persistent_workers=True,
                                          shuffle=False)

# 3x3 convolution kernel
def conv3x3(in_channels, out_channels, stride=1):
    return nn.Conv2d(in_channels, out_channels, kernel_size=3,
                     stride=stride, padding=1, bias=False)

# define the residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv3x3(in_channels, out_channels, stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(out_channels, out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

# define the structure of ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = conv3x3(3, 16)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0])
        self.layer2 = self.make_layer(block, 32, layers[1], 2)
        self.layer3 = self.make_layer(block, 64, layers[2], 2)
        self.avg_pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(64, num_classes)

    def make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                conv3x3(self.in_channels, out_channels, stride=stride),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out


model = ResNet(ResidualBlock, [2, 2, 2]).to(device)
# model = nn.DataParallel(model)  # uncomment this line if you have multiple GPUs


# define loss function
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# function for the update of learning rate
def update_lr(optimizer, lr):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

# train the ResNet
total_step = len(train_loader)
curr_lr = learning_rate
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # forward step
        outputs = model(images)
        loss = criterion(outputs, labels)

        # backward step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # report every 10 steps
        if (i+1) % 10 == 0:
            print ("Epoch [{}/{}], Step [{}/{}] Loss: {:.4f}"
                   .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

    # update learning rate
    if (epoch+1) % 20 == 0:
        curr_lr /= 3
        update_lr(optimizer, curr_lr)

# test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the model on the test images: {} %'.format(100 * correct / total))

Note that if you have multiple GPUs, you need to uncomment the line below the code which creates the model:

# model = nn.DataParallel(model)

You can use this command to monitor the GPU(s) during training:

watch -n 0.2 nvidia-smi

Configure the R environment

Install R

The simplest way to install R >= 4.0 is to run

sudo apt-get install r-base

However, it will not give you the latest version of R. To get the latest version, refer to this website and this official website. Here I copy the core commands:

# update the package list from repositories
sudo apt update
# install without confirmation
sudo apt install software-properties-common dirmngr -y
# download the R project public key and add it to the trusted list of GPG keys used by apt
wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# verify the key; the fingerprint should be E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
# add the CRAN repository for your version of Ubuntu to the list of sources apt uses to install packages
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"
# install R and its development packages
sudo apt install r-base r-base-dev -y

Install RStudio

Follow the official installation guide. This should be easier than installing JupyterHub.

Install R packages

As an example, let’s install one of the most prevalent R packages in the field of single-cell genomics, Seurat (version 5). Before the installation, you need to install some system-level dependencies:

sudo apt-get install build-essential libssl-dev libcurl4-openssl-dev libxml2-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev libhdf5-dev libgsl-dev

Then the process of installing Seurat should be very smooth:

sudo R
chooseCRANmirror(graphics=FALSE)
install.packages("Seurat")

Additional packages can be installed to enhance the functionality of Seurat. Check out the official installation tutorial of Seurat. If you intend to install an extremely large R package, you’d better set a longer timeout:

options(timeout=999)
install.packages("<large_package>")

Other useful R packages are:

  • devtools for package development
  • tidyverse for general data analysis
  • tidyomics for omics data analysis
  • ComplexHeatmap for visualizing matrices
install.packages(c("devtools", "tidyverse"))
BiocManager::install(c("tidyomics", "ComplexHeatmap"))

When running devtools::install_github(), you may encounter an error complaining that the API rate limit has been exceeded. The solution to this issue is to create a GitHub token.

usethis::create_github_token()

Run this code in your RStudio console and log in to your GitHub account. Click Settings → Developer settings → Personal access tokens → Tokens (classic) (if the browser does not automatically direct you to this page) and generate a token. Run

gitcreds::gitcreds_set()

also in your RStudio console to add the token. The limit should be relaxed and you can continue the installation.

Synchronize data

Refer to this website for detailed instructions on how to synchronize data stored on another server.

The key command is

rsync -r /path/to/sync/ <username>@<remote_host>:<destination_directory>

which “pushes” all contents of /path/to/sync/ on the local system to <destination_directory> on the remote system.
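
A commonly used variant (a sketch, not a command from that guide) adds archive mode, compression, and a progress display:

rsync -avz --progress /path/to/sync/ <username>@<remote_host>:<destination_directory>
# -a: archive mode (preserves permissions, timestamps, symlinks); -v: verbose; -z: compress during transfer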

If you are synchronizing a large file, you may want to monitor its progress:

watch -n <time_interval> du -sh /path/to/large/file

Install some basic fonts

By default, some basic fonts in Windows are not installed in Linux, such as Arial and Times New Roman. These fonts are commonly used in papers and websites, and having them installed can improve the display of figures that expect these fonts to be available. You can install them by:

sudo apt install msttcorefonts
rm -rf ~/.cache/matplotlib

The msttcorefonts package is a collection of TrueType fonts from Microsoft. The second command clears the matplotlib cache directory located in the hidden .cache directory in the user’s home directory.
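
To check that the fonts are now visible to the system (and hence to matplotlib), you can query fontconfig, which should be available on a standard Ubuntu installation:

fc-list | grep -i "arial\|times"  # should list the newly installed font files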

Troubleshooting

Driver/library version mismatch

When you run nvidia-smi, you may get

Failed to initialize NVML: Driver/library version mismatch

This answer from Stack Overflow may help. Briefly, you can either reboot or unload the nvidia kernel modules (a sketch of the latter follows the commands below). However, if neither approach helps, you need to reinstall the Nvidia drivers:

sudo apt purge nvidia* libnvidia*
sudo ubuntu-drivers install

and then sudo reboot your server.
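
If you want to try unloading the kernel modules before resorting to a full reinstall, one common sequence is the sketch below (the exact module list is an assumption about your system; any processes using the GPU must be stopped first):

sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia  # unload in dependency order
nvidia-smi  # should reload the driver and report normally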

Upgrade Nvidia drivers

You can upgrade the Nvidia driver by these steps:

# clean the installed version
sudo apt purge *nvidia* -y
sudo apt remove *nvidia* -y
sudo rm /etc/apt/sources.list.d/cuda*
sudo apt autoremove -y && sudo apt autoclean -y
sudo rm -rf /usr/local/cuda*

# find recommended driver versions
ubuntu-drivers devices  # or sudo apt search nvidia

# install the latest version (replace `550` with the latest version number)
sudo apt install libnvidia-common-550-server libnvidia-gl-550-server nvidia-driver-550-server -y

# reboot
sudo reboot now

After the reboot, you can check whether the new driver works with nvidia-smi (although you may also need to install nvidia-utils-550-server). Theoretically nvidia-smi should work, but you may still get an error message

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

even though you have installed the latest driver. In this case, you can try reinstalling the kernel headers:

sudo apt install --reinstall linux-headers-$(uname -r)

If you encounter errors like cc: error: unrecognized command-line option ‘-ftrivial-auto-var-init=zero’, you can use gcc 12 instead of gcc 11:

sudo apt-get install gcc-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

After the headers are reinstalled, sudo reboot the server once more. nvidia-smi should now work.

Now, your server should be well-suited for your bioinformatics research and you know what to do when things go wrong. Enjoy it!