When your research requires intensive or sustained use of large language models, it may be beneficial to temporarily host a vLLM server on Great Lakes. This approach reduces the contention that arises when multiple users independently load and serve the same model on shared resources. For example, if several users need a Qwen model for inference, they can share a single server instead of each loading their own copy.
Important note: Hosting a vLLM server on Great Lakes is intended only for short-term use (e.g., a few days). The server will be shut down after the usage period, and the allocation should be coordinated with lab members in advance to ensure that it does not negatively impact others’ compute availability. Billing for the allocation should also be taken into account.
Host a vLLM Server on Great Lakes
To host a vLLM server on Great Lakes, you need to:
- Create a virtual environment for vLLM. We recommend using conda, as uv is currently unstable for vLLM on Great Lakes.
- Download the model weights to the shared directory: /nfs/turbo/coe-chaijy-unreplicated/pre-trained-weights/. This ensures consistency and avoids redundant downloads across users.
- Submit a Slurm job to host the vLLM server. Make sure you record the hostname of the compute node assigned to your job.
Note: This guide assumes a single-node setup, which is sufficient for most inference workloads. Multi-node configurations are not covered here.
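The three steps above can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not a tested Great Lakes recipe: the environment name vllm, the Python version, the job name, the resource requests, the time limit, and the helper names setup_vllm_env and write_vllm_job are all hypothetical. It also assumes conda is available both in your login shell and inside the batch job. The account chaijy0 and partition spgpu are taken from the squeue example in this guide; adjust them if yours differ.

```shell
#!/usr/bin/env bash
# Sketch only: names and resource values below are assumptions, not lab policy.

WEIGHTS_DIR=/nfs/turbo/coe-chaijy-unreplicated/pre-trained-weights

# Steps 1-2: create a conda environment and download the weights (run once).
# Usage: setup_vllm_env <hf_repo_id> <model_dir>
setup_vllm_env() {
  local hf_repo_id=$1 model_dir=$2
  conda create -y -n vllm python=3.11     # env name "vllm" is hypothetical
  conda run -n vllm pip install vllm      # pulls in huggingface_hub as a dependency
  conda run -n vllm python -c \
    'import sys; from huggingface_hub import snapshot_download; snapshot_download(repo_id=sys.argv[1], local_dir=sys.argv[2])' \
    "$hf_repo_id" "$WEIGHTS_DIR/$model_dir"
}

# Step 3: print a single-node batch script that serves the model.
# Usage: write_vllm_job <model_dir> [port]
write_vllm_job() {
  local model_dir=$1 port=${2:-8000}
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=vllm-server
#SBATCH --account=chaijy0
#SBATCH --partition=spgpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=48G
#SBATCH --time=2-00:00:00

hostname   # record this: it is the compute node the tunnel must target
source "\$(conda info --base)/etc/profile.d/conda.sh"
conda activate vllm
vllm serve "$WEIGHTS_DIR/$model_dir" --host 0.0.0.0 --port $port
EOF
}

# Usage (not run here):
#   write_vllm_job <model_dir> 8000 > vllm_server.sbatch
#   sbatch vllm_server.sbatch
```

Printing the batch script instead of submitting it directly lets you review the resource requests with lab members before anything is billed.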
You can obtain the hostname in one of the following ways:
- Run hostname directly on the node.
- Use squeue while the job is running.

Example:

    $ squeue -A chaijy0
       JOBID PARTITION     NAME   USER  ACCOUNT ST    TIME NODES NODELIST(REASON)
    38987150     spgpu vllm-ser  roihn  chaijy0  R 4:19:10     1 gl1509

In this example, the hostname is: gl1509.arc-ts.umich.edu
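If you prefer to script the lookup, the node name can be extracted from squeue output and expanded to the full hostname. The following is a minimal sketch: the function names are hypothetical, it assumes the .arc-ts.umich.edu suffix from the example applies to every compute node, and the parser assumes the default nine-column layout shown in the example with no spaces in the job name.

```shell
#!/usr/bin/env bash
# Expand a short node name (e.g. gl1509) to its full hostname.
node_fqdn() {
  printf '%s.arc-ts.umich.edu\n' "$1"
}

# Read default-format squeue output on stdin and print the NODELIST column
# of the first running job (ST == R). Skips the header row.
running_node_from_squeue() {
  awk 'NR > 1 && $6 == "R" { print $9; exit }'
}

# On Great Lakes (not run here):
#   node_fqdn "$(squeue -A chaijy0 | running_node_from_squeue)"
# For a known job ID, squeue can print only the node list itself:
#   node_fqdn "$(squeue -h -j <jobid> -o %N)"
```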
- Create an SSH tunnel to forward traffic from a local or lab server to the compute node hosting the vLLM server. We recommend running this command from one of the lab servers (e.g., whistler, aspen) so that other lab members can access the service.

    ssh -J <uniqname>@login.itd.umich.edu -N -f \
        -L <local_port>:<gl_host_name>:<gl_vllm_port> \
        <uniqname>@greatlakes.arc-ts.umich.edu

Parameters:
- <uniqname>: Your UMich uniqname
- <local_port>: Port on the local or lab server
- <gl_host_name>: Compute node hostname (e.g., gl1509.arc-ts.umich.edu)
- <gl_vllm_port>: vLLM server port on the node (default: 8000)

Flags:
- -N: Do not execute a remote command
- -f: Run the SSH tunnel in the background
  - If omitted, the terminal must remain open to keep the tunnel alive
- -L: Enable local port forwarding
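To avoid mistyping the port mapping, the tunnel command can be assembled from its four parameters. This is a small sketch; tunnel_cmd is a hypothetical helper name, and it only prints the command so that you can check it before authenticating.

```shell
#!/usr/bin/env bash
# Print the SSH tunnel command for the given parameters.
# Usage: tunnel_cmd <uniqname> <local_port> <gl_host_name> [gl_vllm_port]
# The vLLM port falls back to 8000 when the fourth argument is omitted.
tunnel_cmd() {
  local uniqname=$1 local_port=$2 gl_host=$3 gl_port=${4:-8000}
  printf 'ssh -J %s@login.itd.umich.edu -N -f -L %s:%s:%s %s@greatlakes.arc-ts.umich.edu\n' \
    "$uniqname" "$local_port" "$gl_host" "$gl_port" "$uniqname"
}

# Usage (prints the command; copy and run it once it looks right):
#   tunnel_cmd <uniqname> 22003 gl1509.arc-ts.umich.edu
```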
To terminate the tunnel, first identify the process using the local port:

    lsof -i :<local_port>

Example:

    $ lsof -i :22003
    COMMAND    PID  USER  FD  TYPE   DEVICE SIZE/OFF NODE NAME
    ssh     263790 roihn  4u  IPv6 62792077      0t0  TCP ip6-localhost:22003 (LISTEN)
    ssh     263790 roihn  5u  IPv4 62792078      0t0  TCP localhost:22003 (LISTEN)

Then terminate the process:

    kill <PID>

If you do not use -f when creating the SSH tunnel, you can also press Ctrl+C to terminate the process.
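The lookup and the kill can be combined into one step. The sketch below uses hypothetical helper names (pids_from_lsof, close_tunnel) and restricts lsof to listening TCP sockets, an added precaution so that client processes merely connected to the port are not killed along with the tunnel.

```shell
#!/usr/bin/env bash
# Read lsof output on stdin and print each unique PID once.
# The same ssh process usually appears twice: one row for IPv6, one for IPv4.
pids_from_lsof() {
  awk 'NR > 1 { print $2 }' | sort -u
}

# Stop the process that is listening on the given local port.
# Usage: close_tunnel <local_port>
close_tunnel() {
  local port=$1 pid
  for pid in $(lsof -iTCP:"$port" -sTCP:LISTEN | pids_from_lsof); do
    kill "$pid"
  done
}

# Usage (not run here): close_tunnel 22003
```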
- Verify the connection. To confirm that the tunnel is working, run a test script on the local or lab server using:
  - Host: localhost
  - Port: <local_port>

If the connection is successful, the script should communicate directly with the vLLM server running on the Great Lakes compute node.
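A quick check from the shell is to request the model list through the tunnel. This sketch assumes the server was started with vLLM's OpenAI-compatible server, which serves a /v1/models endpoint, and that curl is installed on the lab server; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build the URL of the model-listing endpoint behind the tunnel.
vllm_models_url() {
  printf 'http://localhost:%s/v1/models\n' "$1"
}

# Return 0 and print the JSON model list if the server answers on the
# forwarded port; return non-zero otherwise.
# -f: fail on HTTP errors, -sS: no progress bar but still report errors,
# --max-time 10: give up after 10 seconds.
check_vllm() {
  curl -fsS --max-time 10 "$(vllm_models_url "$1")"
}

# Usage (not run here): check_vllm 22003 && echo "tunnel OK"
```

A non-zero exit status usually means either that the tunnel is down or that the server has not finished loading the model yet.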