Archive and Purge
Storage has gotten cheaper, but it is not free (Turbo, for example, is a paid service), so we should be diligent about archiving and purging. Generally, when a student leaves the lab (e.g., upon graduation), we should either archive their data to Data Den or purge it completely.
Furthermore, we can archive or purge data that have not been accessed for a while. You can use GUFI (the Grand Unified File Index) to gather access-time statistics. Indexing a large tree typically takes a long time, so it's a good idea to run it as a Slurm job. Here's an example Slurm script that builds a GUFI index of Turbo.
```bash
#!/bin/bash
### DO NOT USE GPU NODES!
#SBATCH --partition=standard
#SBATCH --time=02-00:00:00
#SBATCH --job-name=gufi
#SBATCH --mail-user=<your-email@umich.edu>
#SBATCH --mail-type=BEGIN,END
### Totally fine to use chaijy0
#SBATCH --account=chaijy0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=16GB
#SBATCH --output=/home/%u/%x-%j.log

module load singularity

# -n 8 (indexing threads) matches --cpus-per-task above
singularity exec gufi_master.sif gufi_dir2index -n 8 /nfs/turbo/coe-chaijy/ /scratch/chaijy_root/chaijy0/<uniqname>/GUFI
```
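Submit the script above with sbatch. As a quick, index-free spot check on a small directory tree, plain find can also list files not accessed in the last N days. A minimal sketch (the directory is a throwaway temporary one here; note that filesystems mounted with noatime will not record useful access times):

```shell
# Spot-check: list files not accessed in the last 180 days via find's -atime.
# Demonstrated on a temporary directory with a synthetic access time.
dir=$(mktemp -d)
touch "$dir/fresh.txt"                       # accessed just now
touch -a -d "200 days ago" "$dir/stale.txt"  # GNU touch: backdate atime only
# -atime +180 matches files whose last access was more than 180 days ago
stale_files=$(find "$dir" -type f -atime +180)
echo "$stale_files"
```

On Turbo you would point find at the directory in question instead of a temporary one; for trees with millions of files, the GUFI index above is far faster.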
Once the index is built, you can run various reporting commands against it:
summary.sh
```bash
#!/bin/bash
### DO NOT USE GPU NODES!
#SBATCH --partition=standard
#SBATCH --time=02-00:00:00
#SBATCH --job-name=gufi-summary
#SBATCH --mail-user=<your-email@umich.edu>
#SBATCH --mail-type=BEGIN,END
### Totally fine to use chaijy0
#SBATCH --account=chaijy0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=16GB
#SBATCH --output=/home/%u/%x-%j.log

module load singularity

# Bind /etc/passwd into the container so the report can map UIDs to usernames
singularity exec --bind /etc/passwd gufi_master.sif summary.sh /scratch/chaijy_root/chaijy0/<uniqname>/GUFI/coe-chaijy/ 180
```
dirsum.sh
```bash
#!/bin/bash
### DO NOT USE GPU NODES!
#SBATCH --partition=standard
#SBATCH --time=02-00:00:00
#SBATCH --job-name=gufi-dirsum
#SBATCH --mail-user=<your-email@umich.edu>
#SBATCH --mail-type=BEGIN,END
### Totally fine to use chaijy0
#SBATCH --account=chaijy0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2GB
#SBATCH --output=/home/%u/gufi-dirsum/%x-%j.log

module load singularity

# Define the array of directory names to summarize
DIR_NAMES=('dir1' 'dir2')

# Run the command in parallel for each directory in the array
for DIR_NAME in "${DIR_NAMES[@]}"; do
    # Create a folder for the parallel job's output
    mkdir -p "/home/<uniqname>/gufi-dirsum/$DIR_NAME"
    salloc -N1 -n1 --cpus-per-task=4 --mem-per-cpu=16GB --account=chaijy0 \
        srun --chdir="/home/<uniqname>/gufi-dirsum/$DIR_NAME" --output="/home/<uniqname>/gufi-dirsum/$DIR_NAME/${DIR_NAME}-%j.log" \
        singularity exec --bind /etc/passwd gufi_master.sif dirsum.sh "/scratch/chaijy_root/chaijy0/<uniqname>/GUFI/coe-chaijy/$DIR_NAME" 180 &
done

# Wait for all parallel tasks to complete
wait
```
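The launch-in-background-then-wait pattern used above is plain bash and can be exercised locally without Slurm. A minimal sketch with hypothetical task names:

```shell
# Fan out one background task per name, then block until all complete.
NAMES=('dir1' 'dir2' 'dir3')
results=$(mktemp -d)
for NAME in "${NAMES[@]}"; do
    # Stand-in for the real salloc/srun/singularity invocation
    ( sleep 0.1; echo "done $NAME" > "$results/$NAME" ) &
done
wait   # returns only after every backgrounded subshell has exited
ls "$results"
```

Because wait blocks until every child exits, the batch job stays alive for as long as the slowest of the parallel per-directory jobs.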
Once you’ve determined which data to archive, you can use archivetar to do so. See Data Den for more details.
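Data Den generally works best with a small number of large files, which is why archivetar bundles many small files into tarballs before transfer. As a generic illustration of the bundling step only (plain tar; this is not archivetar's interface, and the paths are made up):

```shell
# Bundle a directory into one compressed tarball prior to archiving.
# The source directory and its contents are illustrative placeholders.
src=$(mktemp -d)
echo "sample" > "$src/data.txt"
archive=$(mktemp --suffix=.tar.gz)
tar -czf "$archive" -C "$src" .
# Always verify the archive is readable before purging the originals
tar -tzf "$archive"
```

archivetar automates this at scale (e.g., splitting output into multiple tarballs); consult its documentation before running it on real lab data.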
Hands-on Tutorial
Here is a step-by-step tutorial for the exact storage management process. You may want to familiarize yourself with Data Den first before moving on.