Preparing Data for Archiving

After postprocessing and analyzing your output data you may want to archive parts of your data to free up disk space for new model runs. DKRZ's HPSS tape archive is well equipped to handle large amounts of data going into the archive and also retrieval back to the Lustre file system on Mistral. However, you should follow a few guidelines for efficient use which ensure that you can retrieve the data almost as fast as you can write them into the archive.

File size

It is important to bundle smaller files which belong together, i.e. files you are likely to need at the same time when you later retrieve them from tape. If you do not bundle them, they can end up on separate tape cartridges, which makes retrieval very time-consuming. To encourage this type of usage, every file on tape is accounted as at least 1 GB, no matter how small it is.

On the other hand, it does not make sense to create needlessly large archive files if you may later be interested in only a small portion of the data, because the whole file has to be retrieved from tape to get at that portion.
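As a minimal sketch of bundling, the following commands pack one directory of small files into a single uncompressed tar file and list its contents as a check. The directory name exp001 and the sample files are hypothetical and only created here so that the example runs on its own.

```shell
# Hypothetical layout: many small output files belonging to one experiment.
workdir=$(mktemp -d)
mkdir -p "$workdir/exp001"
for i in 1 2 3; do
    printf 'sample output %s\n' "$i" > "$workdir/exp001/out_$i.txt"
done

# Bundle the files that belong together into a single tar file.
tar -cf "$workdir/exp001.tar" -C "$workdir" exp001

# List the contents to verify the bundle before archiving it.
bundle_listing=$(tar -tf "$workdir/exp001.tar")
printf '%s\n' "$bundle_listing"
```

Bundling per experiment or per time period keeps each archive file limited to data you are likely to retrieve together.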

Compression

Compression can save storage space on tape and disk but requires CPU time to compress and decompress the data. It can also make processing workflows more complex. Ideally, you should use a data format which already supports compression like GRIB or NetCDF. If you have to use file formats that don't support compression or if you don't want to enable compression in the data format then consider the following.

Gzip

The tar command supports several compression methods just by selecting them with command line switches. A very popular method is gzip.

tar czvf archive.tar.gz data

This bundles all files in the data directory into the file archive.tar.gz, which is compressed with gzip. The advantage of gzip is that most users will know how to handle .tar.gz files. The disadvantages are that the compression ratio could be better and that gzip uses only one CPU core, which makes it very slow for large amounts of data.
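Before deleting the original files it can be worth checking that the archive is readable and complete. The following sketch, which uses a small hypothetical data directory created in a temporary location, tests the gzip stream, lists the archive, and compares a restored copy with the original.

```shell
# Self-contained demo: a small hypothetical "data" directory in a temp location.
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/restore"
printf 'temperature 271.3\n' > "$workdir/data/sample.txt"

# Create the gzip-compressed archive (same switches as above, without verbose output).
tar -czf "$workdir/archive.tar.gz" -C "$workdir" data

# Check the integrity of the compressed stream without unpacking it.
gzip -t "$workdir/archive.tar.gz" && integrity=ok

# List the contents, then extract into a separate directory.
tar -tzf "$workdir/archive.tar.gz"
tar -xzf "$workdir/archive.tar.gz" -C "$workdir/restore"

# Compare the restored files with the originals before deleting anything.
diff -r "$workdir/data" "$workdir/restore/data" && roundtrip=ok
echo "integrity: $integrity, round trip: $roundtrip"
```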

Pigz

Pigz creates compressed files which can be uncompressed with gzip but utilizes many CPU cores at once to speed up the compression process.

module load pigz
tar cvf - data | pigz > archive.tar.gz

The compressed file will be about the same size as with gzip, but on a Mistral node you will have your result about six times faster.
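Because pigz writes gzip-compatible output, the result can be checked and unpacked with plain gzip and tar. The sketch below assumes a small hypothetical data directory; the thread count of 4 (set with pigz's -p switch) is an arbitrary choice, and the fallback to gzip is only there so that the example also runs on a machine without pigz.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
printf 'sample record\n' > "$workdir/data/sample.txt"

# Use pigz with a fixed number of threads if available, otherwise plain gzip.
compress() {
    if command -v pigz >/dev/null 2>&1; then
        pigz -p 4
    else
        gzip
    fi
}

# tar writes the bundle to stdout, the compressor reads it from stdin.
tar -cf - -C "$workdir" data | compress > "$workdir/archive.tar.gz"

# The output is a regular gzip stream: test and list it with gzip and tar.
gzip -t "$workdir/archive.tar.gz" && integrity=ok
listing=$(gzip -dc "$workdir/archive.tar.gz" | tar -tf -)
printf '%s\n' "$listing"
```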

7z

The 7z format supports different compression methods, but the default method is the very efficient, albeit CPU-intensive, LZMA.

tar cvf - data | 7z a -si archive.tar.7z

7z produces smaller files than gzip while still being about 3.5 times faster. It achieves this by utilizing many CPU cores of a Mistral node.
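Unpacking such an archive reverses the pipe: 7z writes the stored tar stream to standard output with the -so switch and tar reads it from standard input. The following sketch assumes a small hypothetical data directory and skips the round trip if no 7z command is installed.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/restore"
printf 'sample record\n' > "$workdir/data/sample.txt"

if command -v 7z >/dev/null 2>&1; then
    # Pack: tar writes to stdout, 7z reads the stream from stdin (-si).
    tar -cf - -C "$workdir" data | 7z a -si "$workdir/archive.tar.7z" >/dev/null

    # Unpack: 7z writes the tar stream to stdout (-so), tar reads it from stdin.
    7z x -so "$workdir/archive.tar.7z" | tar -xf - -C "$workdir/restore"

    # Compare the restored copy with the original.
    diff -r "$workdir/data" "$workdir/restore/data" && roundtrip=ok
else
    # 7z is not installed here, so nothing was tested.
    roundtrip=skipped
fi
echo "7z round trip: $roundtrip"
```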

Comparison

We compressed 166 GiB of uncompressed model input data files in NetCDF format.

Method   Size [GiB]   Real time [s]   User + sys time [s]
gzip          14.03            2341                  2601
pigz          13.93             418                  6515
7z            10.84             672                 22853

These measurements can only be a rough guideline because much depends on your type of data.

Takeaway points:

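  • Parallel methods (pigz, 7z) can speed up compression considerably (here by a factor of 3.5 to 5.6 in real time)
  • LZMA (7z) needs a lot of accounted compute power (here almost nine times the CPU time of gzip) which may not be justified by the smaller output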
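The factors behind these points can be recomputed from the measurements in the comparison table. The sketch below copies the table values into a shell variable and uses awk to derive the compression ratio relative to the 166 GiB input, the real-time speedup over gzip, and the CPU time relative to gzip; the variable names are chosen for illustration only.

```shell
# Values copied from the comparison table: method, size [GiB], real [s], user+sys [s]
results='gzip 14.03 2341 2601
pigz 13.93 418 6515
7z 10.84 672 22853'

summary=$(printf '%s\n' "$results" | awk -v input_gib=166 '
    NR == 1 { base_real = $3; base_cpu = $4 }   # first row (gzip) is the baseline
    {
        printf "%-5s ratio %.1f  speedup %.1fx  cpu cost %.1fx\n",
               $1, input_gib / $2, base_real / $3, $4 / base_cpu
    }')
printf '%s\n' "$summary"
```

Running the same calculation on your own timings gives a quick impression of whether the smaller output of a method is worth its additional CPU time for your type of data.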
