
Tech Talk: Pack'em ESMs! Tools for (un-)archiving Earth system model data

Archiving large amounts of data files via pftp to the HPSS (tape system) is time-consuming and takes some manual work. A new set of command line tools called "packems" was released for all DKRZ users with this Tech Talk. These tools simplify the archival and retrieval of data files to/from the HPSS – including the creation of tarballs and keeping a record of which files are stored where.
When
Oct 27, 2020 from 03:15 PM to 04:15 PM (Europe/Berlin / UTC+1)

Archiving large amounts of data files via pftp to the HPSS (tape system) is time-consuming and takes some manual work: (a) users have to aggregate small files into tarballs of 10 GB – 100 GB in size; (b) operations with pftp are not easy to automate. Additionally, retrieving small individual files that were archived in tarballs is problematic if users do not remember which files were put into which tarball. Many users might have experienced these issues.
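To make the manual work in (a) concrete, the following is a minimal Python sketch of grouping files into tarball-sized batches and recording which file ends up in which tarball. It is only an illustration and not how packems is implemented: the function names `plan_tarballs` and `build_index`, the 50 GB target size, and the tarball naming scheme are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

GB = 10**9  # decimal gigabyte, used only for this illustration

FileEntry = Tuple[str, int]  # (path, size in bytes)


def plan_tarballs(files: List[FileEntry], target_size: int = 50 * GB) -> List[List[FileEntry]]:
    """Greedily group files, in the given order, into batches of at most
    roughly target_size bytes. A single file larger than target_size
    gets a batch of its own."""
    batches: List[List[FileEntry]] = []
    current: List[FileEntry] = []
    current_size = 0
    for path, size in files:
        # Close the current batch if this file would push it over the target.
        if current and current_size + size > target_size:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


def build_index(batches: List[List[FileEntry]], prefix: str = "archive") -> Dict[str, str]:
    """Map each file path to the (hypothetical) tarball name it was assigned to."""
    index: Dict[str, str] = {}
    for number, batch in enumerate(batches, start=1):
        tar_name = f"{prefix}_{number:04d}.tar"
        for path, _size in batch:
            index[path] = tar_name
    return index


if __name__ == "__main__":
    example = [("a.nc", 30 * GB), ("b.nc", 30 * GB), ("c.nc", 10 * GB)]
    for path, tar_name in build_index(plan_tarballs(example)).items():
        print(path, "->", tar_name)
```

Keeping such an index next to the tarballs is what makes later retrieval of a single small file practical: the index says which tarball to fetch.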


That's why a new set of command line tools called "packems" was made available for all DKRZ users. These tools simplify the archival and retrieval of data files to/from the HPSS – including the creation of tarballs and keeping a record of which files are stored where. The "packems" tool set is a joint development of the MPI-M and the DKRZ.

In this TechTalk, Karl-Hermann Wieners (MPI-M) and Daniel Heydebreck (DKRZ) present this new tool which is now available on DKRZ's mistral system directly via a module.


This TechTalk on YouTube: https://youtu.be/NrwTlWpXi7M

Please find all info on packems at https://code.mpimet.mpg.de/projects/esmenv/wiki/Packems

Code snippets at https://pad.gwdg.de/s/SyPRpLH_w

Help on Kerberos: https://www.dkrz.de/up/systems/hpss/pftp-with-kerberos (or Kalle’s tapeinit)

For questions regarding packems contact

Q&A:

      • Can multiple files be archived with an asterisk path declaration, e.g. ‘/work/<somepath>/Scenario*ECHAM5.nc’?
        • Write the matching files and their sizes to a list and pass it to packems:
          find /work/<somepath>/Scenario*ECHAM5.nc -printf '%p %s\n' > files.txt
          packems [...] -i files.txt
        • Or pipe the list directly into packems:
          find /work/<somepath>/Scenario*ECHAM5.nc -printf '%p %s\n' | packems [...] -i -
      • Would you consider providing these tools to external institutions (like DWD)?
        • The software is BSD-licensed - take it and run.
      • Can this package also be installed on other HPC systems?
        • You need Python 3. We have only tested with GNU make and GNU tar. You might have to adapt the scripts to use something other than pftp.
      • Can the packages for archiving be used for automatic archiving in a post processing bash code?
        • Yes. It’s meant to be used in your job scripts.
      • Does packems also include some sort of optimized compression, like that of “compresm”?
        • That’s separate. You’ll have to compress beforehand (e.g. with compresm).
      • Are you documenting these tools and providing a user guide on the wiki?
      • Is it primarily a command line tool? Is there also a python module to import?
        • Packems/listems/unpackems are command line tools written in Python. You should be able to import functions from the package's code. Packems and unpackems generate Makefiles, which contain the actual calls of tar and pftp. Please contact Kalle or Daniel to get details.
      • Is there a possibility to re-scan the archived files if I “lose” the index_list.txt in my home directory somehow?
        • The index_list.txt in ~/.packems contains the locations of INDEX.txt in the HPSS. If you just lose this list file, you can easily re-create it by iterating all your directories at the HPSS and searching for INDEX.txt. The tape system has lists of all files in a project in [YOUR_PROJECT_ID]/_PROJECT.[YOUR_PROJECT_ID].file-list.GIGA files. Fetch those and use grep on them.
        • The individual INDEX.txt files are stored in the HPSS (these files are small and hence not stored on tape but in the HPSS disk cache). One INDEX.txt is stored per directory on the HPSS. If you lose these files, you cannot recreate them without retrieving all archived tar files.
      • Is there any control before archiving files if the size is too small? How long did this development take?
        • No check on minimum sizes. But our tape system will charge you a minimum of 1 GB per uploaded file.
        • For details on file sizes on HPSS, see: https://www.dkrz.de/up/systems/hpss/hpss
        • Roughly half a year. Unpacking was harder than packing.
      • Is there a means for people to access their archived data from a different system than the one it was archived from (e.g. their workstation)? Guess yes, since I just learned that the indices live in the archive.
        • You need pftp access. It’s only implemented for mistral.
        • You can retrieve all your INDEX.txt files to your local file system and use listems locally to figure out which tarballs you need.
        • Different ftp clients could possibly work, but might cause some issues.
      • Is there a direct file size command? Also human-readable, like byte -> TB?
        • Use --long (ls -l style) to get the file size in listems.
        • No, we don't have an option like -h for human-readable output yet; unfortunately, -h is already reserved for our command line help output.
      • Is there a dry-run option for unpackems to see the tar and pftp commands?
        • Yes, -n or --dry-run (like make).
      • Can I create index files for archives not created with packems? Would I have to download files to create the index? Yes, some example would be cool! Thanks!
        • If you want the details of the content - yes, you have to download the tarballs.
        • You can also just add the whole tarball (actually, you can convert the _PROJECT…. file list from the tape archive into a basic index file). I’m developing a bit of Python for this. I’ll add it to the show notes. -- Flo Ziemen
      • Do I have to create the destination path on tape before running packems?
        • No, but you need the (usual) permissions for creating files / directories.
      • Are there any checks after upload? Was everything successful? Is the number of files complete? Are all files complete (in terms of size)?
        • File sizes are checked.
        • By default, packems will continue after a failed upload. Use --fail-fast to terminate on the first failed file.
      • How should a long packems job be submitted, e.g. via nohup? Is this also possible without computing resources left in any of the DKRZ groups I am a member of?
        • Use sbatch for batch jobs.
        • The packems wrapper can be useful here.
        • You can submit jobs from an “empty” project – They will spend time in the queue but should run.
      • What will happen with packems when the new HPC is there?
        • The new HSM system will bring new tools that are nicer than pftp.
        • A port to the new HSM system is planned, especially for listems and unpackems, so that archived data can still be retrieved.
      • Who is maintaining packems in the future? Will it always be Daniel? Or is it related to a working group?
        • Kalle and Daniel for the foreseeable future.
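
Regarding the question above on human-readable file sizes: since listems has no such option yet, byte counts such as those shown with --long can be converted afterwards. The following is a minimal Python sketch, not part of packems. The function name `human_readable` is hypothetical, and the use of decimal units (1 KB = 1000 bytes) is an assumption made for this example.

```python
def human_readable(num_bytes: int) -> str:
    """Convert a byte count into a short string using decimal units
    (assumption: 1 KB = 1000 bytes, 1 TB = 10**12 bytes)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000:
            # Plain bytes are shown without a fractional part.
            return f"{value:.0f} {unit}" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1000
    # Anything that is still >= 1000 TB is reported in petabytes.
    return f"{value:.1f} PB"


if __name__ == "__main__":
    for size in (512, 1500, 2 * 10**12):
        print(size, "->", human_readable(size))
```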
