
DKRZ CMIP Data Pool (DKRZ CDP)

Motivation

The DKRZ CMIP data pool (DKRZ CDP) provides frequently needed collections of climate model data for climate model intercomparison and evaluation projects. The pool is hosted as part of the DKRZ data infrastructure and supports user groups in collecting, accessing and processing large volumes of climate data. The hosted data concentrates on model data generated as part of larger climate model intercomparison projects, e.g. CMIP and CORDEX.

Furthermore, DKRZ supports IPCC authors by making the DKRZ CDP and the in-house technical infrastructure available within the framework of the DDC support activity. DDC support consists of the IPCC DDC, in the form of the local IPCC DDC activities, and the IPCC Working Group Technical Support Units (WG TSUs).

Registration

  • To access the pool data, users need to be registered at DKRZ; see "Register a new website account."

  • After registration, users need to join specific groups, which provide them with personal storage and compute resources. Examples of such groups are:

    • DKRZ_MIP_POOL_Analysis:  to support IPCC related data analysis activities

    • ECAS: to support data analysis activities as part of the EOSCHUB project

    • DICAD: to support data analysis activities in the German DICAD consortium

    • IS-ENES: (starting early 2019) European service activity to support data analysis activities

  • Registered users can contact [Email protection active, please enable JavaScript.] to be assigned to a specific group, which allows them to use different types of resources (e.g. interactive login nodes, virtual machines, HPC batch compute).

Data pool content and replication of data available at external ESGF nodes

The overall data pool provides space for around 5 PBytes of data, stored on the DKRZ HPC Lustre file system. A dedicated board decides on the prioritization of data storage based on end-user requirements as well as international agreements (e.g. with respect to data replication). Based on current estimates, around 2 PBytes are reserved for replicated CMIP data from other ESGF data nodes around the world. Around 100 TBytes are currently reserved for storing derived data products.

If ESGF CMIP(6) data that is available at other ESGF nodes is missing from the pool, you can request its replication by contacting [Email protection active, please enable JavaScript.]. The data collection you want replicated and made locally accessible is specified using synda selection files ( http://prodiguer.github.io/synda/ ).
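
As a rough illustration of what such a request might contain, the following Python sketch renders a small selection file. It assumes the `facet=value` line format described in the synda documentation and uses standard CMIP6 ESGF facet names. The helper `build_selection` and the concrete facet values are hypothetical, and the synda documentation remains the authoritative reference for the syntax.

```python
def build_selection(facets):
    """Render facets as 'name=value ...' lines, one facet per line.

    Assumption: a synda selection file lists one facet per line, with
    multiple values separated by spaces. Check the synda documentation
    before sending a request.
    """
    lines = []
    for name, values in facets.items():
        if isinstance(values, str):
            values = [values]
        lines.append(f"{name}={' '.join(values)}")
    return "\n".join(lines) + "\n"


# Hypothetical request: monthly near-surface air temperature, two experiments.
example_facets = {
    "project": "CMIP6",
    "experiment_id": ["historical", "ssp585"],
    "table_id": "Amon",
    "variable_id": "tas",
}

if __name__ == "__main__":
    print(build_selection(example_facets), end="")
```

The narrower the facet selection, the less pool space a replication request consumes, which may matter when the board prioritizes storage.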

If you have requirements regarding the storage of derived data products or the inclusion of data sets that are not accessible via ESGF, please contact [Email protection active, please enable JavaScript.].

Data access from pool associated resources

The data pool is efficiently accessible from HPC resources as well as virtual machines:

  • HPC resources:

See https://www.dkrz.de/up/my-dkrz/getting-started/getting-started-at-dkrz for an introduction

  • Users can log in directly to the front-end nodes of the DKRZ HPC system (“login nodes”) and run interactive processing scripts. The data pool is directly mounted and accessible under local directory paths:

    • /work/kd0956 holds around 1.2 PBytes of CMIP5 data and CORDEX data (among others)

      • /work/kd0956/External provides access to ESGF-published data that is not part of large global intercomparison projects, e.g. the MPI-M Grand Ensemble dataset
    • /work/ik1017 will provide access to around 3 PBytes of CMIP6 data

  • Compute-intensive parallel data analysis is supported through the submission of batch compute jobs to the DKRZ HPC system. The pool data is accessible from all compute nodes. The need for such compute-intensive workloads must be specified and announced around half a year before the resources can be provisioned. The request is reviewed by a reviewer council, and the amount of CPU time and storage assigned depends on the council's vote.

  • Virtual machines: Users can request virtual machines with direct access to the data pool. Requests need to be sent to [Email protection active, please enable JavaScript.]. These requests are evaluated, and resources are granted based on the result of this evaluation. Users can then log in directly to these virtual machines and install and run their analysis codes.
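
On a login node or a virtual machine, a short script can confirm that the pool directories named above are visible before an analysis starts. The following Python sketch is one way to do this using only the standard library. The function name `check_pool_paths` is hypothetical, and the reported sizes refer to the whole file system behind each path, not to the pool share alone.

```python
import os
import shutil

# Pool directories named in this document.
POOL_PATHS = ["/work/kd0956", "/work/ik1017"]


def check_pool_paths(paths):
    """Return {path: status}, where status says whether the directory is readable."""
    report = {}
    for path in paths:
        if os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK):
            # disk_usage reports totals for the underlying file system.
            usage = shutil.disk_usage(path)
            report[path] = {
                "available": True,
                "total_bytes": usage.total,
                "free_bytes": usage.free,
            }
        else:
            report[path] = {"available": False}
    return report


if __name__ == "__main__":
    for path, status in check_pool_paths(POOL_PATHS).items():
        print(path, status)
```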

For an overview of the pre-installed libraries and tools please refer to

Data access from remote resources

Data from the data pool can be replicated to remote, e.g. institutional, resources using different methods:

Data search

There are different possibilities to search for data items in the data pool:

  • Search the POSIX directory tree directly using your preferred scripting or programming language (bash, python, …)

  • Use the search index on the login nodes

    • Log into the interactive nodes at DKRZ

    • module load cmip6-dicad/1.0

    • freva --databrowser --help

  • Use the web accessible search index:

  • Use the DKRZ ESGF portal
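
The first option, searching the directory tree directly, can be sketched in Python as follows. The function name `find_files` and the example pattern are hypothetical. The directory layout below the pool roots is not described here, so the pattern only assumes that file names begin with the variable and table name, as is conventional for CMIP NetCDF files.

```python
import fnmatch
import os


def find_files(root, pattern, limit=None):
    """Yield paths under root whose file name matches a shell-style pattern."""
    if limit is not None and limit <= 0:
        return
    found = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)
            found += 1
            if limit is not None and found >= limit:
                return  # stop walking once enough matches were produced


if __name__ == "__main__":
    # Example: the first ten files of monthly near-surface air temperature.
    for path in find_files("/work/ik1017", "tas_Amon_*.nc", limit=10):
        print(path)
```

Walking several PBytes of data can be slow, so it helps to start from the narrowest subdirectory possible and to set a `limit` while exploring.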

Technical information with respect to the DKRZ CMIP Data Pool

Data store:

  • A ~5 PByte disk pool, hosted at DKRZ as part of the DKRZ HPC Lustre storage system

Compute resources:

  • Dedicated compute servers  

  • Virtual machines

User support

 
