Long Term Archiving at DKRZ
Overview of the Long Term Archiving Service at DKRZ
The German Climate Computing Centre (DKRZ: Deutsches Klimarechenzentrum GmbH) provides a Long Term Archiving Service for large data sets that are relevant to climate or earth system research. The service covers archiving and retrieval of data for periods of 10 years or longer. The data are stored in a part of the High Performance Storage System (HPSS) dedicated to this service. A second copy of the data is kept for security reasons.
Data intended for long term archiving must meet the following requirements:
· a description of the data (metadata following the CERA2 data model) is included in the CERA database
· an agreement about access rules (according to the rules of WDC-Climate) is negotiated
· a contact person for all upcoming questions is appointed by the data provider
There are two main storage locations for data at DKRZ. During the lifetime of a project, most of the file-based project data is kept on disk for faster access. File data is normally stored in the HPSS (High Performance Storage System), whose storage media are tapes. Besides these two types of storage, data may also reside in the CERA database, which is the back-end store of the World Data Center for Climate (WDC-Climate). For more about the rules please refer to: Rules of the WDC-Climate
All data in the WDCC are described for search and download by metadata based on the CERA2 data model. The CERA database is the implemented 'data store' of the WDCC. If the data are stored in files, only pointers to the files are stored in the database. This is mostly the case for data that are completely freely available. If the data are inside the database, they are stored in containers. The container format is a self-designed storage format that allows finer-grained access rights.
Climate data are very often the output of climate models. These data are produced by modeling centers all over the world. The second, growing part of the data is observational data. They are the outcome of dedicated observational projects or the continuous output of observing platforms such as satellites, weather stations or others.
The model output data are normally organized as monthly files, with all variables in each file. Observational data are always time series with one or more variables from a dedicated station. When model data are stored in the WDCC container format, the data are preprocessed into time series on a dedicated level. Observational data can be grouped in space and time for better access granularity.
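The regrouping from monthly files into time series can be illustrated with a minimal sketch. It is not the DKRZ pre-processing tool chain: the record layout, the variable names (tas, pr) and the values are hypothetical, and a real model field would be a gridded array rather than a single number.

```python
from collections import defaultdict

# Hypothetical stand-in for monthly model output: one record per month,
# each holding every variable (names and values are illustrative only).
monthly_files = [
    {"month": "2000-02", "variables": {"tas": 272.0, "pr": 1.8}},
    {"month": "2000-01", "variables": {"tas": 271.3, "pr": 2.1}},
    {"month": "2000-03", "variables": {"tas": 275.4, "pr": 2.6}},
]


def to_time_series(files):
    """Regroup per-month records into one chronological series per variable."""
    series = defaultdict(list)
    # Sort by month so each series comes out in time order.
    for record in sorted(files, key=lambda r: r["month"]):
        for name, value in record["variables"].items():
            series[name].append((record["month"], value))
    return dict(series)


time_series = to_time_series(monthly_files)
print(time_series["tas"])
```

After the regrouping, a single variable can be read without touching the other variables of each month.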
The WDCC (World Data Center for Climate) is a central place where climate data are stored.
All data have in common that they must be described by metadata. These metadata are also held in the CERA database. Data can be searched or downloaded only if metadata are available.
Metadata creation for data providers is supported by a GUI: http://cera-www.dkrz.de/LTA_metadata
Metadata are public and can be obtained without any login from the CERA WWW Gateway.
To store model output data in the WDCC, it is recommended to pre-process the data into time series. The purpose is to offer the data for download at a finer granularity, so that scientists and researchers can fetch only what they need and the transfer size is minimized.
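The effect on transfer size can be estimated with a rough calculation. The sketch below assumes that all variables in a monthly file are of equal size; the number of months, the file size and the number of variables are hypothetical figures chosen only for illustration.

```python
def transfer_size_gb(n_months, file_size_gb, n_variables, wanted_variables,
                     as_time_series):
    """Estimate the download volume for a subset of variables.

    Assumes every variable takes an equal share of each monthly file.
    """
    if as_time_series:
        # Only the wanted variables are transferred.
        per_variable_gb = file_size_gb / n_variables
        return n_months * per_variable_gb * wanted_variables
    # File-based access: every monthly file is transferred in full.
    return n_months * file_size_gb


# Hypothetical case: 10 years of monthly files, 2 GB each, 50 variables,
# and a researcher who needs one variable.
whole_files = transfer_size_gb(120, 2.0, 50, 1, as_time_series=False)
single_series = transfer_size_gb(120, 2.0, 50, 1, as_time_series=True)
print(f"whole files: {whole_files:.1f} GB, one time series: {single_series:.1f} GB")
```

Under these assumptions the saving scales with the ratio of wanted variables to variables per file.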
Metadata and data are available through a convenient user interface: http://cera-www.dkrz.de/.
For downloads a login to the CERA database is needed.
The DKRZ provides a (Java-based) download tool (jblob, link to...) which is designed to download data with the help of a GUI (WDCC GUI) or directly from the command line.
The DKRZ department 'Data management' (formerly: Model & Data) guides and supports the long term archiving process, especially for:
- creating metadata in a format which can be easily inserted into the CERA database
- data pre-processing with respect to CERA data formats and the CERA2 data model
- data storing formats (files vs. database container) to fit best data access granularity
- quality assurance of the metadata and of the data itself
- supplying the data with a Digital Object Identifier (DOI; additional option with an extra fee)
The long term archiving service is open for:
1. DKRZ Users:
DKRZ is the host for climate community projects. Data can be archived during the project run-time or at the end of a project, when data shall remain available for scientific or public use
A "how to" is available here: How to use the LTA service for a DKRZ-User
2. External projects or users:
Users with external projects, provided the project is related to earth system research
A "how to" is available here: Access for external Users
Long term archiving is subject to fees for external projects.
The fees for an external project arise for:
1. Creating the metadata and including them in the CERA database and, in cooperation with the data provider, loading the data sets into the database system (personnel costs)
2. Costs for storage media (several tape generations)
3. Running costs for HSM system operation, database and internet access
4. Optional: assignment of Digital Object Identifiers (DOI)
An example cost estimate (in German only) can be found here.
DOI (Digital Object Identifier)
After describing, pre-processing and uploading the data, a last step in the archiving chain can be to assign a DOI to the data. For this, the quality of the data and of the metadata must undergo an additional check. After passing this examination, a DOI can be assigned to the data. This is also part of the LTA service offered by the DKRZ.
For the description of data the CERA2 Metadata model is used. It allows an extensive description of the data. The metadata (data description) are public and accessible without any login.
Data from the long term archive are available for download at no additional cost. The only restriction is that data access is controlled: it requires a user account for the database login (free of charge) and an access permission (usually 'public' access).
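The access rules described above can be summarized in a small sketch. This is an illustration only and does not reflect the actual CERA database schema or login mechanism: the User class, the function names and the 'restricted_project' permission label are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user record; not the CERA account model."""
    name: str
    logged_in: bool = False
    permissions: set = field(default_factory=set)


def can_read_metadata(user=None):
    # Metadata are public and need no login.
    return True


def can_download(user, dataset_access):
    # Downloads need a logged-in account plus a matching access permission.
    if user is None or not user.logged_in:
        return False
    return dataset_access == "public" or dataset_access in user.permissions


anonymous = None
member = User("example_user", logged_in=True)
print(can_read_metadata(anonymous), can_download(anonymous, "public"))
print(can_download(member, "public"), can_download(member, "restricted_project"))
```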
A work-flow overview picture is available here: work flow