Using the Pipeline

Setting up the Pipeline

Get the code

Clone the repository or download the source code of a specific release.
Create a virtual environment using python3.11 -m venv .venv
Always activate the virtual environment using source .venv/bin/activate
Install the dependencies using pip install ".[dev]" or pdm sync --group dev

💡

We highly recommend using PDM, a Python package manager, over pip. PDM uses a pdm.lock file next to the pyproject.toml file to pin the dependencies to specific versions. pdm sync installs the exact same dependencies on every system, so errors are more likely to be reproducible.

At our chair, we moved many projects from Poetry to PDM because Poetry uses its own pyproject.toml schema, in contrast to PDM, which uses PEP standards (PEP 508, PEP 517, PEP 621). Pip or any other tool can work with these PEP standards, but if Poetry has an issue, you cannot just use another tool. This happened to us a few times.

Configure it on your system

Install System Dependencies: e.g. sudo apt install unzip gfortran
Use the config/config.template.json to create a config.json file.
Set up the metadata locally or remotely (see metadata guide)
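
The second step above can be sketched in Python as follows. This is only a minimal illustration, not part of the pipeline: the helper name create_config_from_template is hypothetical, and the sketch assumes you want to keep an existing config.json untouched rather than overwrite it.

```python
import json
import shutil
from pathlib import Path


def create_config_from_template(template_path: Path, config_path: Path) -> bool:
    """Copy the template to config_path unless a config already exists.

    Hypothetical helper. Returns True if a new file was written and
    False if an existing config was kept as-is.
    """
    if config_path.exists():
        return False
    # Parse the template first so that a malformed template fails early
    json.loads(template_path.read_text())
    shutil.copyfile(template_path, config_path)
    return True
```

You could call it as create_config_from_template(Path("config/config.template.json"), Path("config/config.json")), assuming config.json belongs next to the template, and then edit the copied file by hand. The CLI still performs the actual validation of the result.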

Run the tests to ensure everything is working

Test the pipeline using pytest -m "integration or quick or ci" --verbose --exitfirst tests/
Test the actual retrieval using pytest -m "complete" --verbose --exitfirst tests/

Celebrate if the tests pass

Run curl parrot.live

Use the CLI to run the pipeline

The following sections describe how to trigger the different pipeline processes using the CLI. Whenever you call the CLI, it validates the integrity of your local config.json file and prints a message if your system is not correctly configured.

Downloading Atmospheric Profiles

This pipeline automates the way of obtaining atmospheric profiles described in https://tccon-wiki.caltech.edu/Main/ObtainingGinputData. Add the email address used to access the FTP server to config.profiles.server.email and configure the scope you want to download in config.profiles.scope.

Run the following command to download all the profiles:

python cli.py profiles run

This script will use the metadata and the configured local profiles directory to determine which profiles to request. Every time you call this script, it will request the profiles it has not already requested and check for the results of ongoing requests.

This process ensures that at most config.profiles.server.max_parallel_requests requests are running simultaneously. It only requests the same profiles again if they have not been generated within 24 hours. The script can download partial query results (e.g. if only days 1 to 5 of a 7-day request could be fulfilled).
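
The scheduling rule described above can be illustrated with a short sketch. It is not the pipeline's actual implementation: the ProfileQuery record and the select_queries_to_request function are hypothetical names, and the sketch assumes that a request counts as "running" while it is younger than 24 hours and not yet fulfilled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProfileQuery:
    """Hypothetical record of one profile request (not the pipeline's type)."""

    key: str
    requested_at: Optional[datetime] = None  # None: never requested
    completed: bool = False


def select_queries_to_request(
    queries: list[ProfileQuery],
    now: datetime,
    max_parallel_requests: int,
    retry_after: timedelta = timedelta(hours=24),
) -> list[ProfileQuery]:
    """Pick the queries to (re-)request without exceeding the parallel limit."""
    running = 0
    candidates = []
    for query in queries:
        if query.completed:
            continue
        if query.requested_at is not None and now - query.requested_at < retry_after:
            # Requested recently and still waiting: occupies one slot
            running += 1
        else:
            # Never requested, or not fulfilled within the retry window
            candidates.append(query)
    free_slots = max(0, max_parallel_requests - running)
    return candidates[:free_slots]
```

Calling the selection on every run, as the script does with its own bookkeeping, means that stale requests are retried and new ones are only submitted while free slots remain.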

You can use config.profiles.GGG2020_standard_sites to configure a list of standard sites you want to download. The script will never request profiles for these standard sites but only download the pre-generated data.

Run the following to request the current queue status of your account:

python cli.py profiles request-ginput-status

Running Retrievals

Use the following command to start the retrievals in a background process.

python cli.py retrieval start

You can limit the number of cores used by the retrieval process using config.retrievals.general.max_process_count.

Using the following commands, you can check whether the retrievals are still running and open a dashboard to monitor the progress.

python cli.py retrieval is-running
python cli.py retrieval watch

Terminate the ongoing retrievals using the following command:

python cli.py retrieval stop

Bundling All Retrieval Outputs

Bundle all the retrieval outputs using the following command:

python cli.py bundle run

Generating GEOMS compliant HDF5 files

The Generic Earth Observation Metadata Standard (GEOMS) is a standard for exchanging ground-based total-column concentration data. It uses the HDF5 file format and enforces a specific structure for the data (see guidelines from EVDC).

This pipeline can generate GEOMS compliant HDF5 files for the entire retrieval output. For this, it requires the files calibration_factors.json and geoms_metadata.json to be present in the configuration directory. You can use the example files and the API reference to create your own files. You also need to add a geoms section to the general config.json file that defines the scope for which you want to generate GEOMS files.

As always, the pipeline will tell you if any of your configuration files are invalid. Create the GEOMS files using the following command:

python cli.py geoms run

The logs will report which files have been generated:

Config is valid
Loading configuration
Loading calibration factors
Loading geoms metadata
Processing proffast-2.4/GGG2020
Processing sensor id "ma"
Sensor ma: found 1105 results in total
Sensor ma: found 173 results within the time range
  ma/20240501: Generated .../proffast-2.4/GGG2020/ma/successful/20240501/groundbased_ftir.coccon_tum.esm061_munich.tum_20240501t114016z_20240501t171933z_001.h5
  ma/20240502: Generated .../proffast-2.4/GGG2020/ma/successful/20240502/groundbased_ftir.coccon_tum.esm061_munich.tum_20240502t070418z_20240502t164544z_001.h5
  ma/20240504: Generated .../proffast-2.4/GGG2020/ma/successful/20240504/groundbased_ftir.coccon_tum.esm061_munich.tum_20240504t053204z_20240504t171926z_001.h5
  ma/20240505: Generated .../proffast-2.4/GGG2020/ma/successful/20240505/groundbased_ftir.coccon_tum.esm061_munich.tum_20240505t051558z_20240505t172421z_001.h5
  ma/20240506: Generated .../proffast-2.4/GGG2020/ma/successful/20240506/groundbased_ftir.coccon_tum.esm061_munich.tum_20240506t052157z_20240506t125438z_001.h5
  ma/20240507: Not enough data (less than 11 datapoints)
  ma/20240509: Generated .../proffast-2.4/GGG2020/ma/successful/20240509/groundbased_ftir.coccon_tum.esm061_munich.tum_20240509t072631z_20240509t152931z_001.h5
  ma/20240510: Generated .../proffast-2.4/GGG2020/ma/successful/20240510/groundbased_ftir.coccon_tum.esm061_munich.tum_20240510t061155z_20240510t083149z_001.h5
ma/20240511:  5%|██████                                                                                                            | 8/173 [00:19<03:07, 1.20s/it]
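
If you want to post-process such a log, for example to list the skipped days, a small parser along the following lines may help. It is a sketch that assumes the log lines look exactly like the sample above. The function name parse_geoms_log is hypothetical, and reading the two timestamps in the file name as start and stop times of the measurements is an assumption based on the sample file names.

```python
import re
from datetime import datetime

# Matches per-day lines such as "  ma/20240501: Generated .../file.h5"
LINE_PATTERN = re.compile(r"^\s*(?P<sensor>\w+)/(?P<date>\d{8}): (?P<message>\S.*)$")
# The two timestamps embedded in the file names of the sample log
TIME_PATTERN = re.compile(r"_(\d{8}t\d{6}z)_(\d{8}t\d{6}z)_")

SAMPLE_LOG = """\
Sensor ma: found 173 results within the time range
  ma/20240501: Generated .../proffast-2.4/GGG2020/ma/successful/20240501/groundbased_ftir.coccon_tum.esm061_munich.tum_20240501t114016z_20240501t171933z_001.h5
  ma/20240507: Not enough data (less than 11 datapoints)
"""


def parse_geoms_log(lines):
    """Split per-day log lines into generated files and skipped days."""
    generated, skipped = [], []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # header lines, progress bars, etc.
        sensor, date, message = match["sensor"], match["date"], match["message"]
        if message.startswith("Generated "):
            path = message.removeprefix("Generated ")
            start = stop = None
            times = TIME_PATTERN.search(path)
            if times is not None:
                start, stop = (
                    datetime.strptime(t, "%Y%m%dt%H%M%Sz") for t in times.groups()
                )
            generated.append(
                {"sensor": sensor, "date": date, "path": path, "start": start, "stop": stop}
            )
        elif message.startswith("Not enough data"):
            skipped.append((sensor, date, message))
    return generated, skipped


if __name__ == "__main__":
    files, skipped_days = parse_geoms_log(SAMPLE_LOG.splitlines())
    for entry in files:
        print(entry["date"], entry["start"], entry["stop"])
    for sensor, date, reason in skipped_days:
        print(f"skipped {sensor}/{date}: {reason}")
```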

You can verify the integrity of these HDF5 files using the AVDC's Quality Assurance Tool.

🐞

The code to generate these GEOMS compliant HDF5 files has been adapted from the PROFFASTpylot. The authors of the original code are Lena Feld, Benedikt Herkommer, and Darko Dubravica, affiliated with the Karlsruhe Institute of Technology (KIT).

Please make sure to cite not only this pipeline but also the PROFFASTpylot when using this feature.

Generate a Data Report

You can generate a report about the data on your system using the following command:

python cli.py data-report

This script will produce one CSV file per sensor ID in the directory data/reports/:

from_datetime,to_datetime,location_id,interferograms,ground_pressure,ggg2014_profiles,ggg2014_proffast_10_outputs,ggg2014_proffast_22_outputs,ggg2014_proffast_23_outputs,ggg2020_profiles,ggg2020_proffast_22_outputs,ggg2020_proffast_23_outputs
2023-09-07T00:00:00+0000,2023-09-07T23:59:59+0000,   TUM_I, 2224, 1440,✅,-,✅,✅,✅,-,✅
2023-09-08T00:00:00+0000,2023-09-08T23:59:59+0000,   TUM_I, 2178, 1440,✅,-,✅,✅,✅,-,✅
2023-09-09T00:00:00+0000,2023-09-09T23:59:59+0000,   TUM_I, 1966, 1440,✅,-,✅,✅,✅,-,✅
2023-09-10T00:00:00+0000,2023-09-10T23:59:59+0000,   TUM_I, 2034, 1440,✅,-,✅,✅,✅,-,✅
2023-09-11T00:00:00+0000,2023-09-11T23:59:59+0000,   TUM_I, 2122, 1440,✅,-,✅,✅,✅,-,✅
2023-09-12T00:00:00+0000,2023-09-12T23:59:59+0000,   TUM_I, 1972, 1440,✅,-,✅,✅,✅,-,✅
2023-09-13T00:00:00+0000,2023-09-13T23:59:59+0000,   TUM_I,  216, 1439,✅,-,✅,✅,✅,-,✅
2023-09-14T00:00:00+0000,2023-09-14T23:59:59+0000,   TUM_I,  762, 1440,✅,-,✅,✅,✅,-,✅
2023-09-15T00:00:00+0000,2023-09-15T23:59:59+0000,   TUM_I, 1507, 1440,✅,-,✅,✅,✅,-,✅
2023-09-16T00:00:00+0000,2023-09-16T23:59:59+0000,   TUM_I, 2232, 1440,✅,-,✅,✅,✅,-,✅
2023-09-17T00:00:00+0000,2023-09-17T23:59:59+0000,   TUM_I, 1599, 1440,✅,-,✅,✅,✅,-,✅
2023-09-18T00:00:00+0000,2023-09-18T23:59:59+0000,   TUM_I,  228, 1440,✅,-,✅,✅,✅,-,✅

The numbers in the columns "interferograms" and "ground_pressure" are the number of interferograms and the number of ground_pressure lines for the respective day.
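
The report files can be read with Python's csv module. The sketch below sums the interferograms and lists the days on which a profile or output column is not marked with ✅. It assumes that ✅ means "data present" and "-" means "not present", and it uses two rows of the sample above as inline data. The function name summarize_report is hypothetical.

```python
import csv
import io

SAMPLE_REPORT = """\
from_datetime,to_datetime,location_id,interferograms,ground_pressure,ggg2014_profiles,ggg2014_proffast_10_outputs,ggg2014_proffast_22_outputs,ggg2014_proffast_23_outputs,ggg2020_profiles,ggg2020_proffast_22_outputs,ggg2020_proffast_23_outputs
2023-09-07T00:00:00+0000,2023-09-07T23:59:59+0000,   TUM_I, 2224, 1440,✅,-,✅,✅,✅,-,✅
2023-09-13T00:00:00+0000,2023-09-13T23:59:59+0000,   TUM_I,  216, 1439,✅,-,✅,✅,✅,-,✅
"""


def summarize_report(file) -> dict:
    """Sum the interferograms and collect the days missing per column."""
    # skipinitialspace drops the padding in front of values like "   TUM_I"
    reader = csv.DictReader(file, skipinitialspace=True)
    total_interferograms = 0
    missing = {}
    for row in reader:
        total_interferograms += int(row["interferograms"])
        day = row["from_datetime"][:10]
        for column, value in row.items():
            is_status_column = column.endswith(("_profiles", "_outputs"))
            if is_status_column and value != "✅":
                missing.setdefault(column, []).append(day)
    return {"total_interferograms": total_interferograms, "missing": missing}


if __name__ == "__main__":
    summary = summarize_report(io.StringIO(SAMPLE_REPORT))
    print("interferograms:", summary["total_interferograms"])
    for column, days in summary["missing"].items():
        print(f"{column}: missing on {', '.join(days)}")
```

To run this on a real report, open one of the CSV files in data/reports/ with open(path, newline="", encoding="utf-8") and pass the file object instead of the inline sample.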

Metadata Automatic Test Suite