Using the Pipeline

Setting up the Pipeline

Get the code

  1. Clone the repository or download the source code of a specific release.
  2. Create a virtual environment using python3.11 -m venv .venv
  3. Always activate the virtual environment using source .venv/bin/activate
  4. Install the dependencies using pip install ".[dev]" or pdm sync --group dev
💡

We highly recommend using PDM - a Python package manager - over pip. PDM uses a pdm.lock file - next to the pyproject.toml file - to pin the dependencies to specific versions. pdm sync installs exactly the same dependencies on every system, which makes errors much more likely to be reproducible.

At our chair, we moved many projects from Poetry to PDM because Poetry uses its own pyproject.toml schema, whereas PDM follows the PEP standards (PEP 508, PEP 517, PEP 621). Pip or any other compliant tool can work with a pyproject.toml that follows these standards, but if Poetry has an issue, you cannot simply switch to another tool - this has happened to us a few times.

Configure it on your system

  1. Install System Dependencies: e.g. sudo apt install unzip gfortran
  2. Use the config/config.template.json to create a config.json file (a minimal loading sketch follows this list).
  3. Set up the metadata locally or remotely (see metadata guide)
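
Since the CLI validates config.json on every call (see below), you can also sanity-check the file yourself. The following is a minimal sketch, assuming config.json lives in the config/ directory next to the template; the dotted key paths mirror the names used later in this guide, and all values and output labels are illustrative.

import json

# Minimal sanity check: load config.json and look up two of the dotted
# paths referenced later in this guide. Assumes the file was created
# next to config/config.template.json.
with open("config/config.json") as f:
    config = json.load(f)

# config.profiles.server.email - email used to access the profile FTP server
print("profile server email:", config["profiles"]["server"]["email"])

# config.retrievals.general.max_process_count - CPU cores used by retrievals
print("max process count:", config["retrievals"]["general"]["max_process_count"])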

Run the tests to ensure everything is working

  1. Test the pipeline using pytest -m "integration or quick or ci" --verbose --exitfirst tests/
  2. Test the actual retrieval using pytest -m "complete" --verbose --exitfirst tests/

Celebrate if the tests pass

  1. Run curl parrot.live

Use the CLI to run the pipeline

The following sections describe how to trigger the different pipeline processes using the CLI. Whenever you call the CLI, it validates the integrity of your local config.json file and prints a message if your system is not configured correctly.

[Screenshot: CLI config validation output]

Downloading Atmospheric Profiles

This pipeline automates obtaining atmospheric profiles as described at https://tccon-wiki.caltech.edu/Main/ObtainingGinputData. Add the email address used to access the FTP server to config.profiles.server.email and configure the scope you want to download in config.profiles.scope.

Run the following command to download all the profiles:

python cli.py profiles run

This script will use the metadata and the configured local profiles directory to determine which profiles to request. Every time you call this script, it will request the profiles it has not already requested and check for the results of ongoing requests.

This process ensures that at most config.profiles.server.max_parallel_requests requests run simultaneously. It requests the same profiles again only if they have not been generated within 24 hours. The script can download partial query results (e.g. if only days 1 to 5 of a 7-day request could be fulfilled).

You can use config.profiles.GGG2020_standard_sites to configure a list of standard sites you want to download. The script will never request profiles for these standard sites but only download the pre-generated data.
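
Putting the profile-related options together, the profiles section of your config.json might look roughly like the sketch below (written as a Python dict so it can carry comments). Only the dotted key paths named above come from this guide; the fields inside scope and all concrete values are hypothetical placeholders - consult config/config.template.json for the exact schema.

# Hypothetical sketch of the "profiles" section of config.json; the
# exact schema is defined in config/config.template.json
profiles_config = {
    "server": {
        "email": "you@example.com",  # config.profiles.server.email
        "max_parallel_requests": 2,  # config.profiles.server.max_parallel_requests
    },
    # config.profiles.scope - what to download; the fields inside this
    # object are hypothetical placeholders
    "scope": {
        "from_date": "2023-09-07",
        "to_date": "2023-09-18",
    },
    # config.profiles.GGG2020_standard_sites - only pre-generated data is
    # downloaded for these sites; they are never requested
    "GGG2020_standard_sites": [],
}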

Run the following to request the current queue status of your account:

python cli.py profiles request-ginput-status

Running Retrievals

Use the following command to start the retrievals in a background process.

python cli.py retrieval start

You can limit the number of cores used by the retrieval process using config.retrievals.general.max_process_count.
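
For example, to cap the retrievals at four cores, the corresponding entry in config.json would be nested like this (a sketch; the value is an example and the full schema is in config/config.template.json):

# Sketch: nesting that corresponds to the dotted path
# config.retrievals.general.max_process_count
retrievals_config = {
    "general": {
        "max_process_count": 4,  # use at most 4 CPU cores
    },
}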

Using the following commands, you can check whether the retrievals are still running and open a dashboard to monitor the progress.

python cli.py retrieval is-running
python cli.py retrieval watch
[Screenshot: retrieval CLI watch dashboard]

Terminate the ongoing retrievals using the following command:

python cli.py retrieval stop

Bundling All Retrieval Outputs

Bundle all the retrieval outputs using the following command:

python cli.py bundle run

Generate a Data Report

You can generate a report about the data on your system using the following command:

python cli.py data-report

This script will produce one CSV file per sensor ID in the directory data/reports/:

from_datetime,to_datetime,location_id,interferograms,ground_pressure,ggg2014_profiles,ggg2014_proffast_10_outputs,ggg2014_proffast_22_outputs,ggg2014_proffast_23_outputs,ggg2020_profiles,ggg2020_proffast_22_outputs,ggg2020_proffast_23_outputs
2023-09-07T00:00:00+0000,2023-09-07T23:59:59+0000,   TUM_I, 2224, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-08T00:00:00+0000,2023-09-08T23:59:59+0000,   TUM_I, 2178, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-09T00:00:00+0000,2023-09-09T23:59:59+0000,   TUM_I, 1966, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-10T00:00:00+0000,2023-09-10T23:59:59+0000,   TUM_I, 2034, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-11T00:00:00+0000,2023-09-11T23:59:59+0000,   TUM_I, 2122, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-12T00:00:00+0000,2023-09-12T23:59:59+0000,   TUM_I, 1972, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-13T00:00:00+0000,2023-09-13T23:59:59+0000,   TUM_I,  216, 1439,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-14T00:00:00+0000,2023-09-14T23:59:59+0000,   TUM_I,  762, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-15T00:00:00+0000,2023-09-15T23:59:59+0000,   TUM_I, 1507, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-16T00:00:00+0000,2023-09-16T23:59:59+0000,   TUM_I, 2232, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-17T00:00:00+0000,2023-09-17T23:59:59+0000,   TUM_I, 1599, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…
2023-09-18T00:00:00+0000,2023-09-18T23:59:59+0000,   TUM_I,  228, 1440,βœ…,-,βœ…,βœ…,βœ…,-,βœ…

The "interferograms" and "ground_pressure" columns contain the number of interferograms and the number of ground pressure data lines recorded on the respective day.
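
Because the reports are plain CSV files, they are easy to post-process. The following is a minimal sketch that tallies the measured interferograms per location from one report; the filename ma.csv is a hypothetical example, since the pipeline writes one CSV per sensor ID into data/reports/.

import csv
from collections import defaultdict

# Sketch: sum the "interferograms" column of one sensor's data report,
# grouped by location. "ma.csv" is a hypothetical sensor report name.
counts: defaultdict[str, int] = defaultdict(int)
with open("data/reports/ma.csv") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        counts[row["location_id"]] += int(row["interferograms"])

for location_id, n in sorted(counts.items()):
    print(f"{location_id}: {n} interferograms")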