Using the Pipeline
Setting up the Pipeline
Get the code
- Clone the repository or download the source code of a specific release.
- Create a virtual environment using
python3.11 -m venv .venv
- Always activate the virtual environment using
source .venv/bin/activate
- Install the dependencies using
pip install ".[dev]"
or
pdm sync --group dev
We highly recommend using PDM, a Python package manager, over pip. PDM uses a pdm.lock
file next to the pyproject.toml
file to pin the dependencies to specific versions. pdm sync
installs the exact same dependencies on every system, which makes errors far more likely to be reproducible.
In our chair, we moved many projects from Poetry to PDM because Poetry uses its own pyproject.toml
schema, whereas PDM uses PEP standards (PEP 508, PEP 517, PEP 621). Pip or any other tool can work with these PEP standards, but if Poetry has an issue, you cannot simply switch to another tool. This happened to us a few times.
Configure it on your system
- Install System Dependencies: e.g.
sudo apt install unzip gfortran
- Use the
config/config.template.json
to create a config.json
file.
- Set up the metadata locally or remotely (see metadata guide)
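As a minimal sketch of the config step, the snippet below copies the template to a new config file without overwriting an existing one. The target location config/config.json is an assumption (the text above only names the template path), and the function name is hypothetical; adjust both to your setup. The copied file still has to be edited by hand.

```python
import shutil
from pathlib import Path


def create_config_from_template(
    template: Path = Path("config/config.template.json"),
    target: Path = Path("config/config.json"),  # assumed location, not confirmed by the docs
) -> bool:
    """Copy the template to `target`; return True if a new file was created."""
    if target.exists():
        # Never overwrite a config that may already have been edited.
        return False
    shutil.copyfile(template, target)
    return True
```

Calling create_config_from_template() from the repository root creates the file on the first run and leaves it untouched on later runs.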
Run the tests to ensure everything is working
- Test the pipeline using
pytest -m "integration or quick or ci" --verbose --exitfirst tests/
- Test the actual retrieval using
pytest -m "complete" --verbose --exitfirst tests/
Celebrate if the tests pass
- Run
curl parrot.live
Use the CLI to run the pipeline
The following sections describe how to trigger the different pipeline processes using the CLI. Whenever you call the CLI, it validates the integrity of your local config.json
file and prints a message if your system is not configured correctly.
Downloading Atmospheric Profiles
This pipeline automates the procedure for obtaining atmospheric profiles described in https://tccon-wiki.caltech.edu/Main/ObtainingGinputData. Add the email address used to access the FTP server to config.profiles.server.email
and configure the scope you want to download in config.profiles.scope.
Run the following command to download all the profiles:
python cli.py profiles run
This script will use the metadata and the configured local profiles directory to determine which profiles to request. Every time you call this script, it will request the profiles it has not already requested and check for the results of ongoing requests.
This process ensures that at most config.profiles.server.max_parallel_requests
requests are running simultaneously. It requests the same profiles again only if they have not been generated within 24 hours. The script can also download partial query results (e.g. if only days 1 to 5 of a 7-day request could be fulfilled).
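The throttling rules described above can be illustrated with a small, self-contained sketch. This is not the pipeline's actual implementation: the function name, the data shapes, and the way requests are tracked are all hypothetical. It only demonstrates the two rules, namely a cap on the number of parallel requests and a re-request only after 24 hours, under the simplifying assumption that a request younger than 24 hours still counts as running.

```python
from datetime import datetime, timedelta

RETRY_AFTER = timedelta(hours=24)  # re-request window stated in the text


def select_new_requests(
    missing: list[str],
    requested_at: dict[str, datetime],
    now: datetime,
    max_parallel_requests: int,
) -> list[str]:
    """Pick which missing profiles to request now (hypothetical helper).

    `missing` lists profile keys that are not on disk yet; `requested_at`
    maps already-requested keys to the time of their last request.
    """
    # Assumption: requests younger than 24 hours still count as running.
    running = [
        key
        for key in missing
        if key in requested_at and now - requested_at[key] < RETRY_AFTER
    ]
    free_slots = max(0, max_parallel_requests - len(running))
    # Candidates: never requested, or requested at least 24 hours ago.
    candidates = [key for key in missing if key not in running]
    return candidates[:free_slots]
```

With two allowed parallel requests, one request issued an hour ago, and one issued 30 hours ago, the function returns only the stale key, because the recent request still occupies a slot.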
You can use config.profiles.GGG2020_standard_sites
to configure a list of standard sites you want to download. The script will never request profiles for these standard sites but only download the pre-generated data.
Run the following to request the current queue status of your account:
python cli.py profiles request-ginput-status
Running Retrievals
Use the following command to start the retrievals in a background process.
python cli.py retrieval start
You can limit the number of cores used by the retrieval process using config.retrievals.general.max_process_count.
Using the following commands, you can check whether the retrievals are still running and open a dashboard to monitor the progress.
python cli.py retrieval is-running
python cli.py retrieval watch
Terminate the ongoing retrievals using the following command:
python cli.py retrieval stop
Bundling All Retrieval Outputs
Bundle all the retrieval outputs using the following command:
python cli.py bundle run
Generating GEOMS compliant HDF5 files
The Generic Earth Observation Metadata Standard (GEOMS) is a standard for exchanging ground-based total-column concentration data. It uses the HDF5 file format and enforces a specific structure for the data (see guidelines from EVDC).
This pipeline can generate GEOMS-compliant HDF5 files for the entire retrieval output. To do so, it requires the files calibration_factors.json
and geoms_metadata.json
to be present in the configuration directory. You can use the example files and the API reference
to create your own files. You also need to add a geoms
section to the general config.json
file that defines the scope for which you want to generate GEOMS files.
As always, the pipeline will tell you if any of your configuration files are invalid. Create the GEOMS files using the following command:
python cli.py geoms run
The logs report which files have been generated:
Config is valid
Loading configuration
Loading calibration factors
Loading geoms metadata
Processing proffast-2.4/GGG2020
Processing sensor id "ma"
Sensor ma: found 1105 results in total
Sensor ma: found 173 results within the time range
ma/20240501: Generated .../proffast-2.4/GGG2020/ma/successful/20240501/groundbased_ftir.coccon_tum.esm061_munich.tum_20240501t114016z_20240501t171933z_001.h5
ma/20240502: Generated .../proffast-2.4/GGG2020/ma/successful/20240502/groundbased_ftir.coccon_tum.esm061_munich.tum_20240502t070418z_20240502t164544z_001.h5
ma/20240504: Generated .../proffast-2.4/GGG2020/ma/successful/20240504/groundbased_ftir.coccon_tum.esm061_munich.tum_20240504t053204z_20240504t171926z_001.h5
ma/20240505: Generated .../proffast-2.4/GGG2020/ma/successful/20240505/groundbased_ftir.coccon_tum.esm061_munich.tum_20240505t051558z_20240505t172421z_001.h5
ma/20240506: Generated .../proffast-2.4/GGG2020/ma/successful/20240506/groundbased_ftir.coccon_tum.esm061_munich.tum_20240506t052157z_20240506t125438z_001.h5
ma/20240507: Not enough data (less than 11 datapoints)
ma/20240509: Generated .../proffast-2.4/GGG2020/ma/successful/20240509/groundbased_ftir.coccon_tum.esm061_munich.tum_20240509t072631z_20240509t152931z_001.h5
ma/20240510: Generated .../proffast-2.4/GGG2020/ma/successful/20240510/groundbased_ftir.coccon_tum.esm061_munich.tum_20240510t061155z_20240510t083149z_001.h5
ma/20240511: 5%|██████ | 8/173 [00:19<03:07, 1.20s/it]
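If you want to post-process such a log, for example to list the days that were skipped, the lines can be parsed with a short script. The sketch below assumes that the log format is exactly the one shown above (sensor ID, slash, date, colon, message); the function name is hypothetical and not part of the pipeline.

```python
import re

# Matches lines such as "ma/20240507: Not enough data (less than 11 datapoints)".
LINE_PATTERN = re.compile(r"^(?P<sensor>\w+)/(?P<date>\d{8}): (?P<message>.+)$")


def summarize_geoms_log(log_text: str) -> dict[str, list[str]]:
    """Group the processed days into 'generated' and 'skipped' (hypothetical helper)."""
    summary: dict[str, list[str]] = {"generated": [], "skipped": []}
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # header lines such as "Config is valid"
        day = f"{match['sensor']}/{match['date']}"
        if match["message"].startswith("Generated "):
            summary["generated"].append(day)
        elif match["message"].startswith("Not enough data"):
            summary["skipped"].append(day)
        # Anything else (e.g. the progress bar line) is ignored.
    return summary
```

Applied to the log above, the "skipped" list would contain ma/20240507, the only day reported as having too few datapoints.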
You can verify the integrity of these HDF5 files using the AVDC's Quality Assurance Tool.
The code that generates these GEOMS-compliant HDF5 files has been adapted from the PROFFASTpylot. The authors of the original code are Lena Feld, Benedikt Herkommer, and Darko Dubravica, affiliated with the Karlsruhe Institute of Technology (KIT).
Please make sure to cite not only this pipeline but also the PROFFASTpylot when using this feature.
Generate a Data Report
You can generate a report about the data on your system using the following command:
python cli.py data-report
This script produces one CSV file per sensor ID in the directory data/reports/:
from_datetime,to_datetime,location_id,interferograms,ground_pressure,ggg2014_profiles,ggg2014_proffast_10_outputs,ggg2014_proffast_22_outputs,ggg2014_proffast_23_outputs,ggg2020_profiles,ggg2020_proffast_22_outputs,ggg2020_proffast_23_outputs
2023-09-07T00:00:00+0000,2023-09-07T23:59:59+0000, TUM_I, 2224, 1440,✅,-,✅,✅,✅,-,✅
2023-09-08T00:00:00+0000,2023-09-08T23:59:59+0000, TUM_I, 2178, 1440,✅,-,✅,✅,✅,-,✅
2023-09-09T00:00:00+0000,2023-09-09T23:59:59+0000, TUM_I, 1966, 1440,✅,-,✅,✅,✅,-,✅
2023-09-10T00:00:00+0000,2023-09-10T23:59:59+0000, TUM_I, 2034, 1440,✅,-,✅,✅,✅,-,✅
2023-09-11T00:00:00+0000,2023-09-11T23:59:59+0000, TUM_I, 2122, 1440,✅,-,✅,✅,✅,-,✅
2023-09-12T00:00:00+0000,2023-09-12T23:59:59+0000, TUM_I, 1972, 1440,✅,-,✅,✅,✅,-,✅
2023-09-13T00:00:00+0000,2023-09-13T23:59:59+0000, TUM_I, 216, 1439,✅,-,✅,✅,✅,-,✅
2023-09-14T00:00:00+0000,2023-09-14T23:59:59+0000, TUM_I, 762, 1440,✅,-,✅,✅,✅,-,✅
2023-09-15T00:00:00+0000,2023-09-15T23:59:59+0000, TUM_I, 1507, 1440,✅,-,✅,✅,✅,-,✅
2023-09-16T00:00:00+0000,2023-09-16T23:59:59+0000, TUM_I, 2232, 1440,✅,-,✅,✅,✅,-,✅
2023-09-17T00:00:00+0000,2023-09-17T23:59:59+0000, TUM_I, 1599, 1440,✅,-,✅,✅,✅,-,✅
2023-09-18T00:00:00+0000,2023-09-18T23:59:59+0000, TUM_I, 228, 1440,✅,-,✅,✅,✅,-,✅
The numbers in the columns "interferograms" and "ground_pressure" are the number of interferograms and the number of ground_pressure lines for the respective day.
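A report like this can be summarised with Python's standard csv module. The sketch below is only an illustration: it parses an inline excerpt reduced to the first five columns rather than a real file, because the exact file names inside data/reports/ are not stated here, and the function name is hypothetical. The option skipinitialspace=True handles the space that follows some of the commas.

```python
import csv
import io

# Excerpt of a report, reduced to the first five columns for brevity.
SAMPLE_REPORT = """\
from_datetime,to_datetime,location_id,interferograms,ground_pressure
2023-09-07T00:00:00+0000,2023-09-07T23:59:59+0000, TUM_I, 2224, 1440
2023-09-13T00:00:00+0000,2023-09-13T23:59:59+0000, TUM_I, 216, 1439
"""


def days_with_few_interferograms(report_csv: str, threshold: int) -> list[str]:
    """Return the dates whose interferogram count is below `threshold`."""
    reader = csv.DictReader(io.StringIO(report_csv), skipinitialspace=True)
    return [
        row["from_datetime"][:10]  # keep only the YYYY-MM-DD part
        for row in reader
        if int(row["interferograms"]) < threshold
    ]


print(days_with_few_interferograms(SAMPLE_REPORT, threshold=1000))
# prints ['2023-09-13']
```

To use this with a real report, read the file's text (for example with pathlib.Path.read_text) and pass it to the function in place of SAMPLE_REPORT.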