Directories
Neither the pipeline nor any of the retrieval algorithms modifies any data in the directories config.general.data.ground_pressure.path, config.general.data.atmospheric_profiles, and config.general.data.interferograms. The retrieval only writes to config.general.data.results.
Interferograms
Point the config.general.data.interferograms variable to the directory your interferograms are stored in.
In this example, ma, mb, and so on are the “sensor ids” used by us (see the next section about metadata).
You must set the config.retrieval.general.ifg_file_regex value to a regex matching your files. In the example above, we can use ^$(SENSOR_ID)$(DATE).*ifg.\\d+$.
Ground Pressure Files
Point the config.general.data.ground_pressure.path variable to the directory where you store your local meteorological files, i.e. your ground pressure logs.
You can fully specify the naming schema of these pressure files. For instance, the example above matches the naming schema ^ground-pressure-$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).csv$. When looking for ground pressure files for a certain sensor id and date, the pipeline replaces the placeholders with the actual values. Available placeholders are $(SENSOR_ID), $(YYYY), $(YY), $(MM), and $(DD).
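The placeholder replacement described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name build_pressure_file_regex is hypothetical, and the sketch assumes that the placeholders are replaced by plain string substitution with zero-padded values before the pattern is compiled.

```python
import re
from datetime import date


def build_pressure_file_regex(pattern: str, sensor_id: str, day: date) -> "re.Pattern[str]":
    """Fill the placeholders of a naming schema for one sensor id and date.

    Hypothetical helper: the zero-padded formatting is an assumption.
    """
    replacements = {
        "$(SENSOR_ID)": re.escape(sensor_id),
        "$(YYYY)": f"{day.year:04d}",
        "$(YY)": f"{day.year % 100:02d}",
        "$(MM)": f"{day.month:02d}",
        "$(DD)": f"{day.day:02d}",
    }
    for placeholder, value in replacements.items():
        pattern = pattern.replace(placeholder, value)
    return re.compile(pattern)


schema = "^ground-pressure-$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).csv$"
regex = build_pressure_file_regex(schema, "ma", date(2022, 6, 2))

# keep only the file names that belong to sensor "ma" on 2022-06-02
candidates = ["ground-pressure-ma-2022-06-02.csv", "ground-pressure-mb-2022-06-02.csv"]
matching = [name for name in candidates if regex.match(name)]
```

Note that the dot in .csv is not escaped in the schema, so it matches any single character; escape it in your own schema if that matters for your file names.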
You also need to specify which columns contain the pressure and datetime data:
- Pressure column: Pass the name of the column with config.general.data.ground_pressure.pressure_column and specify the unit (either hPa or Pa) with config.general.data.ground_pressure.pressure_column_format.
- Datetime column: Pass exactly one of the following:
  - A datetime column containing both date and time information: datetime_column and datetime_column_format
  - One date column and one time column: date_column, date_column_format, time_column, and time_column_format
  - A column containing a Unix timestamp: unix_timestamp_column and unix_timestamp_column_format (format s, ms, us, or ns)
All ground pressure timestamps are assumed to be in UTC!
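To illustrate how the column options fit together, here is a minimal sketch that reads a pressure file with any of the three datetime variants. It is not the pipeline's implementation: the function read_pressure_rows is hypothetical, it takes the ground pressure options as a plain dictionary, and it assumes that Pa values are converted to hPa by dividing by 100.

```python
import csv
import io
from datetime import datetime, timezone

# divisor that converts each Unix timestamp unit to seconds
UNIX_DIVISORS = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def read_pressure_rows(csv_text: str, cfg: dict) -> list:
    """Return (UTC datetime, pressure in hPa) tuples for one pressure file."""
    options_set = [
        "datetime_column" in cfg,
        "date_column" in cfg and "time_column" in cfg,
        "unix_timestamp_column" in cfg,
    ]
    if sum(options_set) != 1:
        raise ValueError("configure exactly one datetime option")

    # assumption: pressures given in Pa are converted to hPa
    divisor = 100.0 if cfg["pressure_column_format"] == "Pa" else 1.0

    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if "datetime_column" in cfg:
            dt = datetime.strptime(row[cfg["datetime_column"]], cfg["datetime_column_format"])
        elif "date_column" in cfg:
            dt = datetime.strptime(
                row[cfg["date_column"]] + " " + row[cfg["time_column"]],
                cfg["date_column_format"] + " " + cfg["time_column_format"],
            )
        else:
            unit = UNIX_DIVISORS[cfg["unix_timestamp_column_format"]]
            dt = datetime.fromtimestamp(int(row[cfg["unix_timestamp_column"]]) / unit, tz=timezone.utc)
        # all timestamps are interpreted as UTC
        dt = dt.replace(tzinfo=timezone.utc)
        rows.append((dt, float(row[cfg["pressure_column"]]) / divisor))
    return rows


example_cfg = {
    "pressure_column": "pressure",
    "pressure_column_format": "hPa",
    "datetime_column": "utc-datetime",
    "datetime_column_format": "%Y-%m-%dT%H:%M:%S",
}
example_csv = "pressure,utc-datetime\n997.05,2022-06-02T00:00:49\n"
example_rows = read_pressure_rows(example_csv, example_cfg)
```

Raising an error when zero or several datetime options are set mirrors the "exactly one" requirement and avoids silently picking one of them.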
Examples
The following examples all use the pressure column config:
{
  "pressure_column": "pressure",
  "pressure_column_format": "hPa"
}
Example 1
Datetime column config:
{
  "date_column": "utc-date",
  "date_column_format": "%Y-%m-%d",
  "time_column": "utc-time",
  "time_column_format": "%H:%M:%S"
}
Corresponding pressure files:
pressure,utc-date,utc-time
997.05,2022-06-02,00:00:49
997.06,2022-06-02,00:01:49
997.06,2022-06-02,00:02:49
Example 2
Datetime column config:
{
  "datetime_column": "utc-datetime",
  "datetime_column_format": "%Y-%m-%dT%H:%M:%S"
}
Corresponding pressure files:
pressure,utc-datetime
997.05,2022-06-02T00:00:49
997.06,2022-06-02T00:01:49
997.06,2022-06-02T00:02:49
Example 3
Datetime column config:
{
  "unix_timestamp_column": "unix-timestamp",
  "unix_timestamp_column_format": "s"
}
Corresponding pressure files:
pressure,unix-timestamp
997.05,1654128049
997.06,1654128109
997.06,1654128169
Atmospheric Profiles
Point the config.general.data.atmospheric_profiles variable to the directory in which you want to store the atmospheric profiles.
Results
The pipeline populates the results directory in the following way:
The about.json file in each successful retrieval directory contains all information required to reproduce the respective retrieval results. The directories in failed/ and successful/ share the same structure: the outputs are moved to successful/ if the retrieval has produced a final CSV file and to failed/ otherwise.
Bundles
With config.bundles, you can specify a list of bundles to produce from the raw retrieval results. The script will generate one bundle per sensor, retrieval algorithm, and atmospheric profile. For example, when using the following bundle config:
{
  "dst_dir": "/some/path/where/the/bundle_should/be/written/to",
  "output_formats": ["csv", "parquet"],
  "from_datetime": "2024-05-10T00:00:00+0000",
  "to_datetime": "2024-07-09T23:59:59+0000",
  "retrieval_algorithms": ["proffast-2.2", "proffast-2.4"],
  "atmospheric_profile_models": ["GGG2020"],
  "sensor_ids": ["ma", "mb"]
}
… the following bundles will be generated in the output directory /some/path/where/the/bundle_should/be/written/to:
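Because one bundle is generated per sensor, retrieval algorithm, and atmospheric profile, the combinations covered by this config can be enumerated with a short sketch. The helper bundle_combinations is hypothetical, and the sketch deliberately does not derive file names; it only assumes that every combination is written once per entry in output_formats.

```python
from itertools import product

bundle_config = {
    "output_formats": ["csv", "parquet"],
    "retrieval_algorithms": ["proffast-2.2", "proffast-2.4"],
    "atmospheric_profile_models": ["GGG2020"],
    "sensor_ids": ["ma", "mb"],
}


def bundle_combinations(cfg: dict) -> list:
    """List every (sensor id, retrieval algorithm, profile model) combination."""
    return list(
        product(
            cfg["sensor_ids"],
            cfg["retrieval_algorithms"],
            cfg["atmospheric_profile_models"],
        )
    )


combinations = bundle_combinations(bundle_config)
for sensor_id, algorithm, model in combinations:
    print(sensor_id, algorithm, model, bundle_config["output_formats"])
```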
The output files include all data from the <config.general.data.results> path that falls within the time period. The CSV and Parquet files contain the same data in two different tabular formats. They keep all columns from the raw retrieval algorithm output and add four more columns, utc, retrieval_time, location_id, and campaign_ids:
- utc: parsed from the UTC/HHMMSS_ID columns to have a consistent timestamp format
- retrieval_time: the timestamp when the retrieval was finished
- location_id: the location ID of the sensor at that time
- campaign_ids: the campaign IDs that match this datapoint, separated by a + sign
Proffast 1.0 bundle example:
utc,HHMMSS_ID,ground_pressure,lat,lon,alt,sza,azi,XH2O,XAIR,XCO2,XCH4,XCH4_S5P,XCO,retrieval_time,location_id,campaign_ids
2022-06-02T05:13:49.000000+0000,51349.0,998.2,48.148,16.438,180.0,70.1,-101.45,3316.9,1.00387,418.077,1.8772,0.0,0.0,2024-09-11T22:48:42.000000+0000,ZEN,both+only-mc
2022-06-02T05:14:04.000000+0000,51404.0,998.2,48.148,16.438,180.0,70.06,-101.41,3317.72,1.00343,417.989,1.87669,0.0,0.0,2024-09-11T22:48:42.000000+0000,ZEN,both+only-mc
2022-06-02T05:14:19.000000+0000,51419.0,998.2,48.148,16.438,180.0,70.02,-101.37,3317.16,1.00421,417.361,1.87585,0.0,0.0,2024-09-11T22:48:42.000000+0000,ZEN,both+only-mc
...
Proffast 2.4 bundle example:
utc,spectrum,ground_pressure,lat,lon,alt,sza,azi,XH2O,XAIR,XCO2,XCH4,XCO2_STR,XCO,XCH4_S5P,H2O,O2,CO2,CH4,CO,CH4_S5P,retrieval_time,location_id,campaign_ids
2022-06-02T05:13:55.000000+0000,220602_051349SN.BIN,998.2,48.148,16.438,180.0,70.1,-101.45,3435.8,0.998586,420.051,1.88495,0.0,0.0,0.0,7.24389e26,4.46289e28,8.89103e25,4.01976e23,0.0,0.0,2024-09-11T22:50:05.000000+0000,ZEN,both+only-mc
2022-06-02T05:14:09.000000+0000,220602_051404SN.BIN,998.19,48.148,16.438,180.0,70.06,-101.41,3436.61,0.998166,419.96,1.88445,0.0,0.0,0.0,7.24253e26,4.46095e28,8.88534e25,4.01701e23,0.0,0.0,2024-09-11T22:50:05.000000+0000,ZEN,both+only-mc
2022-06-02T05:14:24.000000+0000,220602_051419SN.BIN,998.19,48.148,16.438,180.0,70.02,-101.37,3435.96,0.998954,419.327,1.88353,0.0,0.0,0.0,7.24683e26,4.46442e28,8.87892e25,4.01823e23,0.0,0.0,2024-09-11T22:50:05.000000+0000,ZEN,both+only-mc
...
Filtering by campaign ID can be done with one line of code:
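import polars as pl

# placeholder path: replace it with the path to one of your Parquet bundles
df = pl.read_parquet("path/to/bundle.parquet")
# campaign_ids is "+"-separated, so split it before checking for membership
df = df.filter(pl.col("campaign_ids").str.split("+").list.contains("muccnet"))

For example, our MUCCnet campaign config looks like this: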
{
  "campaign_id": "muccnet",
  "from_datetime": "2019-09-13T00:00:00+0000",
  "to_datetime": "2100-01-01T23:59:59+0000",
  "sensor_ids": ["ma", "mb", "mc", "md", "me"],
  "location_ids": ["TUM_I", "FEL", "GRAE", "OBE", "TAU", "DLR_2", "DLR_3"]
}
With this config, the dataframe filtered by the campaign ID muccnet will only contain data that was generated between 2019-09-13 and 2100-01-01 by the sensors ma, mb, mc, md, and me at the locations TUM_I, FEL, GRAE, OBE, TAU, DLR_2, and DLR_3.
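The matching rule described above can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's code: the function matching_campaign_ids is hypothetical, and it assumes a datapoint belongs to a campaign when its timestamp lies within the campaign period (bounds included) and both its sensor id and location id are listed.

```python
from datetime import datetime

DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def matching_campaign_ids(campaigns: list, utc: datetime, sensor_id: str, location_id: str) -> str:
    """Join the ids of all campaigns that a datapoint belongs to with "+"."""
    ids = []
    for campaign in campaigns:
        start = datetime.strptime(campaign["from_datetime"], DATETIME_FORMAT)
        end = datetime.strptime(campaign["to_datetime"], DATETIME_FORMAT)
        if (
            start <= utc <= end
            and sensor_id in campaign["sensor_ids"]
            and location_id in campaign["location_ids"]
        ):
            ids.append(campaign["campaign_id"])
    return "+".join(ids)


muccnet = {
    "campaign_id": "muccnet",
    "from_datetime": "2019-09-13T00:00:00+0000",
    "to_datetime": "2100-01-01T23:59:59+0000",
    "sensor_ids": ["ma", "mb", "mc", "md", "me"],
    "location_ids": ["TUM_I", "FEL", "GRAE", "OBE", "TAU", "DLR_2", "DLR_3"],
}
timestamp = datetime.strptime("2022-06-02T05:13:49+0000", DATETIME_FORMAT)
campaign_ids = matching_campaign_ids([muccnet], timestamp, "ma", "TUM_I")
```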
Logs
The logs are stored within the directory of the pipeline at data/logs:
The files are either from containers (startingdate-startingtime_containername.log) or from the main process (startingdate-startingtime_main.log), which orchestrates the containers.
Internal
Containers
The containers in which the retrieval runs work in data/containers. Each container, with a name like eloquent-oppenheimer, has three active directories: data/containers/retrieval-container-$containername, data/containers/retrieval-container-$containername-input, and data/containers/retrieval-container-$containername-output.
You can change data/containers to some other path (maybe a high-speed local storage device) using config.retrieval.general.container_dir.
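As a small illustration of this layout, the three directories of one container can be derived from its name. The helper container_dirs is hypothetical; its container_dir argument stands in for the value of config.retrieval.general.container_dir and defaults to data/containers.

```python
from pathlib import Path


def container_dirs(container_name: str, container_dir: Path = Path("data/containers")) -> tuple:
    """Return the main, input, and output directory of one retrieval container."""
    main = container_dir / f"retrieval-container-{container_name}"
    return (
        main,
        main.with_name(main.name + "-input"),
        main.with_name(main.name + "-output"),
    )


# default location inside the pipeline directory
main_dir, input_dir, output_dir = container_dirs("eloquent-oppenheimer")

# hypothetical fast local storage path used instead of data/containers
fast_dirs = container_dirs("eloquent-oppenheimer", Path("/mnt/fast-storage/containers"))
```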
Profiles Query Cache
The profiles downloader uses the file data/profiles_query_cache.json to save the information on which profiles have already been requested. Profiles will only be re-requested if they have not been produced within 24 hours.
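The re-request rule can be expressed as a small decision function. This is only a sketch of the rule as stated above: the function should_request and its arguments are hypothetical, and the actual structure of data/profiles_query_cache.json is not reproduced here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# waiting period before a profile is requested again
REQUEST_TIMEOUT = timedelta(hours=24)


def should_request(profile_exists: bool, last_requested_at: Optional[datetime], now: datetime) -> bool:
    """Decide whether a profile has to be requested (again)."""
    if profile_exists:
        return False  # already produced, nothing to request
    if last_requested_at is None:
        return True  # never requested before
    # requested earlier but still missing: wait 24 hours before asking again
    return now - last_requested_at >= REQUEST_TIMEOUT


now = datetime(2024, 9, 11, 12, 0, tzinfo=timezone.utc)
recently_requested = should_request(False, now - timedelta(hours=3), now)
long_overdue = should_request(False, now - timedelta(hours=30), now)
```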