Configuration

Stored in the `config/config.json` file.

Reference

root (object)

A pydantic model describing the config file schema.

version * (constant)

Version of the retrieval pipeline which is compatible with this config file. Retrievals done with any version `1.x` will produce the same output files as retrievals done with version `1.0`. But higher version numbers might use a different config file structure and produce more output files.

general * (object)

metadata (union)

If not set, the pipeline will use local metadata files or abort if the local files are not found. If local files are found, they will always be preferred over the remote data even if the remote source is configured.

Default: null

Options:

#1 (object)

GitHub repository where the location data is stored.

github_repository * (string)

GitHub repository name, e.g. `my-org/my-repo`.

Regex Pattern: "^[a-z0-9-_]+/[a-z0-9-_]+$"

access_token (union)

GitHub access token with read access to the repository; only required if the repository is private.

Default: null

Options:

#1 (string)

Min. Length: 1

#2 (null)

#2 (null)

data * (object)

Location where the input data is sourced from.

ground_pressure * (object)

Directory path and format configuration of the ground pressure files.

path * (string)

Directory path to ground pressure files.

file_regex * (string)

A regex string to match the ground pressure file names. In this string, you can use the placeholders `$(SENSOR_ID)`, `$(YYYY)`, `$(YY)`, `$(MM)`, and `$(DD)` to make this regex target a certain station and date. The placeholder `$(DATE)` is a shortcut for `$(YYYY)$(MM)$(DD)`.

Min. Length: 1

Examples: [ "^$(DATE).tsv$", "^$(SENSOR_ID)_$(DATE).dat$", "^ground-pressure-$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).csv$" ]
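The placeholder expansion can be illustrated with a short Python sketch. The function name and implementation here are assumptions for illustration, not the pipeline's actual API:

```python
import re

# Illustrative sketch of the placeholder expansion - the function name and
# substitution logic are assumptions, not the pipeline's actual code.
def build_filename_regex(
    template: str, sensor_id: str, year: int, month: int, day: int
) -> re.Pattern:
    """Expand the $(...) placeholders into a concrete regex for one sensor-day."""
    substitutions = {
        "$(DATE)": f"{year:04d}{month:02d}{day:02d}",
        "$(SENSOR_ID)": sensor_id,
        "$(YYYY)": f"{year:04d}",
        "$(YY)": f"{year % 100:02d}",
        "$(MM)": f"{month:02d}",
        "$(DD)": f"{day:02d}",
    }
    for placeholder, value in substitutions.items():
        template = template.replace(placeholder, value)
    return re.compile(template)

pattern = build_filename_regex(
    "^ground-pressure-$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).csv$", "ma", 2022, 1, 5
)
print(bool(pattern.match("ground-pressure-ma-2022-01-05.csv")))  # True
```

Each sensor-day thus gets its own concrete regex, so a single template can target every station and date in the storage directory.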

separator * (string)

Separator used in the ground pressure files. Only needed and used if the file format is `text`.

Min. Length: 1

Max. Length: 1

Examples: [ ",", "\t", " ", ";" ]

datetime_column (union)

Column name in the ground pressure files that contains the datetime.

Default: null

Examples: [ "datetime", "dt", "utc-datetime" ]

Options:

#1 (string)

#2 (null)

datetime_column_format (union)

Format of the datetime column in the ground pressure files.

Default: null

Examples: [ "%Y-%m-%dT%H:%M:%S" ]

Options:

#1 (string)

#2 (null)

date_column (union)

Column name in the ground pressure files that contains the date.

Default: null

Examples: [ "date", "d", "utc-date" ]

Options:

#1 (string)

#2 (null)

date_column_format (union)

Format of the date column in the ground pressure files.

Default: null

Examples: [ "%Y-%m-%d", "%Y%m%d", "%d.%m.%Y" ]

Options:

#1 (string)

#2 (null)

time_column (union)

Column name in the ground pressure files that contains the time.

Default: null

Examples: [ "time", "t", "utc-time" ]

Options:

#1 (string)

#2 (null)

time_column_format (union)

Format of the time column in the ground pressure files.

Default: null

Examples: [ "%H:%M:%S", "%H:%M", "%H%M%S" ]

Options:

#1 (string)

#2 (null)

unix_timestamp_column (union)

Column name in the ground pressure files that contains the unix timestamp.

Default: null

Examples: [ "unix-timestamp", "timestamp", "ts" ]

Options:

#1 (string)

#2 (null)

unix_timestamp_column_format (union)

Format of the unix timestamp column in the ground pressure files, i.e. whether the Unix timestamp is in seconds, milliseconds, etc.

Default: null

Options:

#1 (string)

Allowed values: [ "s", "ms", "us", "ns" ]

#2 (null)
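The same instant encoded in different units resolves as follows. This is a stdlib sketch of what the unit values mean; the conversion helper is illustrative and not part of the pipeline:

```python
from datetime import datetime, timezone

# Illustrative: what the "s"/"ms"/"us"/"ns" format values mean for the raw
# column value. The helper below is a sketch, not pipeline code.
UNIT_SCALE = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}

def to_utc(value: int, fmt: str) -> datetime:
    return datetime.fromtimestamp(value / UNIT_SCALE[fmt], tz=timezone.utc)

print(to_utc(1641038400, "s"))      # 2022-01-01 12:00:00+00:00
print(to_utc(1641038400000, "ms"))  # the same instant
```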

pressure_column * (string)

Column name in the ground pressure files that contains the pressure.

Examples: [ "pressure", "p", "ground_pressure" ]

pressure_column_format * (string)

Unit of the pressure column in the ground pressure files.

Allowed values: [ "hPa", "Pa", "bar", "mbar", "atm", "psi", "inHg", "mmHg" ]

atmospheric_profiles * (string)

Directory path to atmospheric profile files.

interferograms * (string)

Directory path to interferogram (ifg) files.

results * (string)

Directory path to results.

profiles (union)

Default: null

Options:

#1 (object)

Settings for vertical profiles retrieval. If `null`, the vertical profiles script will stop and log a warning.

server * (object)

Settings for accessing the ccycle ftp server. Besides the `email` field, these can be left as default in most cases.

email * (string)

Email address to use to log in to the ccycle ftp server.

Min. Length: 3

max_parallel_requests * (integer)

Maximum number of requests to put in the queue on the ccycle server at the same time. A new request can only enter the queue once a previous one has finished.

Minimum: 1

Maximum: 200

scope (union)

Scope of the vertical profiles to request from the ccycle ftp server. If set to `null`, the script will not request any vertical profiles besides the configured standard sites.

Default: null

Options:

#1 (object)

from_date (string)

Date in format `YYYY-MM-DD` from which to request vertical profile data.

Default: "1900-01-01"

to_date (string)

Date in format `YYYY-MM-DD` until which to request vertical profile data.

Default: "2100-01-01"

models * (array)

List of data types to request from the ccycle ftp server.

Key Schema:

# (string)

Allowed values: [ "GGG2014", "GGG2020" ]

force_download_locations (array)

List of locations to force download data for. These will be downloaded even at times when no instrument in the metadata is located there.

Default: []

Key Schema:

# (string)

#2 (null)

GGG2020_standard_sites * (array)

List of standard sites to request from the ccycle ftp server. The requests for these standard sites are done before any other requests so that data available for these sites is not re-requested for other sensors. See https://tccon-wiki.caltech.edu/Main/ObtainingGinputData#Requesting_to_be_added_as_a_standard_site for more information.

Key Schema:

# (object)

identifier * (string)

The identifier on the Caltech server.

lat * (number)

Minimum: -90

Maximum: 90

lon * (number)

Minimum: -180

Maximum: 180

from_date * (string)

Date in format `YYYY-MM-DD` from which this standard site is active.

to_date (string)

Date in format `YYYY-MM-DD` until which this standard site is active. Default is yesterday.

#2 (null)

retrieval (union)

Default: null

Options:

#1 (object)

Settings for automated proffast processing. If `null`, the automated proffast script will stop and log a warning.

general * (object)

max_process_count (integer)

How many parallel processes to dispatch. There will be one process per sensor-day. With hyper-threaded CPUs, this can be higher than the number of physical cores.

Minimum: 1

Maximum: 128

Default: 1

ifg_file_regex * (string)

A regex string to match the ifg file names. In this string, `$(SENSOR_ID)`, `$(YYYY)`, `$(YY)`, `$(MM)`, and `$(DD)` are placeholders to target a certain station and date. The placeholder `$(DATE)` is a shortcut for `$(YYYY)$(MM)$(DD)`. They don't have to be used - you can also run the retrieval on any file it finds in the directory using `.*`.

Min. Length: 1

Examples: [ "^.*\\.\\d+$", "^$(SENSOR_ID)$(DATE).*\\.\\d+$", "^$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).*\\.nc$" ]

queue_verbosity (string)

How much information the retrieval queue should print out. In `verbose` mode it will print out the full list of sensor-days for each step of the filtering process. This can help when figuring out why a certain sensor-day is not processed.

Default: "compact"

Allowed values: [ "compact", "verbose" ]

container_dir (union)

Directory to store the containers in. If not set, it will use `./data/containers` inside the pipeline directory. If your system has enough memory, you could also use `/dev/shm` which is a memory-based file system where files are stored in memory and never written to disk.

Default: null

Options:

#1 (string)

#2 (null)

jobs * (array)

List of retrievals to run. The list will be processed sequentially.

Key Schema:

# (object)

Settings for a single retrieval job.

retrieval_algorithm * (string)

Which retrieval algorithm to use. Proffast 2.X uses the Proffast Pylot under the hood to dispatch it. Proffast 1.0 uses our own custom implementation, similar to the Proffast Pylot.

Allowed values: [ "proffast-1.0", "proffast-2.2", "proffast-2.3", "proffast-2.4", "proffast-2.4.1" ]

atmospheric_profile_model * (string)

Which vertical profiles to use for the retrieval.

Allowed values: [ "GGG2014", "GGG2020" ]

sensor_ids * (array)

Sensor ids to consider in the retrieval.

Min. Items: 1

Key Schema:

# (string)

from_date * (string)

Date string in format `YYYY-MM-DD` from which to consider data in the storage directory.

to_date (string)

Date string in format `YYYY-MM-DD` until which to consider data in the storage directory. Default is yesterday.

settings (object)

Advanced settings that only apply to this retrieval job.

Default: { "store_binary_spectra": false, "dc_min_threshold": 0.05, "dc_var_threshold": 0.1, "use_local_pressure_in_pcxs": false, "use_ifg_corruption_filter": true, "custom_ils": {}, "output_suffix": null, "pressure_calibration_factors": {}, "pressure_calibration_offsets": {} }

store_binary_spectra (boolean)

Whether to store the binary spectra files. These are the files that are used by the retrieval algorithm. They are not needed for the output files, but can be useful for debugging.

Default: false

dc_min_threshold (number)

Value used for the `DC_min` threshold in Proffast. If not set, defaults to the Proffast default.

Minimum: 0.001

Maximum: 0.999

Default: 0.05

dc_var_threshold (number)

Value used for the `DC_var` threshold in Proffast. If not set, defaults to the Proffast default.

Minimum: 0.001

Maximum: 0.999

Default: 0.1

use_local_pressure_in_pcxs (boolean)

Whether to use the local pressure in the pcxs files. If not used, it will tell PCXS to use the pressure from the atmospheric profiles (set the input value in the `.inp` file to `9999.9`). If used, the pipeline computes the solar noon time using `skyfield` and averages the local pressure over the time period noon-2h to noon+2h.

Default: false
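The noon-2h to noon+2h averaging can be sketched as follows. The real pipeline computes solar noon with `skyfield`; here it is simply a given input, and all names below are illustrative assumptions:

```python
from datetime import datetime, timedelta

# Illustrative sketch of the noon-2h to noon+2h pressure averaging. The real
# pipeline computes solar noon with skyfield; here it is simply a given input.
def mean_pressure_around_noon(
    records: list[tuple[datetime, float]], solar_noon: datetime
) -> float:
    window = timedelta(hours=2)
    selected = [p for t, p in records if abs(t - solar_noon) <= window]
    return sum(selected) / len(selected)

# Synthetic example: readings every 30 minutes, rising linearly around noon,
# so the windowed average lands on the value at solar noon (~950.0 hPa).
noon = datetime(2022, 1, 5, 11, 23)
records = [
    (noon + timedelta(minutes=m), 950.0 + m * 0.01) for m in range(-180, 181, 30)
]
print(mean_pressure_around_noon(records, noon))
```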

use_ifg_corruption_filter (boolean)

Whether to use the ifg corruption filter. This filter is a program based on `preprocess4` and is part of the `tum-esm-utils` library: https://tum-esm-utils.netlify.app/api-reference#tum_esm_utilsinterferograms. If activated, only the interferograms that pass the filter - i.e. those that won't cause the retrieval algorithm to crash - are passed on to it.

Default: true

custom_ils (object)

Maps sensor IDs to ILS correction values. If not set, the pipeline will use the values published inside the Proffast Pylot codebase (https://gitlab.eudat.eu/coccon-kit/proffastpylot/-/blob/master/prfpylot/ILSList.csv?ref_type=heads).

Default: {}

Key Schema:

# (object)

channel1_me * (number)

channel1_pe * (number)

channel2_me * (number)

channel2_pe * (number)

output_suffix (union)

Suffix to append to the output folders. If not set, the pipeline output folders are named `sensorid/YYYYMMDD/`. If set, the folders are named `sensorid/YYYYMMDD_suffix/`. This is useful when having multiple retrieval jobs processing the same sensor dates with different settings.

Default: null

Options:

#1 (string)

#2 (null)

pressure_calibration_factors (object)

Maps sensor IDs to pressure calibration factors. If not set, the factor is 1 for each sensor. `corrected_pressure = input_pressure * calibration_factor + calibration_offset`

Default: {}

Examples: [ "{\"ma\": 0.99981}", "{\"ma\": 1.00019, \"mb\": 0.99981}" ]

Key Schema:

# (number)

pressure_calibration_offsets (object)

Maps sensor IDs to pressure calibration offsets. If not set, the offset is 0 for each sensor. `corrected_pressure = input_pressure * calibration_factor + calibration_offset`

Default: {}

Examples: [ "{\"ma\": -0.00007}", "{\"ma\": -0.00007, \"mb\": 0.00019}" ]

Key Schema:

# (number)
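A worked example of the calibration formula, using the factor and offset for sensor `mb` from the example config at the bottom of this page. The helper function itself is illustrative, not pipeline code:

```python
# Worked example of: corrected_pressure = input_pressure * factor + offset.
# The dictionaries mirror the example config; the helper is illustrative.
factors = {"mb": 0.999819}   # sensors not listed default to a factor of 1
offsets = {"mb": -0.000125}  # sensors not listed default to an offset of 0

def correct(sensor_id: str, input_pressure: float) -> float:
    factor = factors.get(sensor_id, 1.0)
    offset = offsets.get(sensor_id, 0.0)
    return input_pressure * factor + offset

print(correct("mb", 950.0))  # 950.0 * 0.999819 - 0.000125
print(correct("ma", 950.0))  # unlisted sensor: unchanged
```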

#2 (null)

bundles (union)

List of output bundling targets.

Default: null

Options:

#1 (array)

Key Schema:

# (object)

There will be one file per combination of sensor id, atmospheric profile, and retrieval algorithm. The final name looks like `em27-retrieval-bundle-$SENSOR_ID-$RETRIEVAL_ALGORITHM-$ATMOSPHERIC_PROFILE-$FROM_DATE-$TO_DATE$BUNDLE_SUFFIX.$OUTPUT_FORMAT`, e.g. `em27-retrieval-bundle-ma-GGG2020-proffast-2.4-20150801-20240523-v2.1.csv`. The bundle suffix is optional and can be used to distinguish between different internal datasets.

dst_dir * (string)

Directory to write the bundled outputs to.

output_formats * (array)

List of output formats to write the merged output files in.

Key Schema:

# (string)

Allowed values: [ "csv", "parquet" ]

from_datetime * (string)

Date in format `YYYY-MM-DDTHH:MM:SS` from which to bundle data

to_datetime * (string)

Date in format `YYYY-MM-DDTHH:MM:SS` to which to bundle data

retrieval_algorithms * (array)

The retrieval algorithms for which to bundle the outputs

Key Schema:

# (string)

Allowed values: [ "proffast-1.0", "proffast-2.2", "proffast-2.3", "proffast-2.4", "proffast-2.4.1" ]

atmospheric_profile_models * (array)

The atmospheric profile models for which to bundle the outputs

Key Schema:

# (string)

Allowed values: [ "GGG2014", "GGG2020" ]

sensor_ids * (array)

The sensor ids for which to bundle the outputs

Key Schema:

# (string)

bundle_suffix (union)

Suffix to append to the output bundles.

Default: null

Examples: [ "v2.1", "v2.2", "oco2-gradient-paper-2021" ]

Options:

#1 (string)

Min. Length: 1

#2 (null)

retrieval_job_output_suffix (union)

If you ran the retrieval with a custom suffix, you can specify it here to only bundle the outputs with that suffix. Use the same value here as in the field `config.retrieval.jobs[i].settings.output_suffix`.

Default: null

Options:

#1 (string)

#2 (null)

parse_dc_timeseries (boolean)

Whether to parse the DC timeseries from the results directories. This output is only available in this pipeline for Proffast 2.4. We adapted the preprocessor to output the DC min/mean/max/variation values for each record of data. If you are having issues with a low signal intensity on one or both channels, you can run the retrieval with a very low `DC_min` threshold and filter the data afterwards instead of having to rerun the retrieval.

Default: false

parse_retrieval_diagnostics (boolean)

Whether to parse the retrieval diagnostics from the results directories - `niter`, `rms`, and `scl` for each retrieval job.

Default: false

#2 (null)

geoms (union)

Default: null

Options:

#1 (object)

sensor_ids * (array)

The sensor ids for which to generate the GEOMS outputs

Key Schema:

# (string)

retrieval_algorithms * (array)

The retrieval algorithms for which to generate the GEOMS outputs

Key Schema:

# (string)

Allowed values: [ "proffast-1.0", "proffast-2.2", "proffast-2.3", "proffast-2.4", "proffast-2.4.1" ]

atmospheric_profile_models * (array)

The atmospheric profile models for which to generate the GEOMS outputs

Key Schema:

# (string)

Allowed values: [ "GGG2014", "GGG2020" ]

from_datetime * (string)

Date in format `YYYY-MM-DDTHH:MM:SS` from which to generate GEOMS data

to_datetime * (string)

Date in format `YYYY-MM-DDTHH:MM:SS` to which to generate GEOMS data

parse_dc_timeseries (boolean)

Whether to parse the DC timeseries from the results directories. This output is only available in this pipeline for Proffast 2.4. We adapted the preprocessor to output the DC min/mean/max/variation values for each record of data. If you are having issues with a low signal intensity on one or both channels, you can run the retrieval with a very low `DC_min` threshold and filter the data afterwards instead of having to rerun the retrieval.

Default: false

dc_min_xco2 (number)

Only considered if `parse_dc_timeseries` is set. Minimum DC value to consider for XCO2 records in the GEOMS outputs. If not set, it uses the default value of Proffast (0.05).

Default: 0.05

dc_min_xch4 (number)

Only considered if `parse_dc_timeseries` is set. Minimum DC value to consider for XCH4 records in the GEOMS outputs. If not set, it uses the default value of Proffast (0.05).

Default: 0.05

dc_min_xh2o (number)

Only considered if `parse_dc_timeseries` is set. Minimum DC value to consider for XH2O records in the GEOMS outputs. If not set, it uses the default value of Proffast (0.05).

Default: 0.05

dc_min_xco (number)

Only considered if `parse_dc_timeseries` is set. Minimum DC value to consider for XCO records in the GEOMS outputs. If not set, it uses the default value of Proffast (0.05).

Default: 0.05

max_sza (union)

Maximum solar zenith angle to consider in the GEOMS outputs. If not set, it will consider all solar zenith angles.

Default: null

Options:

#1 (number)

#2 (null)

min_xair (union)

Minimum XAIR value required for a record to be included in the GEOMS outputs. If not set, all XAIR values are considered.

Default: null

Options:

#1 (number)

#2 (null)

max_xair (union)

Maximum XAIR value allowed for a record to be included in the GEOMS outputs. If not set, all XAIR values are considered.

Default: null

Options:

#1 (number)

#2 (null)

conflict_mode (string)

What to do if an output file already exists.

Default: "replace"

Allowed values: [ "error", "skip", "replace" ]

min_datapoints_per_day (integer)

Minimum number of data points per day required to generate a GEOMS file for that day. If not enough data points are available, no GEOMS file will be generated for that day.

Minimum: 1

Default: 11

#2 (null)

Example

{
  "version": "1.10",
  "general": {
    "metadata": {
      "github_repository": "tum-esm/em27-metadata-storage",
      "access_token": null
    },
    "data": {
      "ground_pressure": {
        "path": "path-to-ground-pressure-data",
        "file_regex": "^ground-pressure-$(SENSOR_ID)-$(YYYY)-$(MM)-$(DD).csv$",
        "separator": ",",
        "pressure_column": "pressure",
        "pressure_column_format": "hPa",
        "date_column": "UTCdate_____",
        "date_column_format": "%Y-%m-%d",
        "time_column": "UTCtime_____",
        "time_column_format": "%H:%M:%S",
        "datetime_column": null,
        "datetime_column_format": null,
        "unix_timestamp_column": null,
        "unix_timestamp_column_format": null
      },
      "atmospheric_profiles": "path-to-atmospheric-profiles",
      "interferograms": "path-to-ifg-upload-directory",
      "results": "path-to-results-storage"
    }
  },
  "profiles": {
    "server": {
      "email": "...@...",
      "max_parallel_requests": 25
    },
    "scope": {
      "from_date": "2022-01-01",
      "to_date": "2022-01-05",
      "models": ["GGG2014", "GGG2020"],
      "force_download_locations": ["TUM_I"]
    },
    "GGG2020_standard_sites": [
      {
        "identifier": "mu",
        "lat": 48.151,
        "lon": 11.569,
        "from_date": "2019-01-01",
        "to_date": "2099-12-31"
      }
    ]
  },
  "retrieval": {
    "general": {
      "max_process_count": 9,
      "ifg_file_regex": "^$(SENSOR_ID)$(DATE).*\\.\\d+$",
      "queue_verbosity": "compact",
      "container_dir": null
    },
    "jobs": [
      {
        "retrieval_algorithm": "proffast-1.0",
        "atmospheric_profile_model": "GGG2014",
        "sensor_ids": ["ma", "mb", "mc", "md", "me"],
        "from_date": "2019-01-01",
        "to_date": "2022-12-31",
        "settings": {
          "store_binary_spectra": true,
          "dc_min_threshold": 0.05,
          "dc_var_threshold": 0.1,
          "use_local_pressure_in_pcxs": true,
          "use_ifg_corruption_filter": false,
          "custom_ils": {
            "ma": {
              "channel1_me": 0.9892,
              "channel1_pe": -0.001082,
              "channel2_me": 0.9892,
              "channel2_pe": -0.001082
            }
          },
          "output_suffix": "template_config",
          "pressure_calibration_factors": {
            "mb": 0.999819
          },
          "pressure_calibration_offsets": {
            "mb": -0.000125
          }
        }
      },
      {
        "retrieval_algorithm": "proffast-2.3",
        "atmospheric_profile_model": "GGG2020",
        "sensor_ids": ["ma", "mb", "mc", "md", "me"],
        "from_date": "2019-01-01",
        "to_date": "2099-12-31"
      }
    ]
  },
  "bundles": [
    {
      "dst_dir": "directory-to-write-the-bundles-to",
      "output_formats": ["csv", "parquet"],
      "from_datetime": "2022-01-01T00:00:00Z",
      "to_datetime": "2022-12-31T23:59:59Z",
      "retrieval_algorithms": ["proffast-1.0", "proffast-2.4"],
      "atmospheric_profile_models": ["GGG2014", "GGG2020"],
      "sensor_ids": ["ma", "mb", "mc", "md", "me"],
      "parse_dc_timeseries": true,
      "parse_retrieval_diagnostics": true
    }
  ],
  "geoms": {
    "sensor_ids": ["ma", "mb", "mc", "md", "me"],
    "retrieval_algorithms": ["proffast-1.0", "proffast-2.4"],
    "atmospheric_profile_models": ["GGG2014", "GGG2020"],
    "from_datetime": "2022-01-01T00:00:00Z",
    "to_datetime": "2022-12-31T23:59:59Z",
    "parse_dc_timeseries": false,
    "max_sza": 80,
    "min_xair": 0.98,
    "max_xair": 1.02,
    "conflict_mode": "replace"
  }
}