Data Preprocessing - convert to WGMS format
Foreword
This notebook demonstrates the data preprocessing workflow of MassBalanceMachine using Icelandic glacier data. It guides you through converting your data to the WGMS format, which will be used throughout the entire pipeline of MassBalanceMachine. Once formatted correctly, follow the data preparation notebook for WGMS(-like) data.
Purpose
This notebook is for users whose data is not in WGMS format or whose records are not associated with a single measurement. We work with Icelandic glacier stake measurements, where each record holds three measurement dates per hydrological year: start of winter, end of winter (which is also the start of summer), and end of summer. Our goal is to reformat each dataset record into three separate records, each corresponding to a single measurement period within the hydrological year.
We strive to accommodate various data formats, but users may occasionally need to make adjustments to ensure compatibility. For detailed guidance on the formatting requirements, refer to the WGMS documentation from the WGMS 2023 database: https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/example_data/wgms_documentation.md. Following these requirements ensures that your data integrates seamlessly into the MassBalanceMachine workflow. If your data format isn’t compatible with this notebook, feel free to use it as inspiration and then submit a pull request with your modifications, following our contributing guidelines. This way, other users can benefit from your contributions in the future.
Process
To begin, we will import the necessary libraries, including the massbalancemachine library. Following this, we will define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration is sourced from the Icelandic Glaciers inventory, provided by the Icelandic Meteorological Office. Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file.
Note: If your dataset already has one measurement period per record but the column names do not match the WGMS format, please rename them manually. The required column names for data processing are: POINT_LAT, POINT_LON, YEAR, POINT_ELEVATION, POINT_ID, TO_DATE, FROM_DATE, and POINT_BALANCE. If needed, you can convert your coordinates to WGS84 using the function convert_to_wgs84(). Ensure the column names match exactly, as these names are used throughout the pipeline.
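For a dataset that already has one measurement period per record, the renaming described in the note can be done with plain pandas. The sketch below is only an illustration: the source column names (`stake_name`, `latitude`, and so on) and the values are hypothetical, so replace them with the names used in your own file. The check at the end simply confirms that all required columns are present.

```python
import pandas as pd

# Hypothetical raw dataset with one measurement period per record.
raw = pd.DataFrame({
    "stake_name": ["s01", "s02"],
    "latitude": [64.885, 64.870],
    "longitude": [-18.774, -18.775],
    "elev_m": [1450.4, 1503.3],
    "year": [1995, 1996],
    "start": ["19940917", "19950916"],
    "end": ["19950520", "19960511"],
    "balance": [2.07, 2.27],
})

# Map the (hypothetical) source names to the required WGMS names.
rename_map = {
    "stake_name": "POINT_ID",
    "latitude": "POINT_LAT",
    "longitude": "POINT_LON",
    "elev_m": "POINT_ELEVATION",
    "year": "YEAR",
    "start": "FROM_DATE",
    "end": "TO_DATE",
    "balance": "POINT_BALANCE",
}

# Column names required by the pipeline, as listed in the note above.
REQUIRED_COLUMNS = [
    "POINT_LAT", "POINT_LON", "YEAR", "POINT_ELEVATION",
    "POINT_ID", "TO_DATE", "FROM_DATE", "POINT_BALANCE",
]

wgms_like = raw.rename(columns=rename_map)

# Fail early if any required column is still missing after renaming.
missing = sorted(set(REQUIRED_COLUMNS) - set(wgms_like.columns))
if missing:
    raise ValueError(f"Missing required columns: {missing}")
```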
[1]:
import os
repoPath = os.path.join(os.getcwd(), '../')
import pandas as pd
import massbalancemachine as mbm
Transform your Dataset to the WGMS Format
[2]:
# Specify the filename of the input file with the raw data
target_data_fname = repoPath+'notebooks/example_data/iceland/files/iceland_stake_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)
First, let’s examine the dataset to understand its structure, including the columns and the data they contain.
[3]:
display(data.head(10))
| | stake | yr | d1 | d2 | d3 | lat | lon | elevation | rhow | rhos | bw_stratigraphic | bs_stratigraphic | ba_stratigraphic | bw_floating_date | bs_floating_date | ba_floating_date | GLIMSId | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hn14aa | 1995 | 17/09/1994 | 20/05/1995 | 16/09/1995 | 64.885013 | -18.773871 | 1450.4 | NaN | NaN | 2.07 | -1.43 | 0.64 | 2.07 | -1.43 | 0.64 | G341234E64913N | NaN |
| 1 | hn14aa | 1996 | 16/09/1995 | 11/05/1996 | 03/10/1996 | 64.885013 | -18.773871 | 1449.8 | NaN | NaN | 1.83 | -1.30 | 0.53 | 1.83 | -1.30 | 0.53 | G341234E64913N | NaN |
| 2 | hn14aa | 1999 | 04/10/1998 | 15/05/1999 | 23/09/1999 | 64.885013 | -18.773871 | 1448.3 | NaN | NaN | NaN | NaN | 1.04 | NaN | NaN | 1.04 | G341234E64913N | NaN |
| 3 | hn14aa | 2000 | 23/09/1999 | 13/05/2000 | 23/09/2000 | 64.885013 | -18.773871 | 1447.3 | 513.0 | 600.0 | 2.49 | -1.11 | 1.38 | 2.49 | -0.97 | 1.52 | G341234E64913N | NaN |
| 4 | hn14aa | 2001 | 23/09/2000 | 11/05/2001 | 28/09/2001 | 64.885013 | -18.773871 | 1446.3 | 499.0 | 600.0 | 1.63 | -0.84 | 0.79 | 1.49 | -0.83 | 0.66 | G341234E64913N | NaN |
| 5 | hn14aa | 2003 | 05/10/2002 | 14/05/2003 | 24/09/2003 | 64.885013 | -18.773871 | 1444.4 | 522.0 | 600.0 | 1.96 | -1.72 | 0.25 | 1.96 | -1.64 | 0.33 | G341234E64913N | NaN |
| 6 | hn15aa | 1996 | 16/09/1995 | 11/05/1996 | 03/10/1996 | 64.869530 | -18.774896 | 1503.3 | NaN | NaN | 2.27 | -1.21 | 1.06 | 2.27 | -1.21 | 1.06 | G341234E64913N | NaN |
| 7 | hn15aa | 1998 | 26/09/1997 | 15/05/1998 | 04/10/1998 | 64.869530 | -18.774896 | 1502.4 | NaN | NaN | 1.85 | -0.65 | 1.20 | 1.85 | -0.65 | 1.20 | G341234E64913N | NaN |
| 8 | hn15aa | 1999 | 04/10/1998 | 15/05/1999 | 23/09/1999 | 64.869530 | -18.774896 | 1502.0 | NaN | NaN | 1.96 | -0.42 | 1.54 | 1.96 | -0.42 | 1.54 | G341234E64913N | NaN |
| 9 | hn15aa | 2000 | 23/09/1999 | 13/05/2000 | 23/09/2000 | 64.869530 | -18.774896 | 1501.4 | 505.0 | 600.0 | 1.73 | -1.20 | 0.52 | 1.73 | -1.04 | 0.68 | G341234E64913N | NaN |
Reshaping the dataset to WGMS-format
As you can see, each record in the dataset contains three measurement dates: one at the start of the hydrological year (beginning of winter), one at the end of winter (start of summer), and one at the end of summer. The measurement periods can also be arbitrary, as long as each record contains exactly three dates and three surface mass balance values. For now, we do not account for other data formats. The purpose of the lines below is to separate these measurements into individual records, each with a single measurement period (FROM_DATE to TO_DATE) and a single surface mass balance.
[4]:
# Please specify the column names on the left side of the dictionary as they are named in your dataset.
# Additionally, add new keys and values for columns you would like to keep from the original dataset.
# These keys and values in the dictionary will be the final column names in your dataset.
wgms_data_columns = {
    'yr': 'YEAR',
    'stake': 'POINT_ID',
    'lat': 'POINT_LAT',
    'lon': 'POINT_LON',
    'elevation': 'POINT_ELEVATION',
    # Do not change these column names (both keys and values)
    'TO_DATE': 'TO_DATE',
    'FROM_DATE': 'FROM_DATE',
    'POINT_BALANCE': 'POINT_BALANCE',
}
# Please specify the three column names for the three measurement dates (these are specifically for the Iceland dataset)
column_names_dates = ['d1', 'd2', 'd3']
# Please specify the three column names for the three surface mass balance measurements (these are specifically for the Iceland dataset)
column_names_smb = ['bw_stratigraphic', 'bs_stratigraphic', 'ba_stratigraphic']
# Reshape the dataset to the WGMS format
data = mbm.data_processing.utils.convert_to_wgms(
    wgms_data_columns=wgms_data_columns,
    data=data,
    date_columns=column_names_dates,
    smb_columns=column_names_smb,
)
Let’s take a look at the dataframe after this reshaping process.
[5]:
display(data.head(10))
| | YEAR | POINT_ID | POINT_LAT | POINT_LON | POINT_ELEVATION | TO_DATE | FROM_DATE | POINT_BALANCE |
|---|---|---|---|---|---|---|---|---|
| 0 | 1995 | hn14aa | 64.885013 | -18.773871 | 1450.4 | 19950520 | 19940917 | 2.07 |
| 1 | 1995 | hn14aa | 64.885013 | -18.773871 | 1450.4 | 19950916 | 19950520 | -1.43 |
| 2 | 1995 | hn14aa | 64.885013 | -18.773871 | 1450.4 | 19950916 | 19940917 | 0.64 |
| 3 | 1996 | hn14aa | 64.885013 | -18.773871 | 1449.8 | 19960511 | 19950916 | 1.83 |
| 4 | 1996 | hn14aa | 64.885013 | -18.773871 | 1449.8 | 19961003 | 19960511 | -1.30 |
| 5 | 1996 | hn14aa | 64.885013 | -18.773871 | 1449.8 | 19961003 | 19950916 | 0.53 |
| 6 | 1999 | hn14aa | 64.885013 | -18.773871 | 1448.3 | 19990515 | 19981004 | NaN |
| 7 | 1999 | hn14aa | 64.885013 | -18.773871 | 1448.3 | 19990923 | 19990515 | NaN |
| 8 | 1999 | hn14aa | 64.885013 | -18.773871 | 1448.3 | 19990923 | 19981004 | 1.04 |
| 9 | 2000 | hn14aa | 64.885013 | -18.773871 | 1447.3 | 20000513 | 19990923 | 2.49 |
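To make the reshaping concrete, here is a minimal pandas sketch of the same idea, applied to the first record of the Iceland dataset. It is not the implementation of convert_to_wgms; it only mirrors the pattern visible in the table above, where each input record yields a winter record (d1 to d2), a summer record (d2 to d3), and an annual record (d1 to d3). Storing the dates as YYYYMMDD strings is an assumption made for this sketch.

```python
import pandas as pd

# First record of the Iceland dataset, reduced to the relevant columns.
wide = pd.DataFrame({
    "stake": ["hn14aa"],
    "yr": [1995],
    "d1": ["17/09/1994"],
    "d2": ["20/05/1995"],
    "d3": ["16/09/1995"],
    "bw_stratigraphic": [2.07],
    "bs_stratigraphic": [-1.43],
    "ba_stratigraphic": [0.64],
})

# (from-date column, to-date column, balance column) for each period.
periods = [
    ("d1", "d2", "bw_stratigraphic"),  # winter balance
    ("d2", "d3", "bs_stratigraphic"),  # summer balance
    ("d1", "d3", "ba_stratigraphic"),  # annual balance
]


def to_yyyymmdd(series):
    """Convert DD/MM/YYYY strings to YYYYMMDD strings."""
    return pd.to_datetime(series, format="%d/%m/%Y").dt.strftime("%Y%m%d")


frames = []
for from_col, to_col, smb_col in periods:
    frames.append(pd.DataFrame({
        "YEAR": wide["yr"],
        "POINT_ID": wide["stake"],
        "TO_DATE": to_yyyymmdd(wide[to_col]),
        "FROM_DATE": to_yyyymmdd(wide[from_col]),
        "POINT_BALANCE": wide[smb_col],
    }))

# A stable sort on the original index keeps the three records of each
# input row together, in winter, summer, annual order.
long_df = (
    pd.concat(frames)
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
```

The resulting three rows carry the same dates and balances as rows 0 to 2 of the table above.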
Reproject Coordinates to WGS84 Coordinate Reference System
If your coordinates are not already in WGS84, you can reproject them at this stage. Specify the current coordinate reference system (CRS) of the coordinates through the from_crs argument (4659 in this example).
[6]:
data = mbm.data_processing.utils.convert_to_wgs84(data=data, from_crs=4659)
[7]:
data.to_csv(repoPath+'notebooks/example_data/iceland/files/iceland_wgms_dataset.csv',
index=False)
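Before moving on, it can be useful to run a few quick sanity checks on the converted dataset. The sketch below uses a small stand-in dataframe with values copied from the table above; in practice you would run the same checks on your own converted data. The checks themselves (valid WGS84 ranges, FROM_DATE before TO_DATE, count of missing balances) are suggestions and not part of MassBalanceMachine.

```python
import pandas as pd

# Small stand-in for the converted dataset.
converted = pd.DataFrame({
    "POINT_LAT": [64.885013, 64.885013],
    "POINT_LON": [-18.773871, -18.773871],
    "TO_DATE": ["19950520", "19950916"],
    "FROM_DATE": ["19940917", "19950520"],
    "POINT_BALANCE": [2.07, -1.43],
})

# WGS84 latitudes and longitudes must fall within these ranges.
lat_ok = converted["POINT_LAT"].between(-90, 90).all()
lon_ok = converted["POINT_LON"].between(-180, 180).all()

# Every measurement period should start before it ends.
from_dates = pd.to_datetime(converted["FROM_DATE"].astype(str), format="%Y%m%d")
to_dates = pd.to_datetime(converted["TO_DATE"].astype(str), format="%Y%m%d")
order_ok = (from_dates < to_dates).all()

# Count records without a surface mass balance value.
n_missing = int(converted["POINT_BALANCE"].isna().sum())

print(f"lat ok: {lat_ok}, lon ok: {lon_ok}, dates ordered: {order_ok}")
print(f"records without POINT_BALANCE: {n_missing}")
```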
At this stage, your dataset is ready to be processed further by retrieving topographical and meteorological features and converting the dataset to a monthly resolution. The next step is to follow the data preparation notebook to see how data in the WGMS format can be incorporated into the data processing pipeline.