Data Preprocessing - convert to WGMS format

Foreword

This notebook demonstrates the data preprocessing workflow of MassBalanceMachine using Icelandic glacier data. It guides you through converting your data to the WGMS format, which is used throughout the entire MassBalanceMachine pipeline. Once your data is formatted correctly, continue with the data preparation notebook for WGMS(-like) data.

Purpose

This notebook is for users whose data is not in WGMS format or whose records each hold more than one measurement. We work with Icelandic glacier stake measurements, which have three recordings per hydrological year (start of winter, end of winter/start of summer, and end of summer). Our goal is to reformat each dataset record into three separate records, each corresponding to one measurement period within the hydrological year.

We strive to accommodate various data formats, but users may occasionally need to make adjustments to ensure compatibility. For assistance, refer to the WGMS documentation, which provides detailed guidance on the formatting requirements. The documentation, from the WGMS 2023 database, can be found here: https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/example_data/wgms_documentation.md. Following it helps your data integrate seamlessly into the MassBalanceMachine workflow. If your data format isn’t compatible with this notebook, feel free to use it as inspiration and then submit a pull request with your modifications, following our contributing guidelines. This way, other users can benefit from your contributions in the future.

Process

To begin, we import the necessary libraries, including the massbalancemachine library. We then define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration comes from the Icelandic Glaciers inventory, provided by the Icelandic Meteorological Office. Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file.

Note: If your dataset already has one measurement period per record but the column names do not match the WGMS format, please rename them manually. The required column names for data processing are: POINT_LAT, POINT_LON, YEAR, POINT_ELEVATION, POINT_ID, TO_DATE, FROM_DATE, and POINT_BALANCE. If needed, you can reproject your coordinates from their current CRS to WGS84 using the function convert_to_wgs84(). Ensure the column names match exactly, as these names are used throughout the pipeline.
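Because the column names must match exactly, it can help to check them before moving on. The sketch below is a minimal illustration using only the column names listed in the note above; missing_wgms_columns is a hypothetical helper and not part of massbalancemachine.

```python
# Required column names, as listed in the note above.
REQUIRED_COLUMNS = [
    "POINT_LAT", "POINT_LON", "YEAR", "POINT_ELEVATION",
    "POINT_ID", "TO_DATE", "FROM_DATE", "POINT_BALANCE",
]


def missing_wgms_columns(columns):
    """Return the required names that are absent from `columns`.

    Hypothetical helper for illustration. The comparison is case-sensitive,
    since the names have to match exactly.
    """
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example: the raw Icelandic column names, before any renaming.
raw_columns = ["stake", "yr", "d1", "d2", "d3", "lat", "lon", "elevation"]
missing = missing_wgms_columns(raw_columns)
print("Missing columns:", missing)
```

With a pandas DataFrame, the same check could be run as missing_wgms_columns(data.columns); an empty list means all required names are present.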

[1]:
import os

import pandas as pd
import massbalancemachine as mbm

# Repository root, assuming the notebook is run from a direct subdirectory (e.g. notebooks/)
repoPath = os.path.join(os.getcwd(), '../')

Transform your Dataset to the WGMS Format

[2]:
# Specify the filename of the input file with the raw data
target_data_fname = repoPath+'notebooks/example_data/iceland/files/iceland_stake_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)

First, let’s examine the dataset to understand its structure, including the columns and the data they contain.

[3]:
display(data.head(10))
    stake    yr          d1          d2          d3        lat         lon  elevation   rhow   rhos  bw_stratigraphic  bs_stratigraphic  ba_stratigraphic  bw_floating_date  bs_floating_date  ba_floating_date         GLIMSId  Name
0  hn14aa  1995  17/09/1994  20/05/1995  16/09/1995  64.885013  -18.773871     1450.4    NaN    NaN              2.07             -1.43              0.64              2.07             -1.43              0.64  G341234E64913N   NaN
1  hn14aa  1996  16/09/1995  11/05/1996  03/10/1996  64.885013  -18.773871     1449.8    NaN    NaN              1.83             -1.30              0.53              1.83             -1.30              0.53  G341234E64913N   NaN
2  hn14aa  1999  04/10/1998  15/05/1999  23/09/1999  64.885013  -18.773871     1448.3    NaN    NaN               NaN               NaN              1.04               NaN               NaN              1.04  G341234E64913N   NaN
3  hn14aa  2000  23/09/1999  13/05/2000  23/09/2000  64.885013  -18.773871     1447.3  513.0  600.0              2.49             -1.11              1.38              2.49             -0.97              1.52  G341234E64913N   NaN
4  hn14aa  2001  23/09/2000  11/05/2001  28/09/2001  64.885013  -18.773871     1446.3  499.0  600.0              1.63             -0.84              0.79              1.49             -0.83              0.66  G341234E64913N   NaN
5  hn14aa  2003  05/10/2002  14/05/2003  24/09/2003  64.885013  -18.773871     1444.4  522.0  600.0              1.96             -1.72              0.25              1.96             -1.64              0.33  G341234E64913N   NaN
6  hn15aa  1996  16/09/1995  11/05/1996  03/10/1996  64.869530  -18.774896     1503.3    NaN    NaN              2.27             -1.21              1.06              2.27             -1.21              1.06  G341234E64913N   NaN
7  hn15aa  1998  26/09/1997  15/05/1998  04/10/1998  64.869530  -18.774896     1502.4    NaN    NaN              1.85             -0.65              1.20              1.85             -0.65              1.20  G341234E64913N   NaN
8  hn15aa  1999  04/10/1998  15/05/1999  23/09/1999  64.869530  -18.774896     1502.0    NaN    NaN              1.96             -0.42              1.54              1.96             -0.42              1.54  G341234E64913N   NaN
9  hn15aa  2000  23/09/1999  13/05/2000  23/09/2000  64.869530  -18.774896     1501.4  505.0  600.0              1.73             -1.20              0.52              1.73             -1.04              0.68  G341234E64913N   NaN

Reshaping the Dataset to the WGMS Format

As you can see, each record in the dataset contains three measurement dates: one at the start of the hydrological year (beginning of winter), one at the end of winter (start of summer), and one at the end of summer. The measurement dates can also be arbitrary, as long as there are exactly three per record; other layouts are not supported for now. The purpose of the lines below is to split each record into individual records, each with a single measurement period (FROM_DATE to TO_DATE) and one surface mass balance value.
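To make the intended reshaping concrete, here is a minimal sketch of the logic on a single record, using only the standard library. It is not the implementation of convert_to_wgms: split_record is a hypothetical helper, the day/month/year input format is an assumption inferred from values such as 17/09/1994, and the pairing of dates with balances (winter: d1 to d2, summer: d2 to d3, annual: d1 to d3) is inferred from the reshaped output shown in this notebook.

```python
from datetime import datetime


def split_record(record, date_columns, smb_columns):
    """Split one three-measurement record into three single-period rows.

    Hypothetical helper for illustration only; not the library's code.
    """
    # Assumption: input dates are day/month/year strings.
    d1, d2, d3 = (
        datetime.strptime(record[col], "%d/%m/%Y").strftime("%Y%m%d")
        for col in date_columns
    )
    winter, summer, annual = (record[col] for col in smb_columns)
    # (FROM_DATE, TO_DATE, POINT_BALANCE) for each measurement period
    periods = [(d1, d2, winter), (d2, d3, summer), (d1, d3, annual)]
    return [
        {"FROM_DATE": start, "TO_DATE": end, "POINT_BALANCE": balance}
        for start, end, balance in periods
    ]


# First record of the Icelandic dataset shown above.
record = {
    "d1": "17/09/1994", "d2": "20/05/1995", "d3": "16/09/1995",
    "bw_stratigraphic": 2.07, "bs_stratigraphic": -1.43, "ba_stratigraphic": 0.64,
}
rows = split_record(
    record,
    date_columns=["d1", "d2", "d3"],
    smb_columns=["bw_stratigraphic", "bs_stratigraphic", "ba_stratigraphic"],
)
for row in rows:
    print(row)
```

Columns such as the stake identifier, coordinates, and elevation would simply be repeated on each of the three rows.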

[4]:
# Map the column names in your dataset (keys, left) to the WGMS column names (values, right).
# To keep additional columns from the original dataset, add them as new key-value pairs.
# The values in this dictionary will be the final column names in your dataset.
wgms_data_columns = {
    'yr': 'YEAR',
    'stake': 'POINT_ID',
    'lat': 'POINT_LAT',
    'lon': 'POINT_LON',
    'elevation': 'POINT_ELEVATION',
    # Do not change these column names (both keys and values)
    'TO_DATE': 'TO_DATE',
    'FROM_DATE': 'FROM_DATE',
    'POINT_BALANCE': 'POINT_BALANCE',
}

# Please specify the three column names for the three measurement dates (these are specifically for the Iceland dataset)
column_names_dates = ['d1', 'd2', 'd3']

# Please specify the three column names for the three surface mass balance measurements (these are specifically for the Iceland dataset)
column_names_smb = ['bw_stratigraphic', 'bs_stratigraphic', 'ba_stratigraphic']

# Reshape the dataset to the WGMS format
data = mbm.data_processing.utils.convert_to_wgms(
    wgms_data_columns=wgms_data_columns,
    data=data,
    date_columns=column_names_dates,
    smb_columns=column_names_smb,
)

Let’s take a look at the dataframe after this reshaping process.

[5]:
display(data.head(10))
   YEAR  POINT_ID  POINT_LAT   POINT_LON  POINT_ELEVATION   TO_DATE  FROM_DATE  POINT_BALANCE
0  1995    hn14aa  64.885013  -18.773871           1450.4  19950520   19940917           2.07
1  1995    hn14aa  64.885013  -18.773871           1450.4  19950916   19950520          -1.43
2  1995    hn14aa  64.885013  -18.773871           1450.4  19950916   19940917           0.64
3  1996    hn14aa  64.885013  -18.773871           1449.8  19960511   19950916           1.83
4  1996    hn14aa  64.885013  -18.773871           1449.8  19961003   19960511          -1.30
5  1996    hn14aa  64.885013  -18.773871           1449.8  19961003   19950916           0.53
6  1999    hn14aa  64.885013  -18.773871           1448.3  19990515   19981004            NaN
7  1999    hn14aa  64.885013  -18.773871           1448.3  19990923   19990515            NaN
8  1999    hn14aa  64.885013  -18.773871           1448.3  19990923   19981004           1.04
9  2000    hn14aa  64.885013  -18.773871           1447.3  20000513   19990923           2.49
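In the output above, FROM_DATE and TO_DATE appear as compact YYYYMMDD values. One way to sanity-check the reshaped records is to confirm that every period has a positive length. The sketch below does this with the standard library; period_length_days is a hypothetical helper, and wrapping the values in str() is a precaution because the output does not show whether the dates are stored as strings or integers.

```python
from datetime import datetime


def period_length_days(from_date, to_date):
    """Number of days between two YYYYMMDD dates (hypothetical helper)."""
    start = datetime.strptime(str(from_date), "%Y%m%d")
    end = datetime.strptime(str(to_date), "%Y%m%d")
    return (end - start).days


# (FROM_DATE, TO_DATE) pairs from the first three rows of the output above:
# the winter, summer, and annual periods of stake hn14aa in 1995.
periods = [
    ("19940917", "19950520"),
    ("19950520", "19950916"),
    ("19940917", "19950916"),
]
lengths = [period_length_days(start, end) for start, end in periods]
print(lengths)

# A period that ends before it starts would indicate swapped date columns.
assert all(length > 0 for length in lengths)
```

For these three rows, the winter and summer lengths add up to the annual length, which is a further consistency check when the three dates come from the same original record.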

Reproject Coordinates to WGS84 Coordinate Reference System

At this stage, you can reproject the coordinates to WGS84 if they are not already in that coordinate reference system (CRS). Please specify the current CRS of the coordinates through the from_crs argument (4659 in this example).
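If you are unsure whether your coordinates need reprojecting, a rough first test is whether they could be degrees at all. The sketch below is a heuristic under stated assumptions, not a CRS detector: looks_like_degrees is a hypothetical helper, and the projected easting/northing values are made up for illustration. Values outside the valid latitude/longitude ranges cannot be geographic coordinates, but values inside them are not guaranteed to be WGS84, since other degree-based reference systems pass as well.

```python
def looks_like_degrees(latitudes, longitudes):
    """Return True if all values lie within valid latitude/longitude ranges.

    Hypothetical helper: a necessary condition for WGS84 coordinates,
    not a sufficient one.
    """
    lat_ok = all(-90.0 <= lat <= 90.0 for lat in latitudes)
    lon_ok = all(-180.0 <= lon <= 180.0 for lon in longitudes)
    return lat_ok and lon_ok


# Stake coordinates from the Icelandic dataset shown earlier.
iceland_ok = looks_like_degrees([64.885013, 64.869530], [-18.773871, -18.774896])

# Made-up projected coordinates in metres, which clearly fail the check.
projected_ok = looks_like_degrees([7200000.0], [500000.0])

print(iceland_ok, projected_ok)
```

With a DataFrame, the check could be called as looks_like_degrees(data["POINT_LAT"], data["POINT_LON"]); a False result means the coordinates must be reprojected before continuing.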

[6]:
data = mbm.data_processing.utils.convert_to_wgs84(data=data, from_crs=4659)

[7]:
data.to_csv(repoPath+'notebooks/example_data/iceland/files/iceland_wgms_dataset.csv',
            index=False)

Your dataset is now ready for further processing: retrieving topographical and meteorological features and converting the dataset to a monthly resolution. The next step is to follow the data preparation notebook to see how data in the WGMS format is incorporated into the data processing pipeline.