Data Preprocessing - convert to WGMS format

Foreword

This notebook demonstrates the data preprocessing workflow of MassBalanceMachine using Icelandic glacier data. It guides you through converting your data to the WGMS format, which is used throughout the entire MassBalanceMachine pipeline. Once your data is formatted correctly, continue with the data preparation notebook for WGMS(-like) data.

Purpose

This notebook is for users whose data is not in WGMS format or whose records each contain more than one measurement. We work with Icelandic glacier stake measurements, where each record holds three measurement dates per hydrological year (start of winter, end of winter/start of summer, and end of summer) and three corresponding surface mass balance values. Our goal is to reformat each dataset record into three separate records, each corresponding to a single stake measurement within the hydrological year.

We strive to accommodate various data formats, but users may occasionally need to make adjustments to ensure compatibility. For assistance, refer to the WGMS documentation, which provides detailed guidance on formatting requirements. The documentation, taken from the WGMS 2023 database, can be found here: https://github.com/ODINN-SciML/MassBalanceMachine/blob/main/notebooks/example_data/wgms_documentation.md. Following it ensures that your data integrates seamlessly into the MassBalanceMachine workflow. If your data format isn’t compatible with this notebook, feel free to use it as inspiration and then submit a pull request with your modifications, following our contributing guidelines. This way, other users can benefit from your contributions in the future.

Process

To begin, we will import necessary libraries, including the massbalancemachine library. Following this, we will define the storage location for files related to our region of interest, which in this case is Iceland. The data used in this demonstration is sourced from the Icelandic Glaciers inventory, provided by the Icelandic Meteorological Office. Stake measurements for the Icelandic glaciers have already been retrieved via an API call and merged into a single file.

Note: If your dataset already has one measurement period per record but the column names do not match the WGMS format, please rename them manually. The required column names for data processing are: POINT_LAT, POINT_LON, YEAR, POINT_ELEVATION, POINT_ID, TO_DATE, FROM_DATE, and POINT_BALANCE. If needed, you can convert your coordinates from their current CRS to WGS84 using the function convert_to_wgs84(). Ensure the column names match exactly, as these names are used throughout the pipeline.
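The manual renaming described in the note can be sketched with plain pandas. This is a minimal illustration, not part of MassBalanceMachine: the source column names (stake_id, latitude, and so on) and the values are hypothetical, and only the eight target names come from the note above.

import pandas as pd

# Hypothetical raw table with one measurement period per record.
# The source column names below are made up for illustration.
raw = pd.DataFrame({
    "stake_id": ["s01"],
    "latitude": [64.885013],
    "longitude": [-18.773871],
    "elev_m": [1450.4],
    "year": [1995],
    "start": ["19940917"],
    "end": ["19950520"],
    "balance": [2.07],
})

# Map each (hypothetical) source name to the required WGMS name.
rename_map = {
    "stake_id": "POINT_ID",
    "latitude": "POINT_LAT",
    "longitude": "POINT_LON",
    "elev_m": "POINT_ELEVATION",
    "year": "YEAR",
    "start": "FROM_DATE",
    "end": "TO_DATE",
    "balance": "POINT_BALANCE",
}

# Column names required by the pipeline, as listed in the note.
REQUIRED = ["POINT_LAT", "POINT_LON", "YEAR", "POINT_ELEVATION",
            "POINT_ID", "TO_DATE", "FROM_DATE", "POINT_BALANCE"]

renamed = raw.rename(columns=rename_map)

# Fail early if any required column is still missing after renaming.
missing = [name for name in REQUIRED if name not in renamed.columns]
if missing:
    raise ValueError(f"Missing required WGMS columns: {missing}")

Checking for missing names right after renaming catches typos in the mapping before they surface later in the pipeline.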

[1]:

import os

import pandas as pd
import massbalancemachine as mbm

# Repository root, taken to be one level above the current working directory
repoPath = os.path.join(os.getcwd(), '../')

Transform your Dataset to the WGMS Format

[2]:

# Specify the filename of the input file with the raw data
target_data_fname = repoPath+'notebooks/example_data/iceland/files/iceland_stake_dataset.csv'
# Load the target data
data = pd.read_csv(target_data_fname)

First, let’s examine the dataset to understand its structure, including the columns and the data they contain.

[3]:

display(data.head(10))

	stake	yr	d1	d2	d3	lat	lon	elevation	rhow	rhos	bw_stratigraphic	bs_stratigraphic	ba_stratigraphic	bw_floating_date	bs_floating_date	ba_floating_date	GLIMSId	Name
0	hn14aa	1995	17/09/1994	20/05/1995	16/09/1995	64.885013	-18.773871	1450.4	NaN	NaN	2.07	-1.43	0.64	2.07	-1.43	0.64	G341234E64913N	NaN
1	hn14aa	1996	16/09/1995	11/05/1996	03/10/1996	64.885013	-18.773871	1449.8	NaN	NaN	1.83	-1.30	0.53	1.83	-1.30	0.53	G341234E64913N	NaN
2	hn14aa	1999	04/10/1998	15/05/1999	23/09/1999	64.885013	-18.773871	1448.3	NaN	NaN	NaN	NaN	1.04	NaN	NaN	1.04	G341234E64913N	NaN
3	hn14aa	2000	23/09/1999	13/05/2000	23/09/2000	64.885013	-18.773871	1447.3	513.0	600.0	2.49	-1.11	1.38	2.49	-0.97	1.52	G341234E64913N	NaN
4	hn14aa	2001	23/09/2000	11/05/2001	28/09/2001	64.885013	-18.773871	1446.3	499.0	600.0	1.63	-0.84	0.79	1.49	-0.83	0.66	G341234E64913N	NaN
5	hn14aa	2003	05/10/2002	14/05/2003	24/09/2003	64.885013	-18.773871	1444.4	522.0	600.0	1.96	-1.72	0.25	1.96	-1.64	0.33	G341234E64913N	NaN
6	hn15aa	1996	16/09/1995	11/05/1996	03/10/1996	64.869530	-18.774896	1503.3	NaN	NaN	2.27	-1.21	1.06	2.27	-1.21	1.06	G341234E64913N	NaN
7	hn15aa	1998	26/09/1997	15/05/1998	04/10/1998	64.869530	-18.774896	1502.4	NaN	NaN	1.85	-0.65	1.20	1.85	-0.65	1.20	G341234E64913N	NaN
8	hn15aa	1999	04/10/1998	15/05/1999	23/09/1999	64.869530	-18.774896	1502.0	NaN	NaN	1.96	-0.42	1.54	1.96	-0.42	1.54	G341234E64913N	NaN
9	hn15aa	2000	23/09/1999	13/05/2000	23/09/2000	64.869530	-18.774896	1501.4	505.0	600.0	1.73	-1.20	0.52	1.73	-1.04	0.68	G341234E64913N	NaN
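The preview shows day-first date strings (DD/MM/YYYY) and NaN entries in several columns. Before reshaping, it can help to count missing values and confirm that the three dates of each record are in chronological order. The snippet below is a sketch on a small hand-copied excerpt of the table above, not a step of the MassBalanceMachine pipeline.

import pandas as pd

# Two records copied from the preview above (subset of columns).
preview = pd.DataFrame({
    "stake": ["hn14aa", "hn14aa"],
    "d1": ["04/10/1998", "23/09/1999"],
    "d2": ["15/05/1999", "13/05/2000"],
    "d3": ["23/09/1999", "23/09/2000"],
    "bw_stratigraphic": [float("nan"), 2.49],
    "ba_stratigraphic": [1.04, 1.38],
})

# Count missing values per column.
missing_per_column = preview.isna().sum()

# Parse the day-first date strings explicitly to avoid month/day ambiguity.
date_cols = ["d1", "d2", "d3"]
dates = preview[date_cols].apply(lambda s: pd.to_datetime(s, format="%d/%m/%Y"))

# True for each record whose dates satisfy d1 < d2 < d3.
chronological = (dates["d1"] < dates["d2"]) & (dates["d2"] < dates["d3"])

print(missing_per_column)
print(chronological)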

Reshaping the dataset to WGMS-format

As you can see, each record in the dataset contains three measurement dates: one at the start of the hydrological year (beginning of winter), one at the end of winter (start of summer), and one at the end of summer. The measurement periods themselves can be arbitrary, as long as each record contains three of them. For now, we do not account for other data formats. The purpose of the lines below is to separate these measurements into individual records, each with a single measurement period (FROM_DATE to TO_DATE) and one surface mass balance value.

[4]:

# Please specify the dictionary keys (left side) as the columns are named in your dataset.
# Additionally, add new keys and values for columns you would like to keep from the original dataset.
# The dictionary values (right side) will be the final column names in your dataset.
wgms_data_columns = {
    'yr': 'YEAR',
    'stake': 'POINT_ID',
    'lat': 'POINT_LAT',
    'lon': 'POINT_LON',
    'elevation': 'POINT_ELEVATION',
    # Do not change these column names (both keys and values)
    'TO_DATE': 'TO_DATE',
    'FROM_DATE': 'FROM_DATE',
    'POINT_BALANCE': 'POINT_BALANCE',
}

# Please specify the three column names for the three measurement dates (these are specifically for the Iceland dataset)
column_names_dates = ['d1', 'd2', 'd3']

# Please specify the three column names for the three surface mass balance measurements (these are specifically for the Iceland dataset)
column_names_smb = ['bw_stratigraphic', 'bs_stratigraphic', 'ba_stratigraphic']

# Reshape the dataset to the WGMS format
data = mbm.data_processing.utils.convert_to_wgms(wgms_data_columns=wgms_data_columns,
                                 data=data,
                                 date_columns=column_names_dates,
                                 smb_columns=column_names_smb)

Let’s take a look at the dataframe after this reshaping process.

[5]:

display(data.head(10))

	YEAR	POINT_ID	POINT_LAT	POINT_LON	POINT_ELEVATION	TO_DATE	FROM_DATE	POINT_BALANCE
0	1995	hn14aa	64.885013	-18.773871	1450.4	19950520	19940917	2.07
1	1995	hn14aa	64.885013	-18.773871	1450.4	19950916	19950520	-1.43
2	1995	hn14aa	64.885013	-18.773871	1450.4	19950916	19940917	0.64
3	1996	hn14aa	64.885013	-18.773871	1449.8	19960511	19950916	1.83
4	1996	hn14aa	64.885013	-18.773871	1449.8	19961003	19960511	-1.30
5	1996	hn14aa	64.885013	-18.773871	1449.8	19961003	19950916	0.53
6	1999	hn14aa	64.885013	-18.773871	1448.3	19990515	19981004	NaN
7	1999	hn14aa	64.885013	-18.773871	1448.3	19990923	19990515	NaN
8	1999	hn14aa	64.885013	-18.773871	1448.3	19990923	19981004	1.04
9	2000	hn14aa	64.885013	-18.773871	1447.3	20000513	19990923	2.49
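In the reshaped output above, each original record yields three rows: d1 to d2 with the winter balance, d2 to d3 with the summer balance, and d1 to d3 with the annual balance, with dates written as YYYYMMDD. The following pandas sketch reproduces that pattern on two records from the raw table. It is an illustration of the idea only and is not the implementation of convert_to_wgms(); the helper name to_yyyymmdd is hypothetical.

import pandas as pd

# Two raw records copied from the Iceland preview (subset of columns).
raw = pd.DataFrame({
    "stake": ["hn14aa", "hn14aa"],
    "yr": [1995, 1996],
    "d1": ["17/09/1994", "16/09/1995"],
    "d2": ["20/05/1995", "11/05/1996"],
    "d3": ["16/09/1995", "03/10/1996"],
    "bw_stratigraphic": [2.07, 1.83],
    "bs_stratigraphic": [-1.43, -1.30],
    "ba_stratigraphic": [0.64, 0.53],
})

# (from-date column, to-date column, balance column) for each period,
# inferred from the reshaped table shown above.
periods = [
    ("d1", "d2", "bw_stratigraphic"),  # winter
    ("d2", "d3", "bs_stratigraphic"),  # summer
    ("d1", "d3", "ba_stratigraphic"),  # annual
]


def to_yyyymmdd(series):
    """Convert day-first DD/MM/YYYY strings to YYYYMMDD strings (hypothetical helper)."""
    return pd.to_datetime(series, format="%d/%m/%Y").dt.strftime("%Y%m%d")


parts = []
for from_col, to_col, smb_col in periods:
    parts.append(pd.DataFrame({
        "YEAR": raw["yr"],
        "POINT_ID": raw["stake"],
        "TO_DATE": to_yyyymmdd(raw[to_col]),
        "FROM_DATE": to_yyyymmdd(raw[from_col]),
        "POINT_BALANCE": raw[smb_col],
    }))

# Stack the three period tables, then use a stable sort on the original row
# index so the three rows of each source record stay together and in order.
reshaped = (pd.concat(parts)
            .sort_index(kind="mergesort")
            .reset_index(drop=True))

print(reshaped)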

Reproject Coordinates to WGS84 Coordinate Reference System

At this stage, if needed, you can convert the coordinates from their current coordinate reference system (CRS) to WGS84 if they are not already expressed in it. Please specify the current CRS of the coordinates with the from_crs argument.

[6]:

data = mbm.data_processing.utils.convert_to_wgs84(data=data, from_crs=4659)
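A quick plausibility check can help decide whether a conversion is needed or has worked. The sketch below is a heuristic only, using plain pandas on two coordinate pairs from the example data: values outside the valid degree ranges cannot be latitude/longitude and point to projected coordinates, while values inside the ranges do not by themselves prove that the datum is WGS84.

import pandas as pd

# Two coordinate pairs taken from the example dataset.
points = pd.DataFrame({
    "POINT_LAT": [64.885013, 64.869530],
    "POINT_LON": [-18.773871, -18.774896],
})

# Latitude must lie in [-90, 90] and longitude in [-180, 180] degrees.
lat_ok = points["POINT_LAT"].between(-90, 90)
lon_ok = points["POINT_LON"].between(-180, 180)

# True only if every point falls within the valid degree ranges.
looks_geographic = bool((lat_ok & lon_ok).all())

if not looks_geographic:
    print("Some coordinates are not valid degrees; they may be in a projected CRS.")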

[7]:

data.to_csv(repoPath+'notebooks/example_data/iceland/files/iceland_wgms_dataset.csv',
            index=False)

At this stage, your dataset is ready to be processed further by retrieving topographical and meteorological features and converting the dataset to a monthly resolution. The next step is to follow the data preparation notebook to see how data in the WGMS format can be incorporated into the data processing pipeline.