Data preparation

In this notebook, we will demonstrate the data processing workflow of the MassBalanceMachine using an example with glaciological data from Icelandic glaciers. This example will guide you through the data processing pipeline, which retrieves topographical and meteorological features for all points with glaciological surface mass balance observations, and transforms the data to a monthly resolution.

We begin by importing some basic libraries along with the massbalancemachine library. Next, we specify the location where we will store the files for the region of interest. The data used in this example is from the Icelandic Glaciers inventory, that was already preprocessed, to align with the WGMS data format we are using, in the data preprocessing notebook. You can also download data from the WGMS database, as you desire.

Note: WGMS data can contain errors or incorrect data, please check the data for correctness before using it.

Note: All data, stake measurements and glaciers outlines, are expected to have the WGS84 CRS.

[1]:

import os

repoPath = os.path.join(os.getcwd(), '../')
import pandas as pd
import geopandas as gpd

import massbalancemachine as mbm

cfg = mbm.Config()

1. Load your Target Surface Mass Balance Dataset and Retrieve RGI ID per Stake

In this step, we define and load our glaciological data from a region of interest. The WGMS dataset does not include RGI IDs by default, so we need to retrieve them from a glacier outline shapefile provided by the Randolph Glacier Inventory (v6). Each stake is then matched with an RGI ID. The RGI ID is necessary for the MassBalanceMachine to add additional topographical and meteorological features for the training stage.

The glacier outlines are automatically downloaded through OGGM as a shapefile. It is also possible to provide your own shapefile directly (see example hereafter).

Note: Data records that have an invalid FROM_DATE or TO_DATE, where the day or month is indicated as 99, are deleted from the dataset.

[2]:

# Specify the filename of the input file with the raw data
target_data_fname = repoPath + 'notebooks/example_data/iceland/files/iceland_wgms_dataset.csv'

# Load the target data
data = pd.read_csv(target_data_fname)

1.1 Match the Stake Measurements with a RGI ID

Based on the location of the stake measurement given by POINT_LAT and POINT_LON, each data record is matched with the RGI ID for the glacier where the stake is located.

[3]:

# # Get the RGI ID for each stake measurement for the region of interest
# # The shape file is automatically retrieved with OGMM in the background
# data = mbm.data_processing.utils.get_rgi(data=data, region=6)
# data

[4]:

# Specify the shape filename of the glaciers outline obtained from RGIv6
# The shape file is specified by the user
glacier_outline_fname = repoPath + 'notebooks/example_data/iceland/files/06_rgi62_Iceland.shp'
# Load the glacier outlines
glacier_outline = gpd.read_file(glacier_outline_fname)
# Get the RGI ID for each stake measurement for the region of interest
data = mbm.data_processing.utils.get_rgi(data=data, glacier_outlines=glacier_outline)
data

[4]:

	YEAR	POINT_ID	POINT_LAT	POINT_LON	POINT_ELEVATION	TO_DATE	FROM_DATE	POINT_BALANCE	RGIId
0	1995	hn14aa	64.885013	-18.773871	1450.4	19950520	19940917	2.07	RGI60-06.00228
29	2000	hn15aa	64.869530	-18.774896	1501.4	20000923	19990923	0.52	RGI60-06.00228
31	2001	hn15aa	64.869530	-18.774896	1500.9	20010928	20010511	-0.47	RGI60-06.00228
32	2001	hn15aa	64.869530	-18.774896	1500.9	20010928	20000923	1.92	RGI60-06.00228
33	2017	blt11	64.687726	-18.954530	1077.0	20170501	20161016	1.85	RGI60-06.00232
34	2017	blt11	64.687726	-18.954530	1077.0	20171005	20170501	-2.97	RGI60-06.00232
35	2017	blt11	64.687726	-18.954530	1077.0	20171005	20161016	-1.12	RGI60-06.00232
36	2018	blt11	64.687726	-18.954530	1078.0	20180428	20171005	1.41	RGI60-06.00232
37	2018	blt11	64.687726	-18.954530	1078.0	20181008	20180428	-2.09	RGI60-06.00232
38	2018	blt11	64.687726	-18.954530	1078.0	20181008	20171005	-0.68	RGI60-06.00232
39	2019	blt11	64.687726	-18.954530	1076.0	20190430	20181008	1.31	RGI60-06.00232
40	2019	blt11	64.687726	-18.954530	1076.0	20191105	20190430	-4.18	RGI60-06.00232
41	2019	blt11	64.687726	-18.954530	1076.0	20191105	20181008	-2.87	RGI60-06.00232
42	2020	blt11	64.687726	-18.954530	1076.0	20200428	20191105	0.95	RGI60-06.00232
43	2020	blt11	64.687726	-18.954530	1076.0	20201010	20200428	-2.82	RGI60-06.00232
44	2020	blt11	64.687726	-18.954530	1076.0	20201010	20191105	-1.87	RGI60-06.00232
45	2021	blt11	64.687726	-18.954530	1076.0	20210429	20201010	0.89	RGI60-06.00232
46	2021	blt11	64.687726	-18.954530	1076.0	20211021	20210429	-3.48	RGI60-06.00232
47	2021	blt11	64.687726	-18.954530	1076.0	20211021	20201010	-2.59	RGI60-06.00232
48	2022	blt11	64.687726	-18.954530	1076.0	20220426	20211021	1.52	RGI60-06.00232
49	2022	blt11	64.687726	-18.954530	1076.0	20221017	20220426	-3.12	RGI60-06.00232
50	2022	blt11	64.687726	-18.954530	1076.0	20221017	20211021	-1.60	RGI60-06.00232
51	2017	blt8	64.662707	-18.954942	826.0	20170501	20161016	1.02	RGI60-06.00232
52	2017	blt8	64.662707	-18.954942	826.0	20171005	20170501	-4.64	RGI60-06.00232
53	2017	blt8	64.662707	-18.954942	826.0	20171005	20161016	-3.62	RGI60-06.00232
54	2018	blt8	64.662707	-18.954942	821.0	20180428	20171005	0.79	RGI60-06.00232
55	2018	blt8	64.662707	-18.954942	821.0	20181008	20180428	-3.42	RGI60-06.00232
30	2001	hn15aa	64.869530	-18.774896	1500.9	20010511	20000923	2.39	RGI60-06.00228
28	2000	hn15aa	64.869530	-18.774896	1501.4	20000923	20000513	-1.20	RGI60-06.00228
1	1995	hn14aa	64.885013	-18.773871	1450.4	19950916	19950520	-1.43	RGI60-06.00228
27	2000	hn15aa	64.869530	-18.774896	1501.4	20000513	19990923	1.73	RGI60-06.00228
2	1995	hn14aa	64.885013	-18.773871	1450.4	19950916	19940917	0.64	RGI60-06.00228
3	1996	hn14aa	64.885013	-18.773871	1449.8	19960511	19950916	1.83	RGI60-06.00228
4	1996	hn14aa	64.885013	-18.773871	1449.8	19961003	19960511	-1.30	RGI60-06.00228
5	1996	hn14aa	64.885013	-18.773871	1449.8	19961003	19950916	0.53	RGI60-06.00228
6	1999	hn14aa	64.885013	-18.773871	1448.3	19990515	19981004	NaN	RGI60-06.00228
7	1999	hn14aa	64.885013	-18.773871	1448.3	19990923	19990515	NaN	RGI60-06.00228
8	1999	hn14aa	64.885013	-18.773871	1448.3	19990923	19981004	1.04	RGI60-06.00228
9	2000	hn14aa	64.885013	-18.773871	1447.3	20000513	19990923	2.49	RGI60-06.00228
10	2000	hn14aa	64.885013	-18.773871	1447.3	20000923	20000513	-1.11	RGI60-06.00228
11	2000	hn14aa	64.885013	-18.773871	1447.3	20000923	19990923	1.38	RGI60-06.00228
12	2001	hn14aa	64.885013	-18.773871	1446.3	20010511	20000923	1.63	RGI60-06.00228
13	2001	hn14aa	64.885013	-18.773871	1446.3	20010928	20010511	-0.84	RGI60-06.00228
14	2001	hn14aa	64.885013	-18.773871	1446.3	20010928	20000923	0.79	RGI60-06.00228
15	2003	hn14aa	64.885013	-18.773871	1444.4	20030514	20021005	1.96	RGI60-06.00228
16	2003	hn14aa	64.885013	-18.773871	1444.4	20030924	20030514	-1.72	RGI60-06.00228
17	2003	hn14aa	64.885013	-18.773871	1444.4	20030924	20021005	0.25	RGI60-06.00228
18	1996	hn15aa	64.869530	-18.774896	1503.3	19960511	19950916	2.27	RGI60-06.00228
19	1996	hn15aa	64.869530	-18.774896	1503.3	19961003	19960511	-1.21	RGI60-06.00228
20	1996	hn15aa	64.869530	-18.774896	1503.3	19961003	19950916	1.06	RGI60-06.00228
21	1998	hn15aa	64.869530	-18.774896	1502.4	19980515	19970926	1.85	RGI60-06.00228
22	1998	hn15aa	64.869530	-18.774896	1502.4	19981004	19980515	-0.65	RGI60-06.00228
23	1998	hn15aa	64.869530	-18.774896	1502.4	19981004	19970926	1.20	RGI60-06.00228
24	1999	hn15aa	64.869530	-18.774896	1502.0	19990515	19981004	1.96	RGI60-06.00228
25	1999	hn15aa	64.869530	-18.774896	1502.0	19990923	19990515	-0.42	RGI60-06.00228
26	1999	hn15aa	64.869530	-18.774896	1502.0	19990923	19981004	1.54	RGI60-06.00228
56	2018	blt8	64.662707	-18.954942	821.0	20181008	20171005	-2.63	RGI60-06.00232

Then, we can create a Dataset, by using the loaded dataframe for Iceland stake data together with the matched RGI IDs.

[5]:

# Provide the column name for the column that has the RGI IDs for each of the stakes
dataset = mbm.data_processing.Dataset(cfg,
                                      data=data,
                                      region_name='iceland',
                                      region_id=6,
                                      data_path=repoPath +
                                      'notebooks/example_data/iceland/files/')

2. Get Topographical Features per Stake Measurement

Once we have created a Dataset, the first thing we can do is to add topographical data in our dataset. This can be done automatically with MassBalanceMachine, which internally calls OGGM, by doing the following:

[6]:

# Specify the topographical features of interest
# Please see the OGGM documentation what variables are available: https://oggm.org/tutorials/stable/notebooks/10minutes/machine_learning.html ('topo', 'slope_factor', 'dis_from_border')
voi_topographical = ['aspect', 'slope']
# You can also use the sky view factor ('svf') in which case it will be computed on-the-fly only the first time it is needed
# voi_topographical = ['aspect', 'slope', 'svf']

# Retrieve the topographical features for each stake measurement based on the latitude and longitude of the stake and add them to the dataset
dataset.get_topo_features(vois=voi_topographical)

2026-06-04 11:07:26: oggm.cfg: Reading default parameters from the OGGM `params.cfg` configuration file.
2026-06-04 11:07:26: oggm.cfg: Multiprocessing switched OFF according to the parameter file.
2026-06-04 11:07:26: oggm.cfg: Multiprocessing: using all available processors (N=2)
100% of  23.9 MiB |######################| Elapsed Time: 0:00:00 Time:  0:00:00
2026-06-04 11:07:27: oggm.cfg: PARAMS['border'] changed from `80` to `10`.
2026-06-04 11:07:27: oggm.cfg: Multiprocessing switched ON after user settings.
2026-06-04 11:07:27: oggm.cfg: PARAMS['continue_on_error'] changed from `False` to `True`.
2026-06-04 11:07:28: oggm.workflow: init_glacier_directories from prepro level 2 on 2 glaciers.
2026-06-04 11:07:28: oggm.workflow: Execute entity tasks [gdir_from_prepro] on 2 glaciers
100% of 117.9 MiB |######################| Elapsed Time: 0:01:26 Time:  0:01:26
2026-06-04 11:08:55: oggm.workflow: Execute entity tasks [gridded_attributes] on 2 glaciers

[7]:

df = dataset.data
df['MONTH_START'] = [str(date)[4:6] for date in df.FROM_DATE]
df['MONTH_END'] = [str(date)[4:6] for date in df.TO_DATE]
df.MONTH_START.unique(), df.MONTH_END.unique()

[7]:

(array(['09', '05', '10', '04', '11'], dtype=object),
 array(['05', '09', '10', '04', '11'], dtype=object))

3. Get Meteorological Features per Stake Measurement

Once we have the topographical data, we can add the necessary climate data for the dataset. There are two ways to do so:

manually download the dataset;
or configure your CDS token to use the CDSAPI and MassBalanceMachine will automatically download the data for you.

3.1. Manually download ERA5 data

This is done by pulling that from ERA5-Land database for the region of interest. Before the climate data is matched with the stake measurements, we need to manually download the climate data for the region of interest from Climate Copernicus website. Check the following options:

Field	Details
Product type:	Monthly averaged reanalysis
Variable:	2m temperature (t2m) Total precipitation (tp) Surface latent heat flux (slhf) Surface sensible heat flux (sshf) Surface solar radiation downwards (ssrd) Forecast albedo (fal) Surface net thermal radiation (str) … You can explore additional options as you prefer. In this example, we use the variables listed above.
Year:	Select all
Month:	Select all
Time:	Select all
Geographical area:	Either download for the entire Earth, or specify the coordinates for a bounding box in the region of interest. For the region of interest, provide a North, East, South, and West coordinate, specifying the two corners of the bounding box.
Format:	NetCDF-3

Then click Submit Request (after you have registered or logged into your account). Please be aware that downloading this data can take up to several hours, depending on the number of variables, timescales, and the area selected. Once the download is complete, place the netCDF file in the correct directory and rename it accordingly.

Additionally, we need the geopotential height as an extra variable in our dataset. We will calculate a new variable by determining the difference between the geopotential height and the elevation at the stake measurement. The purpose of this height difference is to encourage the model to use it for downscaling the ERA5Land climate data, rather than relying on the lapse rate. The geopotential is downloaded from here, and is already included in this example.

[8]:

# Specify the files of the climate data, that will be matched with the coordinates of the stake data
era5_climate_data = repoPath + 'notebooks/example_data/iceland/climate/era5_monthly_averaged_data.nc'
geopotential_data = repoPath + 'notebooks/example_data/iceland/climate/era5_geopotential_pressure.nc'

# Match the climate features, from the ERA5Land netCDF file, for each of the stake measurement dataset
dataset.get_climate_features(climate_data=era5_climate_data,
                             geopotential_data=geopotential_data)

3.2. Automatically download ERA5 data using the CDSAPI

This example is commented below since this can take some time to download and you need to create valid tokens on the CDS website.

[9]:

# Match the climate features, from the ERA5Land netCDF file, for each of the stake measurement dataset
# dataset.get_climate_features()

Climate data for each stake position is retrieved for each month of the hydrological year (1 Oct - 30 Sept) given by YEAR.

4. Transform Data to Monthly Time Resolution

Finally, we need to transform the dataset into a monthly resolution, accounting for a variable period (FROM_DATE and TO_DATE) for the SMB target data. This will be done in order to perform SMB predictions at a monthly time step, which then will be integrated both in time and space to match the available glaciological and geodetic SMB observations for the training. Please specify the climate variables used in the previous step in the list below. Consult the ERA5 Land documentation of all short names of the climate variables.

When accounting for variable periods MassBalanceMachine uses the closest start of month to FROM_DATE and TO_DATE and includes all months between. For example, FROM_DATE 19940917 and TO_DATE 19950520 will include all months from October 1994 through May 1995.

[10]:

# Specify the short names of the climate variables available in the dataset
vois_climate = ['t2m', 'tp', 'slhf', 'sshf', 'ssrd', 'fal', 'str']

# For each record, convert to a monthly time resolution
dataset.convert_to_monthly(vois_climate=vois_climate,
                           vois_topographical=voi_topographical)

Finally, we can take a look at the dataset which will be used for training.

[11]:

display(dataset.data)

	ELEVATION_DIFFERENCE	slope	MONTHS	POINT_ELEVATION	POINT_ID	YEAR	POINT_BALANCE	N_MONTHS	POINT_LAT	RGIId	...	POINT_LON	aspect	ID	t2m	tp	slhf	sshf	ssrd	fal	str
0	116.476388	0.041035	oct	1450.4	hn14aa	1995	2.07	8	64.885013	RGI60-06.00228	...	-18.773871	5.810070	0	267.885682	0.005071	-32688.346894	1.908546e+05	3.434260e+06	0.850005	-1.029337e+06
1	116.476388	0.041035	nov	1450.4	hn14aa	1995	2.07	8	64.885013	RGI60-06.00228	...	-18.773871	5.810070	0	266.376346	0.006053	301104.083653	8.280538e+05	8.424995e+05	0.849992	-1.431540e+06
2	116.476388	0.041035	dec	1450.4	hn14aa	1995	2.07	8	64.885013	RGI60-06.00228	...	-18.773871	5.810070	0	263.049011	0.005854	248241.745197	9.954409e+05	1.322171e+05	0.849992	-2.002829e+06
3	116.476388	0.041035	jan	1450.4	hn14aa	1995	2.07	8	64.885013	RGI60-06.00228	...	-18.773871	5.810070	0	261.692810	0.004156	348585.225978	1.243700e+06	4.884578e+05	0.849992	-1.792889e+06
4	116.476388	0.041035	feb	1450.4	hn14aa	1995	2.07	8	64.885013	RGI60-06.00228	...	-18.773871	5.810070	0	261.140088	0.002287	274514.643950	1.004845e+06	2.580602e+06	0.850005	-1.861757e+06
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
437	-246.593874	0.067458	may	821.0	blt8	2018	-2.63	12	64.662707	RGI60-06.00232	...	-18.954942	3.476423	54	271.553741	0.006217	-264554.591916	3.063329e+05	2.073085e+07	0.826504	-2.648321e+06
438	-246.593874	0.067458	jun	821.0	blt8	2018	-2.63	12	64.662707	RGI60-06.00232	...	-18.954942	3.476423	54	276.144439	0.003192	10994.304045	6.234279e+05	1.787212e+07	0.768536	-9.177811e+05
439	-246.593874	0.067458	jul	821.0	blt8	2018	-2.63	12	64.662707	RGI60-06.00232	...	-18.954942	3.476423	54	277.553280	0.003422	-589325.605420	7.537638e+04	1.425122e+07	0.676787	-7.246837e+05
440	-246.593874	0.067458	aug	821.0	blt8	2018	-2.63	12	64.662707	RGI60-06.00232	...	-18.954942	3.476423	54	276.193729	0.002495	-807105.778219	-1.563324e+05	1.332676e+07	0.677706	-1.884936e+06
441	-246.593874	0.067458	sep	821.0	blt8	2018	-2.63	12	64.662707	RGI60-06.00232	...	-18.954942	3.476423	54	273.151130	0.003491	-432479.565272	1.434597e+05	9.078256e+06	0.717923	-2.152404e+06

442 rows × 21 columns

We have finished preprocessing the training data for our machine learning model. You can either explore the dataset in this notebook or continue with the XGBoost model training notebook.