Predicting air quality using deep learning.
If you read my previous blog post about how I started my journey into deep learning, you might remember that I’ve been studying the fastai course. If you didn’t read it, that’s okay, I’ll paste the intro here. This will help you understand where I’m coming from.
The [fastai] course takes a very practical approach to deep learning. It helps you build things quickly and gradually understand the concepts behind them, busting myths and misconceptions along the way. From day one, the course has you building and training models…
After several weeks of studying, I believe this statement still holds true.
In addition to the various exercises and homework the course provides, I know I learn best if I can find a tangible project to apply whatever I’ve learned. My previous project was to train a convolutional neural network (CNN) to recognize pictures of my cat. I talk about it a little bit here.
Next, I wanted to find a project dealing with structured data like tabular data (CSV) because this is what the course was covering then.
New project
I’ve been living in Montreal, QC for years now and I’m always interested in creating projects about the city. Last summer, we had days with bad air quality due to wildfires raging in the north, and I thought it would be interesting to try to predict the air quality index (AQI) in Montreal using deep learning. Predicting meteorological phenomena is a well-established field and many researchers are working on accurate models, but I wanted to see how far I could get with the tools I’ve learned so far and the data that’s publicly available.
This blog post will provide a high-level overview of my approach. However, the accompanying Jupyter notebook will be much more detailed and contain all the necessary steps to reproduce my results.
Here’s my plan:
- Research public data
- Preprocess data (feature engineering, cleaning, …)
- Create a training, validation and test set
- Predict past AQI values
- Profit
For the impatient, here’s the link to the final Jupyter notebook where you can find the code and replicate the results. Additionally, I’ve extracted all the code samples shared in this post into their own notebook, so you can run them alongside this post.
Research
I was excited about this project, but without data I wouldn’t get very far.
I knew that Montreal keeps a website with many different kinds of datasets, so I started poking around and found a dataset with air quality measurements covering the city. Several pollutants are measured, including sulfur dioxide (SO2), carbon monoxide (CO), ozone (O3), nitrogen dioxide (NO2), and fine particulate matter (PM2.5). The dataset contains the computed AQI for each pollutant at each station on an hourly basis.
The website describes how the AQI value is calculated for each pollutant, and I invite you to take a look if you’re interested, but this isn’t important for this exercise.
It looks like the whole dataset is split into files covering a few years each and goes all the way back to 2007. That’s a good chunk of data.
Let’s download the most recent file to see what it looks like.
curl -o aqi-2022-2024.csv -A "some-user-agent" -L https://donnees.montreal.ca/dataset/547b8052-1710-4d69-8760-beaa3aa35ec6/resource/0c325562-e742-4e8e-8c36-971f3c9e58cd/download/rsqa-indice-qualite-air-2022-2024.csv
Note: `-A` is used to spoof a user agent, because the city seems to (naively) block requests without one, and `-L` follows the redirect to the actual file. Also, all code snippets will be in Python from now on, since this is the preferred language for machine learning.
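If you’d rather stay in Python for this step too, a rough equivalent using requests (a sketch, not what I actually ran) would be:

import requests

# Rough equivalent of the curl command above. The custom User-Agent is only there
# because the server appears to reject requests without one.
url = (
    "https://donnees.montreal.ca/dataset/547b8052-1710-4d69-8760-beaa3aa35ec6/"
    "resource/0c325562-e742-4e8e-8c36-971f3c9e58cd/download/"
    "rsqa-indice-qualite-air-2022-2024.csv"
)
response = requests.get(url, headers={"User-Agent": "some-user-agent"})
response.raise_for_status()

with open("aqi-2022-2024.csv", "wb") as f:
    f.write(response.content)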
Let’s look at the first few lines of the file.
import pandas as pd
df = pd.read_csv("aqi-2022-2024.csv")
df
stationId | pollutant | value | date | hour |
---|---|---|---|---|
103 | O3 | 15 | 2022-01-15 | 3 |
103 | NO2 | 2 | 2022-01-15 | 3 |
103 | PM | 12 | 2022-01-15 | 3 |
17 | CO | 1 | 2022-02-04 | 21 |
17 | O3 | 17 | 2022-02-04 | 21 |
… | … | … | … | … |
28 | CO | 1 | 2024-06-23 | 15 |
28 | O3 | 25 | 2024-06-23 | 15 |
28 | NO2 | 2 | 2024-06-23 | 15 |
28 | PM | 13 | 2024-06-23 | 15 |
50 | PM | 12 | 2024-06-23 | 15 |
The original headers are in French (I’ve already translated them in the preview above), so let’s rename them to make the rest of the post easier to read.
df.rename(columns={
'polluant': 'pollutant',
'valeur': 'value',
'heure': 'hour'
}, inplace=True)
As you can see, we have a `value` per `stationId` and `pollutant` for a given date/hour.
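Before aggregating anything, a quick exploratory look at the scale of the data doesn’t hurt:

# How many stations are reporting, and which pollutants show up in the file?
print(df['stationId'].nunique())
print(df['pollutant'].unique())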
My goal is to predict one AQI value for the whole city at any given datetime. This means we have to prepare the data and get a single AQI for each date/hour so we can use it to train our model.
I decided to take the `max` AQI value per `stationId` on a given date/hour. This assumes all pollutants contribute equally to the city AQI. That might not be the case, but for our exercise I think it’s an acceptable assumption. Something like this would do:
# Creating a datetime column for easier manipulation later on
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['hour'].astype(str) + ':00:00',
format = '%Y-%m-%d %H:%M:%S',
errors = 'coerce')
# Dropping columns we won't need anymore
df.drop(["hour", "pollutant", "date"], axis=1, inplace=True)
# Index by station and datetime, keeping the maximum value for each group (per our earlier assumption)
df = df.groupby(['stationId', 'datetime'])[['value']].max()
Indexing by station and datetime gives us the maximum recorded value across all pollutants for a given station at a specific time. I think this is fairer than taking the maximum for a given datetime without considering the station, because if one part of the island is more heavily polluted than another, it would drag the AQI for the whole territory upward.
Now that we have the maximum AQI for each station and each datetime, we can take the mean across all stations for a particular date to get a single AQI value.
df = df.groupby("datetime").mean("value").reset_index()
# Sorting the dataframe by datetime for better visualization
df.sort_values("datetime", inplace=True)
df
datetime | value |
---|---|
2022-01-01 00:00:00 | 47.818182 |
2022-01-01 01:00:00 | 53.727273 |
2022-01-01 02:00:00 | 60.272727 |
2022-01-01 03:00:00 | 65.454545 |
2022-01-01 04:00:00 | 65.363636 |
… | … |
2024-06-24 19:00:00 | 16.363636 |
2024-06-24 20:00:00 | 14.636364 |
2024-06-24 21:00:00 | 18.090909 |
2024-06-24 22:00:00 | 24.272727 |
2024-06-24 23:00:00 | 25.727273 |
Nice! This is already good progress.
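If you want to sanity-check the aggregation at this point, a quick assertion and plot (assuming matplotlib is installed) is enough:

# After the groupby there should be at most one row per datetime,
# and a quick plot gives a feel for the seasonality and the occasional spike
assert df['datetime'].is_unique
df.set_index('datetime')['value'].plot(figsize=(12, 4), title='Hourly city-wide AQI')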
We could be tempted to include a few more years’ worth of data, process it like we just did, and train our model, but I can tell you it wouldn’t be very accurate (trust me, I tried). I believe this is because the neural net can’t extract much more from this data than simple time-of-day and seasonal variation, and from my tests that wasn’t enough to accurately predict AQI. There must be other factors at play that we need to consider.
Additional data
I’ve been thinking about what could influence the AQI in Montreal and figured that factors like precipitation, temperature and humidity would be good candidates. Unfortunately, I couldn’t find any up-to-date dataset on the city’s website, so I had to look elsewhere. After a bit of browsing, I realized that the federal government also has public meteorological data available on their website. Just what I needed!
They have an API to request the data, but let’s say that it is questionable. I’ll spare you all the details, but we cannot request data for a range of months/years so we’ll have to download each month’s data individually and concatenate them all! This is fun.
The file from the city of Montreal ran from 2022 to 2024, so we’ll have to match this range for the meteorological data as well. Something like this should do the trick (the notebook parallelizes this process to make it faster):
# This station is located close to the airport and is the one I found holds the most interesting data
station_id="30165"
years = range(2022, 2025)
months = range(1, 13)
# Hourly data
timeframe="1"
dates = [(month, year) for year in years for month in months]
all_files = []
for date in dates:
    month, year = date
    url = f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={station_id}&Year={year}&Month={month}&Day=1&timeframe={timeframe}&submit=Download+Data"
    all_files.append(pd.read_csv(url))
weather_df = pd.concat(all_files)
weather_df.rename(columns={'Date/Time (LST)': 'datetime', "Temp (°C)": "temp", "Precip. Amount (mm)": "precip", "Rel Hum (%)": "rel_humid"}, inplace=True)
weather_df['datetime'] = pd.to_datetime(weather_df['datetime'])
weather_df = weather_df[['datetime', "temp", "rel_humid", "precip"]]
weather_df
The CSV is large, so I’ve cleaned it up and selected only the columns we care about (there might be other interesting variables, but for now, we’ll stick to these).
datetime | temp | rel_humid | precip |
---|---|---|---|
2022-01-01 00:00:00 | 0.0 | 93.0 | 0.0 |
2022-01-01 01:00:00 | 0.1 | 94.0 | 0.0 |
2022-01-01 02:00:00 | 0.1 | 94.0 | 0.0 |
2022-01-01 03:00:00 | 0.1 | 97.0 | 0.0 |
2022-01-01 04:00:00 | -0.5 | 96.0 | 0.0 |
… | … | … | … |
2024-12-31 19:00:00 | NaN | NaN | NaN |
2024-12-31 20:00:00 | NaN | NaN | NaN |
2024-12-31 21:00:00 | NaN | NaN | NaN |
2024-12-31 22:00:00 | NaN | NaN | NaN |
2024-12-31 23:00:00 | NaN | NaN | NaN |
Alright! Well, this looks promising, but it contains future dates (at least, at the time of writing). We’ll clean these up later.
I’m not a meteorologist but my hypothesis is that these metrics can influence the AQI, so we’ll give them to our model so it can extract patterns from the data. For example, would high temperature and high humidity lead to a worse AQI? I don’t know, but I’m excited to find out!
Combining data
We now have two datasets, one with past AQI values, from which our model could extract AQI variations tied to the time of day and seasonality, and another with meteorological data that could help the model understand the influence of temperature, humidity and precipitation.
To train our model, we’ll merge the two pandas dataframes using the `datetime` column as the key (both `datetime` columns use the local timezone).
merged_df = pd.merge(df, weather_df, on="datetime", how="left")
merged_df
datetime | value | temp | rel_humid | precip |
---|---|---|---|---|
2022-01-01 00:00:00 | 47.818182 | 0.0 | 93.0 | 0.0 |
2022-01-01 01:00:00 | 53.727273 | 0.1 | 94.0 | 0.0 |
2022-01-01 02:00:00 | 60.272727 | 0.1 | 94.0 | 0.0 |
2022-01-01 03:00:00 | 65.454545 | 0.1 | 97.0 | 0.0 |
2022-01-01 04:00:00 | 65.363636 | -0.5 | 96.0 | 0.0 |
… | … | … | … | … |
2024-06-24 19:00:00 | 16.363636 | 24.2 | 52.0 | 0.0 |
2024-06-24 20:00:00 | 14.636364 | 23.7 | 53.0 | 0.0 |
2024-06-24 21:00:00 | 18.090909 | 22.7 | 61.0 | 0.0 |
2024-06-24 22:00:00 | 24.272727 | 21.5 | 72.0 | 0.0 |
2024-06-24 23:00:00 | 25.727273 | 20.7 | 74.0 | 0.0 |
Looking good.
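Before going further, a crude way to eyeball whether these weather variables move with the AQI at all is a simple correlation matrix (just a sanity check, not part of the training pipeline):

# Pairwise correlations between the AQI and the weather variables
merged_df[['value', 'temp', 'rel_humid', 'precip']].corr()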
Feature engineering
We have our data, but one thing I’ve learned is that we can help the training process by engineering new features or normalizing the existing ones. Neural networks are sensitive to how the data is presented.
First, we’ll create new categorical variables from `datetime`. Categorical variables (as opposed to continuous variables) hold a finite set of values, which helps the model learn relationships between the categories and can lead to better predictions. In this case, our neural network will try to pick up any patterns between time and AQI.
Additionally, we’ll want to normalize the AQI values. Deep learning models tend to prefer inputs scaled between `0` and `1`, so that large numbers don’t dominate the predictions. We’ll divide the AQI by its maximum.
merged_df['year'] = merged_df['datetime'].dt.year
# Year has a bigger range than the rest so we divide it up by its maximum to scale it down.
merged_df['year'] = merged_df['year'] / merged_df['year'].max()
merged_df['month'] = merged_df['datetime'].dt.month
merged_df['day'] = merged_df['datetime'].dt.day
merged_df['hour'] = merged_df['datetime'].dt.hour
merged_df['weekday'] = merged_df['datetime'].dt.weekday
# Values above 100 are extreme outliers (and very rare for Montreal). Clamping keeps the model from being influenced too much by these rare events.
merged_df['value'] = merged_df['value'].clip(upper=100)
max_value = merged_df['value'].max()
merged_df['value'] = merged_df['value'] / max_value
merged_df
datetime | value | temp | rel_humid | precip | year | month | day | hour | weekday |
---|---|---|---|---|---|---|---|---|---|
2022-01-01 00:00:00 | 0.456439 | 0.0 | 93.0 | 0.0 | 0.999012 | 1 | 1 | 0 | 5 |
2022-01-01 01:00:00 | 0.517992 | 0.1 | 94.0 | 0.0 | 0.999012 | 1 | 1 | 1 | 5 |
2022-01-01 02:00:00 | 0.586174 | 0.1 | 94.0 | 0.0 | 0.999012 | 1 | 1 | 2 | 5 |
2022-01-01 03:00:00 | 0.640152 | 0.1 | 97.0 | 0.0 | 0.999012 | 1 | 1 | 3 | 5 |
2022-01-01 04:00:00 | 0.639205 | -0.5 | 96.0 | 0.0 | 0.999012 | 1 | 1 | 4 | 5 |
… | … | … | … | … | … | … | … | … | … |
2024-06-24 19:00:00 | 0.128788 | 24.2 | 52.0 | 0.0 | 1.000000 | 6 | 24 | 19 | 0 |
2024-06-24 20:00:00 | 0.110795 | 23.7 | 53.0 | 0.0 | 1.000000 | 6 | 24 | 20 | 0 |
2024-06-24 21:00:00 | 0.146780 | 22.7 | 61.0 | 0.0 | 1.000000 | 6 | 24 | 21 | 0 |
2024-06-24 22:00:00 | 0.211174 | 21.5 | 72.0 | 0.0 | 1.000000 | 6 | 24 | 22 | 0 |
2024-06-24 23:00:00 | 0.226326 | 20.7 | 74.0 | 0.0 | 1.000000 | 6 | 24 | 23 | 0 |
There it is: a single dataset with normalized values that we can use to train our model. There are probably more features we could engineer, or other normalization techniques we could experiment with, but for the purpose of this exercise, I think this is already good enough.
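As one example of a feature I didn’t use here, the hour of day could be given a cyclical (sin/cos) encoding so the model sees 23:00 and 00:00 as neighbours; a sketch would look like this:

import numpy as np

# Hypothetical extra features (not used in the rest of this post): encode the hour as a
# point on a circle so that 23:00 and 00:00 end up close together
hour_sin = np.sin(2 * np.pi * merged_df['hour'] / 24)
hour_cos = np.cos(2 * np.pi * merged_df['hour'] / 24)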
Actually, there’s one more thing we need to do: make sure we don’t have any missing values. This is important because neural networks need to deal with numbers, and missing values (represented as `NaN` in `pandas`) aren’t numbers. It is easy enough to handle this with `pandas`.
merged_df.isna().sum()
datetime 0
value 0
temp 14
rel_humid 14
precip 22
year 0
month 0
day 0
hour 0
weekday 0
dtype: int64
It looks like we have some missing temperatures, humidity, and precipitation. There are many ways to deal with them, but here’s how I’ve fixed the issues:
merged_df.fillna({"precip": 0,
                  "temp": merged_df['temp'].bfill(),
                  "rel_humid": merged_df['rel_humid'].mode()[0]}, inplace=True)
# Assert that we've taken care of all missing values
assert merged_df[merged_df.isna().any(axis=1)].empty
- For precipitation, assuming that a missing value means there wasn’t any seemed fair.
- For temperature, I decided to backfill the missing values: each gap is filled with the next available reading. Because we’re dealing with hourly data, that neighbouring value is usually only an hour or so away, so there shouldn’t be any major variation (an interpolation-based alternative is sketched just after this list).
- For humidity, taking the mode is a reasonable approach: we pick the most common value across the whole dataset.
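As an aside, another common way to handle short gaps in an hourly series is interpolation. This is not what the code above does, but a minimal sketch would be:

# Alternative: linearly interpolate short gaps instead of backfilling / taking the mode
merged_df['temp'] = merged_df['temp'].interpolate(limit_direction='both')
merged_df['rel_humid'] = merged_df['rel_humid'].interpolate(limit_direction='both')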
Ooof, that was a lot!
Training
Training a neural network on structured data like CSVs is usually a cheap-enough process that can be done on a laptop’s GPU or even a CPU. Ultimately, the Jupyter notebook trains on roughly 10 years’ worth of data, which should take less than a minute on the free T4 GPUs from Google Colab or any other platform.
With that said, before we can train, we need to create a training and a validation set. The training set is what the model uses to learn and extract patterns from the data (its loss is what drives the weight updates), while the validation set is held out and used during training to measure how well the model’s predictions hold up on data it isn’t learning from.
Loss is the penalty for a bad prediction. That is, loss is a number indicating how inaccurate the model’s prediction was on a single example. A high loss means less accurate predictions. Read more.
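As a tiny worked example, here’s the mean absolute error (the metric we’ll track during training) computed by hand:

import numpy as np

# MAE over three toy predictions: (|0.1| + |0.2| + |0.1|) / 3
toy_preds = np.array([0.20, 0.50, 0.90])
toy_actuals = np.array([0.10, 0.70, 0.80])
print(np.abs(toy_preds - toy_actuals).mean())  # ~0.133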
Something very important to note: the validation data must be withheld from the training set. If you validate on the same data you trained on, you have no way of knowing whether the model generalizes to new data or has simply overfitted. In other words, a model that remembers the training data too well won’t be able to make predictions on unseen data, which makes it completely useless for you or your organization and a potential waste of time and money.
Depending on the nature of the data you’re dealing with, there are various ways of splitting it into a training and validation set. In many cases, shuffling the data and holding out a random subset as the validation set is good enough. In our case, however, we’re dealing with time series data, and we need to be careful to preserve its intrinsic order: time is an important factor when predicting AQI.
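For contrast, here’s what a shuffled split might look like. It’s perfectly fine for independent samples, but for our time series it would leak future information, so we won’t use it:

import numpy as np

# Random 80/20 split (NOT used here): shuffle the row positions and slice
rng = np.random.default_rng(42)
shuffled = rng.permutation(len(merged_df))
cut = int(0.8 * len(shuffled))
random_train_idxs, random_valid_idxs = shuffled[:cut].tolist(), shuffled[cut:].tolist()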
For the purpose of this blog post, we’ve only been dealing with data from 2022 to 2024. This won’t be enough for accurate predictions, but we’ll go through the exercise anyway. Remember that the notebook deals with much more data.
Let’s split our datasets.
import numpy as np
date_valid = pd.Timestamp('2023-01-01')
date_test = pd.Timestamp('2024-01-01')
# Arbitrary date. This ensures we won't deal with future dates in our data set (unless you're reading this in 2025)
now = pd.Timestamp('2024-06-01')
# From Jan 01, 2022 -> Dec 31, 2022
train_idx = merged_df['datetime'] < date_valid
# From Jan 01, 2023 -> Dec 31, 2023
valid_idx = (merged_df['datetime'] >= date_valid) & (merged_df['datetime'] < date_test)
# From Jan 01, 2024 -> May 31, 2024
test_idx = (merged_df['datetime'] >= date_test) & (merged_df['datetime'] < now)
train_idxs = np.where(train_idx)[0].tolist()
valid_idxs = np.where(valid_idx)[0].tolist()
test_idxs = np.where(test_idx)[0].tolist()
We’ve created lists of indexes that we’ll use to reference rows in our data. Notice the additional `test_idxs`. Previously, we talked about a training set and a validation set, but we can also create a test set that is completely excluded from the training process. This is a precaution in case we’ve overfitted our model to the validation set; we’ll use it to make sure the model can properly generalize.
We can verify that each of our sets ends at the correct date.
assert merged_df.iloc[train_idxs[-1]]['datetime'] == pd.Timestamp('2022-12-31 23:00:00')
assert merged_df.iloc[valid_idxs[-1]]['datetime'] == pd.Timestamp('2023-12-31 23:00:00')
assert merged_df.iloc[test_idxs[-1]]['datetime'] == pd.Timestamp('2024-05-31 23:00:00')
Finally, we have continuous training, validation, and test sets. We can now train our model and test its predictions. As mentioned at the beginning of this post, I’m studying the `fastai` course, so naturally, we’ll train using `fastai`.
from fastai.tabular.all import *
# Split variables into categorical and continuous (anything with more than 20 distinct values is treated as continuous)
cont,cat = cont_cat_split(merged_df, max_card=20, dep_var='value')
dls = TabularPandas(
merged_df,
procs=[Categorify, Normalize],
cat_names=cat,
cont_names=cont,
y_names=['value'],
splits=(train_idxs, valid_idxs),
y_block=RegressionBlock()
).dataloaders(bs=2048)
Let’s see what is going on here. Training a tabular model using `fastai` requires creating a `TabularPandas` object.
- We pass the `merged_df` dataframe to `TabularPandas`.
- We specify the `procs` we want to apply to the data. Think of procs as additional pre-processing steps that `fastai` can handle for us, such as categorifying categorical variables and normalizing continuous ones.
- We specify which variables are categorical and which are continuous (provided by `cont_cat_split`).
- We tell `fastai` which column is the dependent variable (the one we want to predict, often named `y`).
- We provide our data split into training and validation sets.
- Finally, we tell `fastai` we’re dealing with a regression problem (predicting a continuous value).
Lastly, we turn this `TabularPandas` object into a `DataLoaders` object with a batch size (`bs`) of 2048. You can use any number here, but remember: larger batch sizes let the GPU work faster, just be careful not to run out of memory.
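If you want to double-check what the dataloaders actually produce, you can peek at a single batch; for tabular data it should be a tuple of categorical inputs, continuous inputs, and targets:

# Grab one batch and inspect the tensor shapes
batch = dls.one_batch()
print([t.shape for t in batch])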
We can now proceed with training. We’ll use a model with two hidden layers of 250 and 100 activations respectively, track Mean Absolute Error (MAE) as our metric (the loss itself defaults to mean squared error for regression), and constrain predictions to a number between `0` and `1`. We train for a few epochs…
# In fastai, the object that wraps the model and the training loop is called a Learner
learn = tabular_learner(dls, metrics=mae, layers=[250, 100], y_range=(0, 1))
# 10 epochs
learn.fit_one_cycle(10)
Epoch | Train Loss | Valid Loss | MAE |
---|---|---|---|
0 | 0.124582 | 0.044517 | 0.189672 |
1 | 0.116288 | 0.665172 | 0.810327 |
2 | 0.109973 | 0.665172 | 0.810327 |
3 | 0.103871 | 0.665172 | 0.810327 |
4 | 0.098426 | 0.665172 | 0.810327 |
5 | 0.093807 | 0.665172 | 0.810327 |
6 | 0.089975 | 0.665172 | 0.810327 |
7 | 0.086668 | 0.665172 | 0.810327 |
8 | 0.083967 | 0.665172 | 0.810327 |
9 | 0.081800 | 0.665172 | 0.810327 |
Voilà! Except… it hasn’t really learned anything useful.
How can I tell? The `train_loss` keeps decreasing, but the `valid_loss` jumps up and then stays flat. This is a classic symptom of overfitting: the model cannot generalize. We can also verify this by measuring the loss on our test set.
df_test = merged_df.loc[test_idxs]
test_dl = dls.test_dl(df_test)
preds, targets = learn.get_preds(dl=test_dl)
mae(preds, targets)
-> TensorBase(0.8181)
Not great. This is most likely due to the fact that we don’t have enough data. Remember, we’ve trained on one year of data from 2022 to 2023 and then validated using data from 2023 to 2024.
Let’s validate this theory by adding a bit more data to see if our loss improves. I’ve naively copied and pasted all the code we’ve written before into one cell, but expanded the date range from 2019 to 2024.
1. Download files from the city’s database
curl -o aqi-2022-2024.csv -A "some-user-agent" -L https://donnees.montreal.ca/dataset/547b8052-1710-4d69-8760-beaa3aa35ec6/resource/0c325562-e742-4e8e-8c36-971f3c9e58cd/download/rsqa-indice-qualite-air-2022-2024.csv
curl -o aqi-2019-2021.csv -A "some-user-agent" -L https://donnees.montreal.ca/dataset/547b8052-1710-4d69-8760-beaa3aa35ec6/resource/e43dc1d6-fbdd-49c3-a79f-83f63404c281/download/rsqa-indice-qualite-air-2019-2021.csv
2. Preprocess the past AQI dataset
df1 = pd.read_csv("./aqi-2022-2024.csv")
df2 = pd.read_csv("./aqi-2019-2021.csv")
df = pd.concat([df1, df2])
df.rename(columns={
    'polluant': 'pollutant',
    'valeur': 'value',
    'heure': 'hour'
}, inplace=True)
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['hour'].astype(str) + ':00:00',
format = '%Y-%m-%d %H:%M:%S',
errors = 'coerce')
df.drop(["hour", "pollutant", "date"], axis=1, inplace=True)
df = df.groupby(['stationId', 'datetime'])[['value']].max()
df = df.groupby("datetime")[['value']].mean().reset_index()
df.sort_values("datetime", inplace=True)
3. Download meteorological data
Notice how I’ve expanded the date range to start in 2019. I’ve also taken the liberty of parallelizing the process.
import pandas as pd
import concurrent.futures
station_id = "30165"
years = range(2019, 2025)
months = range(1, 13)
timeframe = "1"
dates = [(month, year) for year in years for month in months]
def download_data(date):
    month, year = date
    url = f"https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID={station_id}&Year={year}&Month={month}&Day=1&timeframe={timeframe}&submit=Download+Data"
    return pd.read_csv(url)

with concurrent.futures.ThreadPoolExecutor() as executor:
    all_files = executor.map(download_data, dates)
weather_df = pd.concat(all_files)
weather_df.rename(columns={'Date/Time (LST)': 'datetime', "Temp (°C)": "temp", "Precip. Amount (mm)": "precip", "Rel Hum (%)": "rel_humid"}, inplace=True)
weather_df['datetime'] = pd.to_datetime(weather_df['datetime'])
weather_df = weather_df[['datetime', "temp", "rel_humid", "precip"]]
4. Add features and normalize
merged_df = pd.merge(df, weather_df, on="datetime", how="left")
merged_df['year'] = merged_df['datetime'].dt.year
merged_df['year'] = merged_df['year'] / merged_df['year'].max()
merged_df['month'] = merged_df['datetime'].dt.month
merged_df['day'] = merged_df['datetime'].dt.day
merged_df['hour'] = merged_df['datetime'].dt.hour
merged_df['weekday'] = merged_df['datetime'].dt.weekday
merged_df['value'] = merged_df['value'].clip(upper=100)
max_value = merged_df['value'].max()
merged_df['value'] = merged_df['value'] / max_value
merged_df.fillna({"precip": 0,
                  "temp": merged_df['temp'].bfill(),
                  "rel_humid": merged_df['rel_humid'].mode()[0]}, inplace=True)
5. Split the datasets
Our training set now runs from 2019 to 2023. Previously, we trained from 2022 to 2023.
import numpy as np
from fastai.tabular.all import *
date_valid = pd.Timestamp('2023-01-01')
date_test = pd.Timestamp('2024-01-01')
now = pd.Timestamp('2024-06-01')
# From Jan 01, 2019 -> Dec 31, 2022
train_idx = merged_df['datetime'] < date_valid
# From Jan 01, 2023 -> Dec 31, 2023
valid_idx = (merged_df['datetime'] >= date_valid) & (merged_df['datetime'] < date_test)
# From Jan 01, 2024 -> May 31, 2024
test_idx = (merged_df['datetime'] >= date_test) & (merged_df['datetime'] < now)
train_idxs = np.where(train_idx)[0].tolist()
valid_idxs = np.where(valid_idx)[0].tolist()
test_idxs = np.where(test_idx)[0].tolist()
cont,cat = cont_cat_split(merged_df, max_card=20, dep_var='value')
dls = TabularPandas(
merged_df,
procs=[Categorify, Normalize],
cat_names=cat,
cont_names=cont,
y_names=['value'],
splits=(train_idxs, valid_idxs),
y_block=RegressionBlock()
).dataloaders(bs=2048)
6. Train
learn = tabular_learner(dls, metrics=mae, layers=[250, 100], y_range=(0, 1))
learn.fit_one_cycle(10)
Epoch | Train Loss | Valid Loss | MAE |
---|---|---|---|
0 | 0.138118 | 0.071449 | 0.259237 |
1 | 0.126728 | 0.071438 | 0.254919 |
2 | 0.110009 | 0.093728 | 0.286704 |
3 | 0.091794 | 0.076902 | 0.249465 |
4 | 0.075047 | 0.057618 | 0.200717 |
5 | 0.060928 | 0.046950 | 0.167216 |
6 | 0.049792 | 0.042691 | 0.153482 |
7 | 0.041468 | 0.041640 | 0.145898 |
8 | 0.035485 | 0.041033 | 0.145078 |
9 | 0.031253 | 0.041656 | 0.146126 |
Much better! Look how the validation loss and our metric (`MAE`) decreased this time around. This is a good sign that our model is learning to generalize. Let’s validate against our test set.
df_test = merged_df.loc[test_idxs]
test_dl = dls.test_dl(df_test)
preds, targets = learn.get_preds(dl=test_dl)
mae(preds, targets)
-> TensorBase(0.2495)
In our previous training session, we ended up with a mean absolute error of `0.8181` on the test set, and now we get `0.2495`. This is a big improvement; remember, the lower the error, the better our model. And because we measured it on the test set, we have some guarantee that our model is able to generalize to unseen data.
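To make that number a bit more tangible, we can undo the normalization and express the error in (clipped) AQI points:

# Convert the normalized MAE back into AQI points using the max_value we divided by earlier
print(float(mae(preds, targets)) * max_value)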
As a last bit of fun, let’s plot the predictions against the actual values for the test set.
import json

def mk_analysis_df(df, preds):
    # Build a dataframe comparing actual vs. predicted AQI, rescaled back to the original range
    _df = pd.DataFrame({
        'datetime': df['datetime'],
        'Actual': df['value'] * max_value,
        'Predicted': preds.flatten() * max_value,
    })
    _df['Error'] = abs((_df['Actual'] - _df['Predicted']) / _df['Actual']) * 100
    return _df

export_df = mk_analysis_df(df_test, preds.flatten())
export_df.rename(columns={'datetime': 'date'}, inplace=True)
json_output = export_df.to_json(orient='records', indent=2)
json_dict = json.loads(json_output)
result_dict = {"results": json_dict}
json.dumps(result_dict)
[Chart: actual vs. predicted AQI over the test period]
I know it doesn’t look like it, but this isn’t too bad.
We can see some correlation between the actual and predicted values. Some of the variations look similar, but the model consistently predicts too high. Feeding in a lot more data might help it better capture the influence of our variables. This is not amazing, but it is a good start. You might also get slightly different results, since training is not a deterministic process.
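One quick way to put a number on that systematic over-prediction is the mean signed error over the test set:

# Positive values mean the model predicts higher than the actual AQI on average
print((export_df['Predicted'] - export_df['Actual']).mean())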
If you’re interested, I’ve plotted my own results here after using the Jupyter notebook.
Conclusion
Wow, if you made it this far, congratulations! I know this was a lengthy post, but we went through quite a lot of steps, from collecting data to cleaning it up and feeding it to a model during training. I’ve learned a lot from this exercise, and I hope you’ve learned a thing or two as well. There are probably thousands of ways to improve my approach and create new experiments with more data, including wind, individual pollutants or even other external sources like traffic information. This is all very exciting, and who knows, perhaps the city council of Montreal would be interested in this project as well 😎.
Don’t forget to check out the Jupyter notebook if you want to see results on a bigger dataset, as well as my various explorations.
I hope you’ve enjoyed this post and that it has been helpful to you. If you have any questions or feedback, feel free to reach out to me.