Regression case study part one: New Zealand's greenhouse gas emissions

Python Statistics and data science

31 January, 2021

Check out part two here.


This is the first of two posts describing the development and evaluation of a regression model in Python. In this first section, I'll describe the data itself—data on New Zealand's greenhouse gas emissions from 1990 to 2016—and in the second section, we'll look at how to construct a simple linear regression model. The second post in this series delves into multiple regression, and more advanced ways to evaluate model fit and specification.

Part one: the data


This project uses the New Zealand’s greenhouse gas emissions 1990–2016 dataset. The data were sourced from the Greenhouse Gas Inventory of New Zealand's Ministry for the Environment (MfE), and were released as part of MfE's Environment Aotearoa 2019 report. The Greenhouse Gas Inventory is produced by MfE every year, and forms part of New Zealand's reporting obligations under the United Nations Framework Convention on Climate Change and the Kyoto Protocol.

The dataset contains measurements of gases added to the atmosphere by human activities in New Zealand—that is, excluding natural sources of greenhouse gases, such as volcanic emissions and biological processes—from 1990 to 2016. The data were aggregated into sector-specific categories, and comprise 306 distinct numerical measurements of greenhouse gas emissions, spread across 17 years and five main sectors. The entire dataset is used for this project.


The data were obtained experimentally, by measuring the amount of greenhouse gas emissions that different sectors in New Zealand produce. The data were collected from 1990 to 2016, across New Zealand and by different reporting bodies.

While MfE compiles the Greenhouse Gas Inventory, the responsibility for reporting greenhouse gas emissions falls upon different government agencies, according to sector:

Finally, MfE provide further guidance on the measurement and reporting of greenhouse gas emissions in New Zealand.


The dataset comprises four variables:

The data were obtained in a wide format with the "Year" variable spread across columns. The numerical measure of greenhouse gas emissions is expressed in carbon dioxide equivalent (CO2-e) units, which measures how much global warming a given type and amount of greenhouse gas causes. This is used to enable us to compare the effect of different types of greenhouse gas in a consistent way, so that we can compare the relative contribution of different gases to the greenhouse effect, regardless of the actual amounts emitted. The values ranged from less than –30,000 CO2-e to over 80,000 CO2-e units. Negative values of CO2-e were attributed to the Land use sector due to the carbon sequestration effects of forestry. The emission sector and the year were treated as potential explanatory variables, while the CO2-e unit values were treated as the response variable.

Part two: simple regression

First steps with Python

Before we construct our model, let's have a quick look at the data, which were downloaded on 2 February 2020.

import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf

dat = (
            var_name='Year', value_name='Units')
            'Year': 'int32',
            'Gas': 'category',
            'Source': 'category'    


We've used pandas.melt to pivot the data to a long format, giving us Gas, Source and Year explanatory variables. Our Units variable contains the measured CO2-e units for each observation.

One thing you might notice straight away is that there is a negative number in the Units field. This occurs when Source is equal to "Land use, land-use change and forestry". Also known by its abbreviation of LULUCF, this category represents both emissions and removal of emissions from the atmosphere due to land-use changes. In New Zealand, for this period, the values are negative, indicating that LULUCF was a net remover of greenhouse gas emissions, rather than producer.

Keeping that in mind, I thought it'd be a good idea to whip up a quick plot showing the change of net greenhouse gas emissions (i.e., including LULUCF and the role of forestry in removing emissions from the atmosphere) compared to gross greenhouse gas emissions (ignoring removal) over the period of the data we have available:

summary = dat.query("Gas == 'All gases' & Source.isin(['All sources, Net (with LULUCF)', 'All sources, Gross (excluding LULUCF)'])")
summary.Source =
sns.lineplot(data=summary, x='Year', y='Units', hue='Source')

It's clear that LULUCF really does play an important role in mediating New Zealand's greenhouse gas emissions. But back to the point of this post, which is to demonstrate simple regression with Python—let's construct a model that uses Year as the explanatory variable, and net emissions as the response variable. Basically, we'll see if that trend that we see with our own two eyes—emissions increasing over time—is statistically significant in a linear regression model.

I should add at this point that, of course, Year is absolutely not a real explanatory variable. Arbitrarily changing the year shouldn't change anything about greenhouse gas emissions. Instead, Year will be correlated with the true explanatory variables, like increases in number of cars on the road, increases in number of dairy farms, etc.

The model

Actually creating a simple linear regression model is really straightforward. We use the statsmodels.formula.api function ols (for ordinary least-squares), and fit the model on the data for net greenhouse gas emission units:

net_summary = summary[summary.Source == "All sources, Net (with LULUCF)"]
reg = smf.ols('Units ~ Year', data=net_summary).fit()

Printing the output of the model gives us a bunch of useful (and at first confusing) information:

Underneath these statistics is the coefficient table, with two entries: Intercept and Year.

And that's basically it for our first model. Our fit suggests that Year is a pretty good predictor of net greenhouse gas emissions, with emissions tending to increase by between around 700 to 1100 CO2-e units per year!

Next, we're going to look at multiple regression and evaluating model fit.

You can find all the code used in this post here.



Leave a comment