November 15, 2019 / by /

### Outlier detection in Air Pollution data

On a daily basis, economical, technological and political changes cause air pollution, which is currently one of the most pressing environmental issues around the globe. It is particularly acute in large cities with the rising level of industrialization and growing urbanization. The polluted air we breathe results in serious health disorders and decreases the quality of life. Before steps are taken to decrease the harmful effect of daily produced pollutants, certain measurements are required.

Government agencies use **air quality index** to communicate with the
public. An example of such communication is recommendations to reduce or
avoid outdoor physical exertion when the air quality index (AQI) is too
high. Different countries have their own air quality indices and they
use their own method of calculation and health risk categories.
Typically, outdoor^{1} air pollution can be defined as the emission of
harmful substances into the atmosphere. This broad definition
encapsulates a number of pollutants, including:

- sulphur dioxide ($SO_2$),
- nitrogen oxides ($NO_x$),
- ozone ($O_3$),
- particulate matter,
- carbon monoxide ($CO$)
- carbon dioxide ($CO_2$)
- and volatile organic compounds ($VOC_s$).

We took the historical data of annual emissions of carbon dioxide per person, based on territorial emissions derived by Our World in Data. Per capita emissions (production-based) were calculated dividing the total national $CO_2$ emissions per year by population estimates. The summary of the dataset is presented below. The number of observations in the initial data is 42,844 and the range of variable time span is 1751 – 2017. For the further analysis the observation before 1970 that have value of zero are omitted. The dataset used in this analysis can be found here.

```
# load the required libraries used in the article
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, dplyr, kableExtra, tsoutliers, forecast, TSA)
```

```
co <- read.csv("co-emissions-per-capita.csv")
cond <- which(co$Year < 1970 & co$Co2 == 0)
co <- co[-cond,]
```

We can look at the summary of the dataframe.

```
options(knitr.kable.NA = '')
knitr::kable(summary(co), digits=2) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
```

Entity | Code | Year | Co2 | |
---|---|---|---|---|

Canada : 218 | CAN : 218 | Min. :1751 | Min. : 0.0000 | |

Germany : 218 | DEU : 218 | 1st Qu.:1948 | 1st Qu.: 0.2029 | |

Poland : 218 | GBR : 218 | Median :1975 | Median : 1.0925 | |

United Kingdom: 218 | POL : 218 | Mean :1965 | Mean : 3.5506 | |

United States : 218 | USA : 218 | 3rd Qu.:1996 | 3rd Qu.: 4.4751 | |

France : 210 | FRA : 210 | Max. :2017 | Max. :252.6451 | |

(Other) :15560 | (Other):15560 |

The variables `Entity`

and `Code`

show the country and/ or area and its
code, where the level of carbon dioxide was measured. The variables
`Year`

and `Co2`

show the time span of variable `Co2`

and the carbon
dioxide measured in tones ($CO_2$ data has been converted from tones of
carbon to tones of carbon dioxide ($CO_2$) using a conversion factor of
3.664).

We will now explore the pattern of $CO_2$ by looking at the data of 3 particular countries: Venezuela, United Arab Emirates, and the Bahamas.

```
co %>% filter(Entity %in% c("Venezuela", "United Arab Emirates", "Bahamas")) %>%
ggplot() + geom_line(aes(x = Year, y = Co2, color = Entity)) +
theme(legend.position="None") + facet_grid(.~Entity, scales = "fixed") +
ggtitle("The emissions of CO2 per capita")
```

These countries were not selected randomly. If you look at the time
series of these countries separately, the different types of outliers:
**additive, innovational**, and **level shift outliers** can be found in
the time series of $CO_2$ in Venezuela, the United Arab Emirates and the
Bahamas correspondingly. Outliers are usually the unexpected spikes or
dips of observations at given points in time.

# Outliers and detection

Time series observations reveal frequent shifts in the data. The shifts
in a time series that cannot be explained are referred to as
**outliers**. One important reason to detect outliers is that they can
represent an error. At the same time, these observations are
inconsistent with the rest of the series and can dramatically influence
the analysis, thus affecting the forecasting capacity of the time series
model. So, to increase the prediction accuracy they must be removed
before the analysis. Here are the most common types of outliers:

- Additive Outliers
- Level Shifts
- Transient Change
- Innovation Outliers
- Seasonal Level Shifts

Three of the aforementioned outliers are to be considered below. There
are different tools and methods to detect and deal with outliers (time
series decomposition, generalized extreme student deviation, ARIMA,
etc.). In order to detect the presence of outliers in the time series we
will use the R function `tso()`

, an interface to the automatic
detection procedure provided in the package
tsoutliers.
The R package `tsoutliers`

implements the **Chen and Liu** procedure for
the detection of outliers in time series. Different types of outliers
(by default: **AO** additive outliers; **LS** level shifts, **IO**
innovative outliers, etc.) can be selected in the function. The proposed
procedure may be applied both to general seasonal and nonseasonal ARMA
processes.

## Additive Outliers

An **additive outlier** (*AO*) corresponds to an exogenous change of a
**single observation** of the time series. Subsequent observations are
unaffected by an additive outlier. Due to external causes, this type of
outlier is usually associated with **isolated** incidents like
measurement errors or impulse effects.

The graph indicates that the time series of the following countries can have an additive outlier/outliers:

- Brunei
- Mauritania
- Togo
- The United Kingdom
- Venezuela
- Yemen

```
AO <- co[co$Entity %in% c("Brunei", "Mauritania", "Togo", "United Kingdom", "Venezuela", "Yemen"), ]
ggplot(data = AO) + geom_line(aes(x = Year, y = Co2, color = Entity)) +
theme(legend.position="None") + facet_wrap(~Entity, scales = "free") +
ggtitle("The emissions of CO2 per capita")
```

We will now consider the time series of Venezuela and United Kingdom.
The plot above shows that the value of the year 1946 (17 tonnes of
carbon dioxide) in Venezuela is not typical for the data. To check this
assumption the `tso`

function can be used.

```
tsoutliers::tso(ts(AO[AO$Entity == "Venezuela", "Co2"],
start = AO[AO$Entity == "Venezuela", "Year"][1]), types = "AO")
```

```
## Series: ts(AO[AO$Entity == "Venezuela", "Co2"], start = AO[AO$Entity == "Venezuela", "Year"][1])
## Regression with ARIMA(0,1,0) errors
##
## Coefficients:
## AO34
## 13.1953
## s.e. 0.4437
##
## sigma^2 estimated as 0.3976: log likelihood=-99.11
## AIC=202.21 AICc=202.33 BIC=207.5
##
## Outliers:
## type ind time coefhat tstat
## 1 AO 34 1937 13.2 29.74
```

We will now study the value of CO2 in Venezuela tested as an outlier (with the index of 34).

```
AO[AO$Entity == "Venezuela",][34,]
```

```
## Entity Code Year Co2
## 41780 Venezuela VEN 1946 17.03064
```

The procedure first detects outliers upon a chosen ARIMA model (which is
an iterative process), then it chooses the ARIMA model including the
outliers detected in the previous step and removes the outliers that are
not significant in the new fit. The procedure is repeated until no
outliers are detected or until a maximum number of iterations (by
default, 4) is reached (Chen and Liu, 2013). There is yet another `TSA`

function that can be used to check the existence of additive outliers.

```
detectAO(arima(ts(AO[AO$Entity == "Venezuela", "Co2"]),
order = c(0, 0, 0)), alpha = 0.05, robust = TRUE)
```

```
## [,1]
## ind 34.000000
## lambda2 4.238379
```

The time series of Brunei also detects an outlier between the years of
1947 and 1949. These, however, are not the only types of outliers data
can have. For example, the value of 1970 can be considered a level shift
outlier. Different types of outliers can be simultaneously checked using
the vector of possible types in the `tso()`

function. Although, Chung
and Liu (2013) state that, from a computational standpoint, the strategy
of detecting outliers one by one may be the only feasible approach to
dealing with multiple outliers, it seems more appropriate to estimate
the outlier effects jointly rather than sequentially. Besides, a
procedure based solely on iteratively adjusted residuals may often
result in biased estimates for adjacent outliers.

```
tsoutliers::tso(ts(AO[AO$Entity == "United Kingdom", "Co2"],
start = AO[AO$Entity == "United Kingdom", "Year"][1]), types = c("AO","LS", "IO"))
```

```
## Series: ts(AO[AO$Entity == "United Kingdom", "Co2"], start = AO[AO$Entity == "United Kingdom", "Year"][1])
## Regression with ARIMA(1,2,2) errors
##
## Coefficients:
## ar1 ma1 ma2 AO94 IO122 AO127
## 0.5189 -1.7097 0.7403 -1.2028 -0.1929 -4.6164
## s.e. 0.1783 0.1424 0.1436 0.3247 0.1766 0.3261
##
## sigma^2 estimated as 0.181: log likelihood=-120.34
## AIC=254.68 AICc=255.22 BIC=278.31
##
## Outliers:
## type ind time coefhat tstat
## 1 AO 94 1893 -1.2028 -3.704
## 2 IO 122 1921 -0.1929 -1.092
## 3 AO 127 1926 -4.6164 -14.155
```

```
AO[AO$Entity == "United Kingdom", ][c(94,122,127),]
```

```
## Entity Code Year Co2
## 40421 United Kingdom GBR 1893 8.958951
## 40449 United Kingdom GBR 1921 7.245594
## 40454 United Kingdom GBR 1926 5.437124
```

As assumed, the levels of $CO_2$ in the year of 1893 and 1926 are treated as outliers. In addition, the algorithm found innovational outlier (see details about this type of outlier below) in the year 1921.

In forecasting models, removing outliers may be highly dangerous. If the
forecasting models are sensitive to outliers, we need to replace them,
because outliers will skew the coefficients. To deal with additive
outliers, the `tsoutliers()`

function from the package `forecast`

,
designed to identify and to suggest potential replacement values, can be
used. In our data of the United Kingdom there are three outliers and,
correspondingly, three replacements:

```
tsoutliers(ts(AO[AO$Entity == "United Kingdom", "Co2"],
start = AO[AO$Entity == "United Kingdom", "Year"][1]))
```

```
## $index
## [1] 94 122 127
##
## $replacements
## [1] 10.12905 10.18127 10.10963
```

```
replUnKing <- ts(AO[AO$Entity == "United Kingdom", "Co2"],
start = AO[AO$Entity == "United Kingdom", "Year"][1])
replUnKing[c(94, 122, 127)] <- c(10.12905, 10.18127, 10.10963)
```

An alternative useful function identifying and replacing outliers is
`tsclean()`

from the same package. In this case, linear interpolation is
used to estimate missing values and to replace outliers.

```
replUnKing <- tsclean(ts(AO[AO$Entity == "United Kingdom", "Co2"],
start = AO[AO$Entity == "United Kingdom", "Year"][1]))
```

```
out <- data.frame(
UK = ts(AO[AO$Entity == "United Kingdom", "Co2"],
start = AO[AO$Entity == "United Kingdom", "Year"][1]),
replUnKing, Year = AO[AO$Entity == "United Kingdom", "Year"])
ggplot(data = out) +
geom_line(aes(x = Year, y = UK, color="#FC4E07"), size =1.1, alpha = 0.5) +
geom_line(aes(x = Year, y = replUnKing, color="#E7B800")) +
scale_color_discrete(name = " ", labels = c("Replaced", "UK CO2"))+
labs(y = "CO2", title = "Replacing the outliers")
```

## Innovational Outliers

**Innovational Outlier** or Intervention Outlier (*IO*) corresponds to
an endogenous change of a single innovation of the time series and is
usually associated with isolated incidents such as an impulse. The
influence of the outliers may increase over time. It represents a shock.

The following countries are selected to visually study the case of having innovational outlier/outliers:

- Cape Verde
- Iran
- Germany
- Armenia
- The United Arab Emirates
- Mauritania

```
IO <- co[co$Entity %in% c("Cape Verde", "Iran", "Germany", "Armenia",
"United Arab Emirates", "Mauritania"), ]
ggplot(data = IO) +
geom_line(aes(x = Year, y = Co2, color = Entity)) +
theme(legend.position="None") +
facet_wrap(~Entity,scales = "free") +
ggtitle("The emissions of CO2 per capita")
```

The time series for the United Arab Emirates shows innovational outliers in 1969 and 1998, which are to be tested:

```
tsoutliers::tso(ts(IO[IO$Entity == "United Arab Emirates","Co2"],
start = IO[IO$Entity == "United Arab Emirates", "Year"][1]),types = c("LS","IO"))
```

```
## Series: ts(IO[IO$Entity == "United Arab Emirates", "Co2"], start = IO[IO$Entity == "United Arab Emirates", "Year"][1])
## Regression with ARIMA(2,0,0) errors
##
## Coefficients:
## ar1 ar2 IO11 LS26 IO40
## 1.3633 -0.4085 66.2302 9.0712 9.2042
## s.e. 0.1161 0.1162 3.6876 6.0046 3.6875
##
## sigma^2 estimated as 44.8: log likelihood=-194.83
## AIC=401.67 AICc=403.29 BIC=414.13
##
## Outliers:
## type ind time coefhat tstat
## 1 IO 11 1969 66.230 17.960
## 2 LS 26 1984 9.071 1.511
## 3 IO 40 1998 9.204 2.496
```

```
IO[IO$Entity == "United Arab Emirates",][c(11,40),]
```

```
## Entity Code Year Co2
## 40279 United Arab Emirates ARE 1969 100.61529
## 40308 United Arab Emirates ARE 1998 28.42788
```

The algorithm detected a level shift outlier in the year 1984 (see details below).

Armenia’s level of $CO_2$ has an innovational outlier in 1993. It can
also be seen through the function `detectIO()`

and can be visually
studied in the graph above.

```
tsoutliers::tso(ts(IO[IO$Entity == "Armenia", "Co2"],
start = IO[IO$Entity == "Armenia", "Year"][1]),types = c("IO", "AO"))
```

```
## Series: ts(IO[IO$Entity == "Armenia", "Co2"], start = IO[IO$Entity == "Armenia", "Year"][1])
## Regression with ARIMA(0,1,0) errors
##
## Coefficients:
## IO35
## -0.9156
## s.e. 0.1209
##
## sigma^2 estimated as 0.01488: log likelihood=40.23
## AIC=-76.45 AICc=-76.24 BIC=-72.33
##
## Outliers:
## type ind time coefhat tstat
## 1 IO 35 1993 -0.9156 -7.571
```

The number 35 is tagged as an innovative outlier.

```
IO[IO$Entity == "Armenia",][35,]
```

```
## Entity Code Year Co2
## 1720 Armenia ARM 1993 0.7458196
```

Detect innovative outlier with detectIO().

```
detectIO(arima(ts(IO[IO$Entity == "Armenia", "Co2"]),
order = c(0, 1, 0)), alpha = 0.05, robust = TRUE)
```

```
## [,1] [,2]
## ind 35.00000 51.000000
## lambda1 -7.49181 -3.348504
```

## Level Shift Outliers

**Level Shift Outlier** (*LS*) represents an abrupt change in the mean
level and may or may not be seasonal (Seasonal Level Shift). LS is a
change in the mean level of the time series starting at some point and
continuing until the end of the observed period. In contrast to additive
outliers, a level shift outlier affects **many observations** and has a
**permanent** effect. It can be modelled in terms of step function with
the magnitude equal to the omega parameter.

The countries below are selected to visually study the case of having level shift outlier/outliers:

- Iceland
- North Korea
- Yemen
- Bahamas
- Lithuania

```
LS <- co[co$Entity %in% c("Iceland", "North Korea", "Yemen", "Bahamas", "Lithuania"), ]
ggplot(data = LS) + geom_line(aes(x = Year, y = Co2, color = Entity)) +
theme(legend.position="None") + facet_wrap(~Entity,scales = "free") +
ggtitle("The emissions of CO2 per capita")
```

A similar analysis for the level shift outlier was conducted for the time series of Iceland and Bahamas. The results are presented below:

```
tsoutliers::tso(ts(LS[LS$Entity == "Iceland", "Co2"],
start = LS[LS$Entity == "Iceland", "Year"][1]), types = c("LS", "AO"))
```

```
## Series: ts(LS[LS$Entity == "Iceland", "Co2"], start = LS[LS$Entity == "Iceland", "Year"][1])
## Regression with ARIMA(0,1,0) errors
##
## Coefficients:
## LS12
## 5.1622
## s.e. 0.5347
##
## sigma^2 estimated as 0.2897: log likelihood=-61.85
## AIC=127.7 AICc=127.86 BIC=132.41
##
## Outliers:
## type ind time coefhat tstat
## 1 LS 12 1947 5.162 9.654
```

As can be seen from the output above, case number 12 is tagged as an outlier.

```
LS[LS$Entity == "Iceland",][12,]
```

```
## Entity Code Year Co2
## 16501 Iceland ISL 1950 5.188092
```

And for Bahamas:

```
tsoutliers::tso(ts(LS[LS$Entity == "Bahamas", "Co2"],
start = LS[LS$Entity == "Bahamas", "Year"][1]),types = "LS")
```

```
## Series: ts(LS[LS$Entity == "Bahamas", "Co2"], start = LS[LS$Entity == "Bahamas", "Year"][1])
## Regression with ARIMA(3,0,0) errors
##
## Coefficients:
## ar1 ar2 ar3 LS22 LS32
## 0.1292 0.525 0.2965 26.6765 -23.6281
## s.e. 0.1150 0.097 0.1150 1.9063 1.8062
##
## sigma^2 estimated as 6.589: log likelihood=-159.16
## AIC=330.33 AICc=331.71 BIC=343.65
##
## Outliers:
## type ind time coefhat tstat
## 1 LS 22 1971 26.68 13.99
## 2 LS 32 1981 -23.63 -13.08
```

```
LS[LS$Entity == "Bahamas",][c(22,32),]
```

```
## Entity Code Year Co2
## 2570 Bahamas BHS 1971 38.67267
## 2580 Bahamas BHS 1981 12.99207
```

A dummy variable can be used to account for level shift outliers and innovative outliers in the data. Rather than omitting the outlier, a dummy variable removes its effect. In the case of one outlier, the dummy variable takes value 1 for the observations before the outliers occur and 0 for other observations. For two and more outliers coding with several dummy variables can be implemented.

# Conclusions

In the article, we considered three types of outliers (AO, IO, and LS) in time series modeling. Here is the summary of these types of outliers:

- An Additive Outlier (AO) represents an isolated spike.
- A Level Shift (LS) represents an abrupt change in the mean level.
- An Innovational Outlier (IO) represents a shock.

To illustrate, we described the time series of annual emissions of
carbon dioxide per person to detect these outliers. The iterative
procedure of finding an outlier from the package `tsoutliers`

was used.
The next step in outlier analysis should be their adjustment.

# Reference list

Chen, C. and Liu, L. (2013) ‘Joint Estimation of Model Parameters and Outlier Effects in Time Series Joint Estimation of Model Parameters and Outlier Effects in Time Series’, 88(421), pp. 284–297.

Ahmar, A. et al. (2018) ‘Modeling Data Containing Outliers using ARIMA Additive Outlier (ARIMA-AO) Modeling Data Containing Outliers using ARIMA Additive Outlier (ARIMA-AO)’, Journal of Physics Conference Series, 954. doi: 10.1088/1742-6596/954/1/012010.

Ruey S. Tsay (1988) “Outliers, level shifts, and variance changes in time series”, Journal of Forecasting, 7, pp. 1-20.