Our blog

Know what we are thinking and doing

November 15, 2019 / by /

Outlier detection in Air Pollution data

On a daily basis, economical, technological and political changes cause air pollution, which is currently one of the most pressing environmental issues around the globe. It is particularly acute in large cities with the rising level of industrialization and growing urbanization. The polluted air we breathe results in serious health disorders and decreases the quality of life. Before steps are taken to decrease the harmful effect of daily produced pollutants, certain measurements are required.

Government agencies use air quality index to communicate with the public. An example of such communication is recommendations to reduce or avoid outdoor physical exertion when the air quality index (AQI) is too high. Different countries have their own air quality indices and they use their own method of calculation and health risk categories. Typically, outdoor1 air pollution can be defined as the emission of harmful substances into the atmosphere. This broad definition encapsulates a number of pollutants, including:

  • sulphur dioxide ($SO_2$),
  • nitrogen oxides ($NO_x$),
  • ozone ($O_3$),
  • particulate matter,
  • carbon monoxide ($CO$)
  • carbon dioxide ($CO_2$)
  • and volatile organic compounds ($VOC_s$).

We took the historical data of annual emissions of carbon dioxide per person, based on territorial emissions derived by Our World in Data. Per capita emissions (production-based) were calculated dividing the total national $CO_2$ emissions per year by population estimates. The summary of the dataset is presented below. The number of observations in the initial data is 42,844 and the range of variable time span is 1751 – 2017. For the further analysis the observation before 1970 that have value of zero are omitted. The dataset used in this analysis can be found here.

# load the required libraries used in the article
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, dplyr, kableExtra, tsoutliers, forecast, TSA)
co <- read.csv("co-emissions-per-capita.csv")
cond <- which(co$Year < 1970 & co$Co2 == 0)
co <- co[-cond,]

We can look at the summary of the dataframe.

options(knitr.kable.NA = '')
knitr::kable(summary(co), digits=2) %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Entity Code Year Co2
Canada : 218 CAN : 218 Min. :1751 Min. : 0.0000
Germany : 218 DEU : 218 1st Qu.:1948 1st Qu.: 0.2029
Poland : 218 GBR : 218 Median :1975 Median : 1.0925
United Kingdom: 218 POL : 218 Mean :1965 Mean : 3.5506
United States : 218 USA : 218 3rd Qu.:1996 3rd Qu.: 4.4751
France : 210 FRA : 210 Max. :2017 Max. :252.6451
(Other) :15560 (Other):15560

The variables Entity and Code show the country and/ or area and its code, where the level of carbon dioxide was measured. The variables Year and Co2 show the time span of variable Co2 and the carbon dioxide measured in tones ($CO_2$ data has been converted from tones of carbon to tones of carbon dioxide ($CO_2$) using a conversion factor of 3.664).

We will now explore the pattern of $CO_2$ by looking at the data of 3 particular countries: Venezuela, United Arab Emirates, and the Bahamas.

co %>% filter(Entity %in% c("Venezuela", "United Arab Emirates", "Bahamas")) %>%
ggplot() + geom_line(aes(x = Year, y = Co2, color = Entity)) +
  theme(legend.position="None") + facet_grid(.~Entity, scales = "fixed") +
  ggtitle("The emissions of CO2 per capita")

These countries were not selected randomly. If you look at the time series of these countries separately, the different types of outliers: additive, innovational, and level shift outliers can be found in the time series of $CO_2$ in Venezuela, the United Arab Emirates and the Bahamas correspondingly. Outliers are usually the unexpected spikes or dips of observations at given points in time.

Outliers and detection

Time series observations reveal frequent shifts in the data. The shifts in a time series that cannot be explained are referred to as outliers. One important reason to detect outliers is that they can represent an error. At the same time, these observations are inconsistent with the rest of the series and can dramatically influence the analysis, thus affecting the forecasting capacity of the time series model. So, to increase the prediction accuracy they must be removed before the analysis. Here are the most common types of outliers:

  • Additive Outliers
  • Level Shifts
  • Transient Change
  • Innovation Outliers
  • Seasonal Level Shifts

Three of the aforementioned outliers are to be considered below. There are different tools and methods to detect and deal with outliers (time series decomposition, generalized extreme student deviation, ARIMA, etc.). In order to detect the presence of outliers in the time series we will use the R function tso() , an interface to the automatic detection procedure provided in the package tsoutliers. The R package tsoutliers implements the Chen and Liu procedure for the detection of outliers in time series. Different types of outliers (by default: AO additive outliers; LS level shifts, IO innovative outliers, etc.) can be selected in the function. The proposed procedure may be applied both to general seasonal and nonseasonal ARMA processes.

Additive Outliers

An additive outlier (AO) corresponds to an exogenous change of a single observation of the time series. Subsequent observations are unaffected by an additive outlier. Due to external causes, this type of outlier is usually associated with isolated incidents like measurement errors or impulse effects.

The graph indicates that the time series of the following countries can have an additive outlier/outliers:

  • Brunei
  • Mauritania
  • Togo
  • The United Kingdom
  • Venezuela
  • Yemen
AO <- co[co$Entity %in% c("Brunei", "Mauritania", "Togo", "United Kingdom", "Venezuela", "Yemen"), ]
ggplot(data = AO) + geom_line(aes(x = Year, y = Co2, color = Entity)) +
  theme(legend.position="None") + facet_wrap(~Entity, scales = "free") +
  ggtitle("The emissions of CO2 per capita")

We will now consider the time series of Venezuela and United Kingdom. The plot above shows that the value of the year 1946 (17 tonnes of carbon dioxide) in Venezuela is not typical for the data. To check this assumption the tso function can be used.

tsoutliers::tso(ts(AO[AO$Entity == "Venezuela", "Co2"], 
  start = AO[AO$Entity == "Venezuela", "Year"][1]), types = "AO")
## Series: ts(AO[AO$Entity == "Venezuela", "Co2"], start = AO[AO$Entity ==      "Venezuela", "Year"][1]) 
## Regression with ARIMA(0,1,0) errors 
## 
## Coefficients:
##          AO34
##       13.1953
## s.e.   0.4437
## 
## sigma^2 estimated as 0.3976:  log likelihood=-99.11
## AIC=202.21   AICc=202.33   BIC=207.5
## 
## Outliers:
##   type ind time coefhat tstat
## 1   AO  34 1937    13.2 29.74

We will now study the value of CO2 in Venezuela tested as an outlier (with the index of 34).

AO[AO$Entity == "Venezuela",][34,]
##          Entity Code Year      Co2
## 41780 Venezuela  VEN 1946 17.03064

The procedure first detects outliers upon a chosen ARIMA model (which is an iterative process), then it chooses the ARIMA model including the outliers detected in the previous step and removes the outliers that are not significant in the new fit. The procedure is repeated until no outliers are detected or until a maximum number of iterations (by default, 4) is reached (Chen and Liu, 2013). There is yet another TSA function that can be used to check the existence of additive outliers.

detectAO(arima(ts(AO[AO$Entity == "Venezuela", "Co2"]), 
  order = c(0, 0, 0)), alpha = 0.05, robust = TRUE) 
##              [,1]
## ind     34.000000
## lambda2  4.238379

The time series of Brunei also detects an outlier between the years of 1947 and 1949. These, however, are not the only types of outliers data can have. For example, the value of 1970 can be considered a level shift outlier. Different types of outliers can be simultaneously checked using the vector of possible types in the tso() function. Although, Chung and Liu (2013) state that, from a computational standpoint, the strategy of detecting outliers one by one may be the only feasible approach to dealing with multiple outliers, it seems more appropriate to estimate the outlier effects jointly rather than sequentially. Besides, a procedure based solely on iteratively adjusted residuals may often result in biased estimates for adjacent outliers.

tsoutliers::tso(ts(AO[AO$Entity == "United Kingdom", "Co2"], 
  start = AO[AO$Entity == "United Kingdom", "Year"][1]), types = c("AO","LS", "IO")) 
## Series: ts(AO[AO$Entity == "United Kingdom", "Co2"], start = AO[AO$Entity ==      "United Kingdom", "Year"][1]) 
## Regression with ARIMA(1,2,2) errors 
## 
## Coefficients:
##          ar1      ma1     ma2     AO94    IO122    AO127
##       0.5189  -1.7097  0.7403  -1.2028  -0.1929  -4.6164
## s.e.  0.1783   0.1424  0.1436   0.3247   0.1766   0.3261
## 
## sigma^2 estimated as 0.181:  log likelihood=-120.34
## AIC=254.68   AICc=255.22   BIC=278.31
## 
## Outliers:
##   type ind time coefhat   tstat
## 1   AO  94 1893 -1.2028  -3.704
## 2   IO 122 1921 -0.1929  -1.092
## 3   AO 127 1926 -4.6164 -14.155
AO[AO$Entity == "United Kingdom", ][c(94,122,127),]
##               Entity Code Year      Co2
## 40421 United Kingdom  GBR 1893 8.958951
## 40449 United Kingdom  GBR 1921 7.245594
## 40454 United Kingdom  GBR 1926 5.437124

As assumed, the levels of $CO_2$ in the year of 1893 and 1926 are treated as outliers. In addition, the algorithm found innovational outlier (see details about this type of outlier below) in the year 1921.

In forecasting models, removing outliers may be highly dangerous. If the forecasting models are sensitive to outliers, we need to replace them, because outliers will skew the coefficients. To deal with additive outliers, the tsoutliers() function from the package forecast, designed to identify and to suggest potential replacement values, can be used. In our data of the United Kingdom there are three outliers and, correspondingly, three replacements:

tsoutliers(ts(AO[AO$Entity == "United Kingdom", "Co2"],
  start = AO[AO$Entity == "United Kingdom", "Year"][1]))
## $index
## [1]  94 122 127
## 
## $replacements
## [1] 10.12905 10.18127 10.10963
replUnKing <- ts(AO[AO$Entity == "United Kingdom", "Co2"], 
  start = AO[AO$Entity == "United Kingdom", "Year"][1])
replUnKing[c(94, 122, 127)] <- c(10.12905, 10.18127, 10.10963)

An alternative useful function identifying and replacing outliers is tsclean() from the same package. In this case, linear interpolation is used to estimate missing values and to replace outliers.

replUnKing <- tsclean(ts(AO[AO$Entity == "United Kingdom", "Co2"], 
  start = AO[AO$Entity == "United Kingdom", "Year"][1]))
out <- data.frame(
  UK = ts(AO[AO$Entity == "United Kingdom", "Co2"], 
    start = AO[AO$Entity == "United Kingdom", "Year"][1]),
  replUnKing, Year = AO[AO$Entity == "United Kingdom", "Year"])

ggplot(data = out) + 
  geom_line(aes(x = Year, y = UK, color="#FC4E07"), size =1.1, alpha = 0.5) +
  geom_line(aes(x = Year, y = replUnKing, color="#E7B800")) +
  scale_color_discrete(name = " ", labels = c("Replaced", "UK CO2"))+ 
  labs(y = "CO2", title = "Replacing the outliers")

Innovational Outliers

Innovational Outlier or Intervention Outlier (IO) corresponds to an endogenous change of a single innovation of the time series and is usually associated with isolated incidents such as an impulse. The influence of the outliers may increase over time. It represents a shock.

The following countries are selected to visually study the case of having innovational outlier/outliers:

  • Cape Verde
  • Iran
  • Germany
  • Armenia
  • The United Arab Emirates
  • Mauritania
IO <- co[co$Entity %in% c("Cape Verde", "Iran", "Germany", "Armenia", 
  "United Arab Emirates", "Mauritania"), ]

ggplot(data = IO) + 
  geom_line(aes(x = Year, y = Co2, color = Entity)) +
  theme(legend.position="None") +
  facet_wrap(~Entity,scales = "free") +
  ggtitle("The emissions of CO2 per capita")

The time series for the United Arab Emirates shows innovational outliers in 1969 and 1998, which are to be tested:

tsoutliers::tso(ts(IO[IO$Entity == "United Arab Emirates","Co2"], 
  start = IO[IO$Entity == "United Arab Emirates", "Year"][1]),types = c("LS","IO")) 
## Series: ts(IO[IO$Entity == "United Arab Emirates", "Co2"], start = IO[IO$Entity ==      "United Arab Emirates", "Year"][1]) 
## Regression with ARIMA(2,0,0) errors 
## 
## Coefficients:
##          ar1      ar2     IO11    LS26    IO40
##       1.3633  -0.4085  66.2302  9.0712  9.2042
## s.e.  0.1161   0.1162   3.6876  6.0046  3.6875
## 
## sigma^2 estimated as 44.8:  log likelihood=-194.83
## AIC=401.67   AICc=403.29   BIC=414.13
## 
## Outliers:
##   type ind time coefhat  tstat
## 1   IO  11 1969  66.230 17.960
## 2   LS  26 1984   9.071  1.511
## 3   IO  40 1998   9.204  2.496
IO[IO$Entity == "United Arab Emirates",][c(11,40),]
##                     Entity Code Year       Co2
## 40279 United Arab Emirates  ARE 1969 100.61529
## 40308 United Arab Emirates  ARE 1998  28.42788

The algorithm detected a level shift outlier in the year 1984 (see details below).

Armenia’s level of $CO_2$ has an innovational outlier in 1993. It can also be seen through the function detectIO() and can be visually studied in the graph above.

tsoutliers::tso(ts(IO[IO$Entity == "Armenia", "Co2"],
  start = IO[IO$Entity == "Armenia", "Year"][1]),types = c("IO", "AO")) 
## Series: ts(IO[IO$Entity == "Armenia", "Co2"], start = IO[IO$Entity ==      "Armenia", "Year"][1]) 
## Regression with ARIMA(0,1,0) errors 
## 
## Coefficients:
##          IO35
##       -0.9156
## s.e.   0.1209
## 
## sigma^2 estimated as 0.01488:  log likelihood=40.23
## AIC=-76.45   AICc=-76.24   BIC=-72.33
## 
## Outliers:
##   type ind time coefhat  tstat
## 1   IO  35 1993 -0.9156 -7.571

The number 35 is tagged as an innovative outlier.

IO[IO$Entity == "Armenia",][35,]
##       Entity Code Year       Co2
## 1720 Armenia  ARM 1993 0.7458196

Detect innovative outlier with detectIO().

detectIO(arima(ts(IO[IO$Entity == "Armenia", "Co2"]), 
  order = c(0, 1, 0)), alpha = 0.05, robust = TRUE) 
##             [,1]      [,2]
## ind     35.00000 51.000000
## lambda1 -7.49181 -3.348504

Level Shift Outliers

Level Shift Outlier (LS) represents an abrupt change in the mean level and may or may not be seasonal (Seasonal Level Shift). LS is a change in the mean level of the time series starting at some point and continuing until the end of the observed period. In contrast to additive outliers, a level shift outlier affects many observations and has a permanent effect. It can be modelled in terms of step function with the magnitude equal to the omega parameter.

The countries below are selected to visually study the case of having level shift outlier/outliers:

  • Iceland
  • North Korea
  • Yemen
  • Bahamas
  • Lithuania
LS <- co[co$Entity %in% c("Iceland", "North Korea", "Yemen", "Bahamas", "Lithuania"), ]

ggplot(data = LS) + geom_line(aes(x = Year, y = Co2, color = Entity)) +
  theme(legend.position="None") + facet_wrap(~Entity,scales = "free") +
  ggtitle("The emissions of CO2 per capita")

A similar analysis for the level shift outlier was conducted for the time series of Iceland and Bahamas. The results are presented below:

tsoutliers::tso(ts(LS[LS$Entity == "Iceland", "Co2"], 
  start = LS[LS$Entity == "Iceland", "Year"][1]), types = c("LS", "AO"))
## Series: ts(LS[LS$Entity == "Iceland", "Co2"], start = LS[LS$Entity ==      "Iceland", "Year"][1]) 
## Regression with ARIMA(0,1,0) errors 
## 
## Coefficients:
##         LS12
##       5.1622
## s.e.  0.5347
## 
## sigma^2 estimated as 0.2897:  log likelihood=-61.85
## AIC=127.7   AICc=127.86   BIC=132.41
## 
## Outliers:
##   type ind time coefhat tstat
## 1   LS  12 1947   5.162 9.654

As can be seen from the output above, case number 12 is tagged as an outlier.

LS[LS$Entity == "Iceland",][12,]
##        Entity Code Year      Co2
## 16501 Iceland  ISL 1950 5.188092

And for Bahamas:

tsoutliers::tso(ts(LS[LS$Entity == "Bahamas", "Co2"], 
  start = LS[LS$Entity == "Bahamas", "Year"][1]),types = "LS") 
## Series: ts(LS[LS$Entity == "Bahamas", "Co2"], start = LS[LS$Entity ==      "Bahamas", "Year"][1]) 
## Regression with ARIMA(3,0,0) errors 
## 
## Coefficients:
##          ar1    ar2     ar3     LS22      LS32
##       0.1292  0.525  0.2965  26.6765  -23.6281
## s.e.  0.1150  0.097  0.1150   1.9063    1.8062
## 
## sigma^2 estimated as 6.589:  log likelihood=-159.16
## AIC=330.33   AICc=331.71   BIC=343.65
## 
## Outliers:
##   type ind time coefhat  tstat
## 1   LS  22 1971   26.68  13.99
## 2   LS  32 1981  -23.63 -13.08
LS[LS$Entity == "Bahamas",][c(22,32),]
##       Entity Code Year      Co2
## 2570 Bahamas  BHS 1971 38.67267
## 2580 Bahamas  BHS 1981 12.99207

A dummy variable can be used to account for level shift outliers and innovative outliers in the data. Rather than omitting the outlier, a dummy variable removes its effect. In the case of one outlier, the dummy variable takes value 1 for the observations before the outliers occur and 0 for other observations. For two and more outliers coding with several dummy variables can be implemented.

Conclusions

In the article, we considered three types of outliers (AO, IO, and LS) in time series modeling. Here is the summary of these types of outliers:

  • An Additive Outlier (AO) represents an isolated spike.
  • A Level Shift (LS) represents an abrupt change in the mean level.
  • An Innovational Outlier (IO) represents a shock.

To illustrate, we described the time series of annual emissions of carbon dioxide per person to detect these outliers. The iterative procedure of finding an outlier from the package tsoutliers was used. The next step in outlier analysis should be their adjustment.

Reference list

Chen, C. and Liu, L. (2013) ‘Joint Estimation of Model Parameters and Outlier Effects in Time Series Joint Estimation of Model Parameters and Outlier Effects in Time Series’, 88(421), pp. 284–297.

Ahmar, A. et al. (2018) ‘Modeling Data Containing Outliers using ARIMA Additive Outlier (ARIMA-AO) Modeling Data Containing Outliers using ARIMA Additive Outlier (ARIMA-AO)’, Journal of Physics Conference Series, 954. doi: 10.1088/1742-6596/954/1/012010.

Ruey S. Tsay (1988) “Outliers, level shifts, and variance changes in time series”, Journal of Forecasting, 7, pp. 1-20.

  1. We focused on outdoor pollution, however air pollution occurs in indoor