R for Data Science: Ch 5, Ch 9 through 12, Ch 18
R really comes into its own when working with big data sets: thousands, hundreds of thousands, or even millions of rows of data. When trying to make sense of so much data, scrolling through spreadsheets often becomes impractical.
Increasingly, large data sets are becoming the norm in energy markets as well. It is important to learn a few tools for handling big data sets and for making sense of trends and patterns within all the noise.
In this lab we will start with a large data set of all Norwegian firms over several years. We will first clean and organize this data, and then do some exploratory analysis, trying to extract some trends from the data.
Topically, we want to explore some trends in the oil and gas industry in Norway in a particularly turbulent period.
Oil and gas prices fell sharply after 2012 and stayed low following innovations in “fracking” for oil and gas in the United States. This forced a major restructuring in the high-cost Norwegian oil and gas sector.
A question you can keep in the back of your mind as you look through this data is: what lessons does this restructuring hold for the future of the oil and gas industry, both in Norway and the world? What potential future "shocks" does the industry face? How could government policy hurt or help?
We have already used many of the tools and packages within the tidyverse. In this lab, we will focus on a group of tools that are especially handy for handling big data. Most of these can be found in the package dplyr, and come under the general title split-apply-combine.
The idea is that to identify trends and patterns in data, you first need to split the data into sensible groups, apply some function (like a mean, median, etc.), and then combine the results from the different groups to see the overall pattern.
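As a small illustration of the split-apply-combine idea before we meet the real data, here is a sketch with a made-up toy data frame (the sectors and numbers are invented for illustration):

```r
library(dplyr)

# A toy data set: revenues for firms in two sectors over two years
toy = tibble(
  sector  = c("oil", "oil", "fish", "fish"),
  year    = c(2014, 2015, 2014, 2015),
  revenue = c(100, 80, 20, 25)
)

# Split by sector, apply a mean, combine the results into one table
toy %>%
  group_by(sector) %>%
  summarise(mean_revenue = mean(revenue))
# fish: 22.5, oil: 90
```

This is exactly the pattern we will apply to the full firm-level data below, just with many more groups.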
We begin by loading in the packages we need: our group of tidyverse packages, the package zoo, which includes tools for working with time-series data (data organised by date), and lubridate, which helps with handling dates.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
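We will not need zoo right away, but as a quick taste of the kind of tool it provides, here is a sketch using rollmean (a real zoo function) on a made-up price series:

```r
library(zoo)

# A toy monthly price series with some noise (invented numbers)
prices = c(100, 104, 98, 110, 107, 115, 112, 120)

# A centered 3-period rolling mean smooths out the noise;
# the result is shorter than the input by k - 1 observations
rollmean(prices, k = 3)
```

Rolling means like this are a standard first step when looking for trends in noisy time-series data.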
The data we will be examining is register data on Norwegian-registered companies and organizations and their yearly accounting information over a period of 10 years. The data is taken from the database Proff Forvalt which is available through the NHH library. For simplicity, we will be using data I have slightly cleaned.
In lab 3, we will go on and try to match accounting data on companies active on the Norwegian continental shelf with data from the Norwegian petroleum directorate.
We start by importing the accounting data directly from my website and giving it the name “adf” (accounting data for firms).
adf = read_csv("https://jmaurit.github.io/anvendt_macro/data/adf.csv")
Here we see that the data frame has about 130,000 observations and 47 variables.
We start by cleaning a bit. We have one column called "…1", which is just row numbers. We can delete it:
adf = adf[-1]
We also can define some new names for our columns:
new_names = c("working_capital", "working_capital_perc", "fixed_assets", "long_debt", "NACE_desc", "NACE_code", "profit", "other_fin_instr", "employees", "depreciation", "change_inventories", "operating_income", "operating_costs", "operating_result", "equity", "total_assets", "org_type", "principality", "debt", "inv", "cash", "municipality", "corp_accounts", "short_debt", "accounts_receivable", "director", "liquidity", "wage_costs", "profitability", "current_assets", "pretax_profit", "orgnr", "audit_remarks", "audit_komments", "audit_explanation_txt", "audit_explanation", "sales_revenue", "solidity", "status", "founded_date", "dividend", "currency_code", "supply_cost", "inventory", "year", "name")
colnames(adf) = new_names
names(adf)
## [1] "working_capital" "working_capital_perc" "fixed_assets"
## [4] "long_debt" "NACE_desc" "NACE_code"
## [7] "profit" "other_fin_instr" "employees"
## [10] "depreciation" "change_inventories" "operating_income"
## [13] "operating_costs" "operating_result" "equity"
## [16] "total_assets" "org_type" "principality"
## [19] "debt" "inv" "cash"
## [22] "municipality" "corp_accounts" "short_debt"
## [25] "accounts_receivable" "director" "liquidity"
## [28] "wage_costs" "profitability" "current_assets"
## [31] "pretax_profit" "orgnr" "audit_remarks"
## [34] "audit_komments" "audit_explanation_txt" "audit_explanation"
## [37] "sales_revenue" "solidity" "status"
## [40] "founded_date" "dividend" "currency_code"
## [43] "supply_cost" "inventory" "year"
## [46] "name"
NACE is the European standard for classifying industry sectors. The code is composed of two parts.
You can read more about the classifications here
I want to split the code NACE code into its component parts. First I have to tell R that I want it to recognize the NACE code as characters/text rather than as a number.
Then I want to use the command separate to split the column in two.
You can read more about separate here.
I also will use something called the pipe operator: %>%, which we already saw in the first lab. Basically, this is an operator that you can use with most commands within the tidyverse in order to apply several consecutive commands from left to right.
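As a minimal illustration with a toy vector, the pipe takes the result on the left and feeds it into the function on the right:

```r
library(dplyr)  # the pipe %>% is exported by dplyr (via magrittr)

x = c(4, 9, 16)

# Without the pipe: nested calls read inside-out
sqrt(sum(x))

# With the pipe: the same computation reads left to right
x %>% sum() %>% sqrt()
```

Both lines compute the square root of the sum; the piped version is easier to read once you chain three or four commands.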
adf["NACE_code"] = as.character(adf$NACE_code)
adf = adf %>% separate(NACE_code, into=c("NACE1", "NACE2"), sep="\\.", remove=FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 278 rows [134,
## 135, 136, 137, 138, 139, 140, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 5482,
## 5483, 5484, 5485, 5486, 5487, ...].
adf[c("NACE1", "NACE2")]
## # A tibble: 129,895 × 2
## NACE1 NACE2
## <chr> <chr>
## 1 6 1
## 2 6 1
## 3 6 1
## 4 6 1
## 5 6 1
## 6 6 1
## 7 6 1
## 8 70 1
## 9 70 1
## 10 70 1
## # … with 129,885 more rows
Let's go step-by-step through what we did.
In the first line, I told R to recognize the data in column NACE_code as character/text rather than as numbers with decimals.
In the second line, I use the pipe operator to say that we start with the adf data set and then apply the separate() function to the column NACE_code within it.
The parameter into=c("NACE1", "NACE2") tells R the names of the two new columns that will be created.
The parameter sep="\\." tells R to look for a "." as the symbol that separates the code (the backslashes are needed because "." is a special character in regular expressions).
The parameter remove=FALSE tells R to retain the original NACE_code column.
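The warning above about missing pieces comes from codes that contain no ".". A toy example (with made-up codes, not our data) shows how separate() fills the missing piece with NA:

```r
library(tidyr)

toy = tibble::tibble(code = c("6.1", "70.22", "99"))

# "99" has no ".", so its NACE2 becomes NA --
# this is exactly what the warning reports
separate(toy, code, into = c("NACE1", "NACE2"), sep = "\\.", remove = FALSE)
```

So the 278 rows flagged in the warning are simply firms whose NACE code has no subsector part.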
Some inspection of the data reveals a few problems. For example, some NACE1 values are equal to zero, but the NACE system starts at 1, so we want to delete these rows.
We also see that data is missing for the year 2016, so we will just delete everything from 2016.
We use the function filter from dplyr to update our data. You can read more here
adf = adf %>% filter(year!=2016) %>% filter(NACE1!="0")
We use the pipe operator to again carry out our commands compactly and intuitively. We start with the data frame, adf, that we want to use.
We apply the first filter command and say we do not want to include rows where year is 2016 (the operator "!=" means "is not equal to").
Then we apply another filter command, where we tell R to only include rows where NACE1 is not equal to "0".
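filter() accepts any logical condition; besides != you will often use ==, >, < and %in%. A sketch with toy data (invented for illustration):

```r
library(dplyr)

toy = tibble(
  NACE1 = c("0", "6", "70", "6"),
  year  = c(2015, 2015, 2016, 2014)
)

# Chained filters act like a logical AND:
# keep rows where year is not 2016 AND NACE1 is not "0"
toy %>% filter(year != 2016) %>% filter(NACE1 != "0")

# %in% is handy for keeping a whole set of values at once
toy %>% filter(NACE1 %in% c("6", "70"))
```

We will use %in% further down, when we restrict the data to the top four sectors.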
Now we are ready to do some descriptive analysis.
What we often want to do with a large data set like this is to split it up into groups, and then calculate some summarizing statistic about that group - like a sum, average, variance, etc.
For example, perhaps we want to calculate the sum of revenues, profits, costs and wage costs for all the main sectors (NACE1) and for each year.
To do this, we use commands that come under the theme of split-apply-combine, with most coming from the dplyr package, which is a part of tidyverse.
We will use the commands group_by and summarise that you can read more about here.
sum_adf = adf %>% group_by(NACE1, year) %>%
summarise(
total_income = sum(operating_income, na.rm=TRUE),
total_result = sum(operating_result, na.rm=TRUE),
total_costs = sum(operating_costs, na.rm=TRUE),
total_wages = sum(wage_costs, na.rm=TRUE)
)
## `summarise()` has grouped output by 'NACE1'. You can override using the
## `.groups` argument.
Let’s go through this step-by-step.
Again we use the pipe operator, so we can write our commands compactly and intuitively from left to right.
We start with the data frame, adf, and then we tell R that we want to group the data by the NACE main sector (NACE1) and by year.
With this grouped data, we then use the summarise command to apply several functions to each group.
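The message about grouped output means the result is still grouped by NACE1 (the last grouping variable, year, is dropped after summarising). If you want an ordinary ungrouped data frame back, you can set the .groups argument explicitly; a sketch with toy data:

```r
library(dplyr)

toy = tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c(1, 2, 1, 2),
  x  = c(10, 20, 30, 40)
)

# By default the result keeps the g1 grouping;
# .groups = "drop" removes all residual grouping
res = toy %>%
  group_by(g1, g2) %>%
  summarise(total = sum(x), .groups = "drop")

is_grouped_df(res)  # FALSE: no residual grouping
```

Residual grouping matters because later dplyr commands on the result will operate group-by-group unless you drop or ungroup() it.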
We have now created a new data frame of this aggregated data:
sum_adf
## # A tibble: 574 × 6
## # Groups: NACE1 [82]
## NACE1 year total_income total_result total_costs total_wages
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2009 1862945 20103 1842846 1209569
## 2 1 2010 2048207 5557 2042654 1332220
## 3 1 2011 2296387 -2771 2299162 1540094
## 4 1 2012 2473189 74522 2398665 1618666
## 5 1 2013 2614102 31576 2582531 1758487
## 6 1 2014 2706554 51567 2654988 1776993
## 7 1 2015 2874254 40825 2833426 1883250
## 8 10 2009 105911476 4696117 101215367 15539230
## 9 10 2010 109992232 6070385 103921844 15452894
## 10 10 2011 116709503 5814757 110894746 15972804
## # … with 564 more rows
We now have data on total operating revenue, profit, cost and wage cost within each main industry for each year.
Let's use filter to look at just the data from 2015:
sum_adf15 = filter(sum_adf, year==2015)
This data is now in wide format where the different variables are separate columns. We want to put it in long format, where the variables are represented as a column, and all values are in another column.
We will use the command pivot_longer() to make our data into long format. Read more here.
(R for Data Science discusses why and how to put data into long or tidy format. The book uses the function gather, which works fine, but pivot_longer() is meant to replace it as a more intuitive and powerful alternative.)
sum_adf15_long = sum_adf15 %>% pivot_longer(cols = total_income:total_wages, names_to="variabel", values_to="value")
Here we have told R to take the data frame sum_adf15 and convert the columns from total_income through total_wages into long format.
It is easier to understand if you see the end result:
sum_adf15_long
## # A tibble: 328 × 4
## # Groups: NACE1 [82]
## NACE1 year variabel value
## <chr> <dbl> <chr> <dbl>
## 1 1 2015 total_income 2874254
## 2 1 2015 total_result 40825
## 3 1 2015 total_costs 2833426
## 4 1 2015 total_wages 1883250
## 5 10 2015 total_income 151318372
## 6 10 2015 total_result 7663734
## 7 10 2015 total_costs 143631547
## 8 10 2015 total_wages 20149212
## 9 11 2015 total_income 7321582
## 10 11 2015 total_result 408186
## # … with 318 more rows
The column names are now aggregated in a single column called variabel, while all the values are in a column called value.
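pivot_wider() is the inverse operation: it spreads a long data set back out to wide format, with one column per value of the name column. A toy round trip (invented numbers):

```r
library(tidyr)

long = tibble::tibble(
  sector   = c("a", "a", "b", "b"),
  variabel = c("income", "costs", "income", "costs"),
  value    = c(100, 80, 50, 45)
)

# Back to wide: one column per value of variabel
pivot_wider(long, names_from = variabel, values_from = value)
```

Knowing both directions is useful: long format suits ggplot and grouped summaries, while wide format suits tables and computing new columns from several variables.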
Let’s look at income by sector first:
income15 = filter(sum_adf15_long, variabel == "total_income" )
We can use ggplot to plot the total income of all the main sectors. We can see what sectors generate the most revenue in Norway. Perhaps you guess that oil and gas will be number one…
ggplot(income15, aes(x= reorder(NACE1, -value), y=value)) +
geom_col()
We see that a few sectors make up a large portion of income, though it is hard to see the labels.
Now we can also create a set of sub-plots (called facets) where we also look at profits, costs, and wage costs. You can read more about facets here
sum_adf15_long %>%
ggplot(aes(x=reorder(NACE1,-value), y=value)) +
geom_col() +
facet_wrap(~variabel, ncol=2, scales="free") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
xlab("")
In this plot, I have used NACE1 as the x-variable (reordered in descending order of value) and the value column as the y-variable.
With the command facet_wrap, I tell R to create individual sub-plots based on the categories in the column variabel.
The parameter ncol=2 tells R that I want to create two columns of subplots.
These are all ordered by income, and what we see is that the industry with the highest income is not the one with the highest profit.
Let’s look a little closer at the 10 largest industrial sectors by income. Here I use the command arrange.
arrange(income15, desc(value))$NACE1[1:10]
## [1] "70" "6" "46" "65" "47" "64" "10" "84" "86" "45"
From our list of codes we can see that the largest sector is in fact not oil and gas (which has code "6" and is in second place).
Instead, the largest sector is #70: Activities of head offices; management consultancy activities. The biggest sector in our economy is office work!
Now let's see who makes the most profit:
result15 = filter(sum_adf15_long, variabel == "total_result" )
top4_res = arrange(result15, desc(value))$NACE1[1:4]
print(top4_res)
## [1] "64" "70" "35" "6"
Oil and gas is not the most profitable either (at least not in 2015.) Here it was #64 Finance. Then head-office work/consulting, and then comes the utility sector (electricity, gas and heat supply). And then at #4 came oil and gas.
Now let's look at profits in these four industries over time:
topp4_data = filter(sum_adf, NACE1 %in% top4_res)
ggplot(topp4_data, aes(x=year, y=total_result)) +
geom_line() +
facet_wrap(~NACE1, scales="free")
In the 6 years between 2009 and 2015, it looks like finance, consulting and head-office operations did quite well. Can you find an explanation for this?
Oil and gas extraction, on the other hand, had a peak of profitability in 2011, and declined to a low in 2015. What is behind this?
Sometimes we don’t want to aggregate data, but instead we want to transform the variable in some way or create a new variable. Then we use the function mutate.
Let’s say we want to create a ranking of the most profitable companies within each main sector. Then we can use mutate together with the function min_rank.
adf_ranked = adf %>% group_by(NACE1, year) %>%
mutate(
ranking = min_rank(desc(operating_result))
)
Here we have again started with our original data set, and again grouped by main sector and year. Within each group, we then create a new variable called ranking, which gives the rank in descending order (desc).
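min_rank() gives tied values the same (lowest) rank and then skips the next rank; compare with dense_rank(), which does not skip. A small sketch with made-up profits:

```r
library(dplyr)

profits = c(500, 300, 300, 100)

# Rank in descending order: the two tied firms both get rank 2,
# and min_rank then skips straight to 4
min_rank(desc(profits))    # 1 2 2 4
dense_rank(desc(profits))  # 1 2 2 3
```

Either ranking works for picking out the single top firm per group, since rank 1 is the same in both; they only differ below a tie.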
Then we can filter out the company with the most profits in each category in 2015:
most_profit = filter(adf_ranked, ranking==1 & year == 2015)
print(most_profit[c("name","NACE_desc")])
## # A tibble: 81 × 2
## name NACE_desc
## <chr> <chr>
## 1 STATOIL ASA Utvinning av råolje
## 2 TELENOR ASA Hovedkontortjenester
## 3 HELSE SØR-ØST RHF Offentlig administr…
## 4 KOMMUNAL LANDSPENSJONSKASSE GJENSIDIG FORSIKRINGSSELSKAP Livsforsikring
## 5 DNB BANK ASA Bankvirksomhet elle…
## 6 HYDRO ALUMINIUM AS Produksjon av primæ…
## 7 NORSK TIPPING AS Lotteri og totalisa…
## 8 POSTEN NORGE AS Landsdekkende postt…
## 9 TELENOR NORGE AS Trådløs telekommuni…
## 10 NORWEGIAN AIR SHUTTLE ASA Lufttransport med p…
## # … with 71 more rows
Of course, this is total profits. A better measure, depending on what you are concerned with, might be profits as a share of revenues (profit margin).
Filter out all the companies in the oil and gas sector.
Consider the field [“org_type”]. Restrict the data to companies of type “AS” and “ASA” (AS: Stock companies, ASA: listed stock companies).
Calculate total revenue, profit and liquidity for each year.
Use ggplot to plot these results.
Now create a measure of profit margin (profit/income). Make a bar graph showing which companies had the highest profit margin in 2015. Is there correlation with firm size?
Calculate mean profit margin of oil and gas firms over time. Make a chart.
Open ended question:
Choose 2-3 financial metrics of oil and gas firms (or create your own).
Show in a chart how these metrics vary over firms (by charting against another variable, for example).
Show how these metrics vary over time, aggregated in some way (sum, mean, variance(?)).
Interpret your results. You may have to search for extra information. What was the state of the oil and gas industry during this period?
adf_OG = adf %>% filter(NACE1=="6")
adf_OG = adf_OG %>% filter(org_type=="AS" | org_type=="ASA")
adf_sum = adf_OG %>% group_by(year) %>% summarize(
total_revenue = sum(operating_income, na.rm=TRUE),
total_profit = sum(profit, na.rm=TRUE),
liquidity = mean(liquidity, na.rm=TRUE)
)
adf_sum_long = adf_sum %>% pivot_longer(cols=c(2:4), names_to="indicator", values_to="values")
adf_sum_long %>% ggplot(aes(x=year, y=values)) +
geom_col() +
facet_wrap(~indicator, ncol=2, scales="free_y")
#first filter out observations with 0 in operating income so we are not dividing by 0
adf_OG = adf_OG %>% filter(operating_income!=0)
adf_OG = adf_OG %>% mutate(
pmargin = profit/operating_income
)
#delete NA data
adf_OG = adf_OG %>% filter(!is.na(name))
adf_OG %>% filter(year==2015) %>% ggplot(aes(x=reorder(name, -desc(pmargin)), y=pmargin)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90))
adf_OG %>%filter(year==2015) %>% ggplot(aes(x=total_assets, y=pmargin)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is hard to tell with so few data points, plus Statoil is clearly a large outlier, but generally speaking, the mid-sized firms seem to have done marginally better during this year.
pmargin_mean = adf_OG %>% group_by(year) %>% summarize(
mean_pmargin = mean(pmargin, na.rm=TRUE) #removes NA data if found
)
pmargin_mean %>% filter(year>2009) %>% ggplot(aes(x=year, y=mean_pmargin)) +
geom_line() +
geom_point()
# We only have a single data point for 2009, so start from 2010
Choose 2-3 financial metrics of oil and gas firms (or create your own).
Show in a chart how these metrics vary over firms (by charting against another variable, for example).
Show how these metrics vary over time, aggregated in some way (sum, mean, variance(?)).
Interpret your results. You may have to search for extra information. What was the state of the oil and gas industry during this period?
There are many possible interesting things to look at here. Because of the complex tax rules in place for the Norwegian continental shelf, one interesting area to look at is the various profit metrics: profit, operating profit and pre-tax profit, all of which exist in the data set.
I start by limiting my dataset to a subset of columns
profit_data = adf_OG %>% select(name, year, operating_result, profit, pretax_profit, profitability, depreciation, operating_costs, operating_income)
Let's make a simple scatter plot comparing profit with pretax profit:
profit_data %>% ggplot(aes(x=profit, y=pretax_profit, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0))
profit_data %>% ggplot(aes(x=profit, y=operating_result, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0))
Both figures above show how dramatically tax and other factors affect profit.
Let's normalize these a bit by dividing by operating income:
profit_data = profit_data %>% filter(operating_income!=0)
profit_data = profit_data %>% mutate(
profit_margin = profit/operating_income,
pretax_profit_margin = pretax_profit/operating_income,
operating_margin = operating_result/operating_income
)
profit_data %>% ggplot(aes(x=profit_margin, y=pretax_profit_margin, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0)) +
geom_smooth() +
xlim(-1,1) +
ylim(-1,1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
profit_data %>% ggplot(aes(x=profit_margin, y=operating_margin, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0)) +
geom_smooth() +
xlim(-1,1) +
ylim(-1,1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 24 rows containing non-finite values (stat_smooth).
## Warning: Removed 24 rows containing missing values (geom_point).
The S-shape of this relationship is interesting, but it could just be an artefact of a couple of outliers. In general, though, we can say that the tax system has a two-fold effect. It of course reduces the positive profit margin, but it also lessens the negative profit margin. That is, the government provides an effective subsidy to loss-making firms. This is in some ways intentional and in line with the aims of tax neutrality.
Let's aggregate by year and look at the results:
profit_agg = profit_data %>% group_by(year) %>% summarize(
mean_pmargin = mean(profit_margin, na.rm=TRUE),
mean_pretax_margin = mean(pretax_profit_margin, na.rm=TRUE),
mean_operating_margin = mean(operating_margin, na.rm=TRUE)
)
profit_agg_long = profit_agg %>% pivot_longer(cols=c(2:4), values_to="Margins", names_to="Indicator")
profit_agg_long %>% filter(year>2009) %>% ggplot(aes(x=year, y=Margins, color=Indicator)) +
geom_line(position=position_jitter())
#Notice that I use the position argument to "jitter" the lines in order to differentiate pretax_margin from operating_margin, which are almost equal in aggregate.