R for Data Science: Ch 5, Ch 9 through 12, Ch 18
R really comes into its own when working with big data sets: thousands, hundreds of thousands, or even millions of rows of data. When trying to make sense of so much data, scrolling through spreadsheets often becomes impractical.
Increasingly, large data sets are becoming the norm in energy markets as well. It is important to learn a few tools for handling big data sets and for making sense of trends and patterns within all the noise.
In this lab we will start with a large data set of all Norwegian firms over several years. We will first clean and organize this data, and then do some exploratory analysis, trying to extract some trends from the data.
Topically, we want to explore some trends in the oil and gas industry in Norway in a particularly turbulent period.
Oil and gas prices fell sharply after 2012 and stayed low following innovations in “fracking” for oil and gas in the United States. This forced a major restructuring in the high-cost Norwegian oil and gas sector.
A question you can keep in the back of your mind as you look through this data is: what lessons does this restructuring hold for the future of the oil and gas industry, both in Norway and the world? What potential future "shocks" does the industry face? How could government policy hurt or help?
We have already used many of the tools and packages within the tidyverse. In this lab, we will focus on a group of tools that are especially handy for handling big data. Most of these can be found in the package dplyr, and come under the general title split-apply-combine.
The idea is that to identify trends and patterns in data, you first need to split the data into sensible groups, apply some function (like a mean, median, etc.), and then combine the results from the different groups to see the overall pattern.
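As a small illustration of the split-apply-combine idea before we meet the real data, here is a sketch with a made-up toy data frame (the sectors and numbers are invented for illustration):

```r
library(dplyr)

# A toy data set: revenues for firms in two sectors over two years
toy = tibble(
  sector  = c("oil", "oil", "fish", "fish"),
  year    = c(2014, 2015, 2014, 2015),
  revenue = c(100, 80, 20, 25)
)

# Split by sector, apply a mean, combine the results into one table
toy %>%
  group_by(sector) %>%
  summarise(mean_revenue = mean(revenue))
# fish: 22.5, oil: 90
```

This is exactly the pattern we will apply to the full firm-level data below, just with many more groups.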
We begin by loading in the packages we need: our group of tidyverse packages, the package zoo, which includes tools for working with time-series data (data organised by date), and lubridate, which helps with handling dates.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
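We will not need zoo right away, but as a quick taste of the kind of tool it provides, here is a sketch using rollmean (a real zoo function) on a made-up price series:

```r
library(zoo)

# A toy monthly price series with some noise (invented numbers)
prices = c(100, 104, 98, 110, 107, 115, 112, 120)

# A centered 3-period rolling mean smooths out the noise;
# the result is shorter than the input by k - 1 observations
rollmean(prices, k = 3)
```

Rolling means like this are a standard first step when looking for trends in noisy time-series data.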
The data we will be examining is register data on Norwegian-registered companies and organizations and their yearly accounting information over a period of 10 years. The data is taken from the database Proff Forvalt which is available through the NHH library. For simplicity, we will be using data I have slightly cleaned.
In lab 3, we will go on and try to match accounting data on companies active on the Norwegian continental shelf with data from the Norwegian petroleum directorate.
We start by importing the accounting data directly from my website and giving it the name “adf” (accounting data for firms).
adf = read_csv("https://jmaurit.github.io/anvendt_macro/data/adf.csv")
Here we see that the data frame has about 130,000 observations and 47 variables.
We start by cleaning a bit. We have one column called "…1", which is just row numbers. We can delete it:
adf = adf[-1]
We also can define some new names for our columns:
new_names = c("working_capital", "working_capital_perc", "fixed_assets", "long_debt", "NACE_desc", "NACE_code", "profit", "other_fin_instr", "employees", "depreciation", "change_inventories", "operating_income", "operating_costs", "operating_result", "equity", "total_assets", "org_type", "principality", "debt", "inv", "cash", "municipality", "corp_accounts", "short_debt", "accounts_receivable", "director", "liquidity", "wage_costs", "profitability", "current_assets", "pretax_profit", "orgnr", "audit_remarks", "audit_komments", "audit_explanation_txt", "audit_explanation", "sales_revenue", "solidity", "status", "founded_date", "dividend", "currency_code", "supply_cost", "inventory", "year", "name")
colnames(adf) = new_names
names(adf)
## [1] "working_capital" "working_capital_perc" "fixed_assets"
## [4] "long_debt" "NACE_desc" "NACE_code"
## [7] "profit" "other_fin_instr" "employees"
## [10] "depreciation" "change_inventories" "operating_income"
## [13] "operating_costs" "operating_result" "equity"
## [16] "total_assets" "org_type" "principality"
## [19] "debt" "inv" "cash"
## [22] "municipality" "corp_accounts" "short_debt"
## [25] "accounts_receivable" "director" "liquidity"
## [28] "wage_costs" "profitability" "current_assets"
## [31] "pretax_profit" "orgnr" "audit_remarks"
## [34] "audit_komments" "audit_explanation_txt" "audit_explanation"
## [37] "sales_revenue" "solidity" "status"
## [40] "founded_date" "dividend" "currency_code"
## [43] "supply_cost" "inventory" "year"
## [46] "name"
NACE is the European standard for classifying industry sectors. The code is composed of two parts.
You can read more about the classifications here
I want to split the code NACE code into its component parts. First I have to tell R that I want it to recognize the NACE code as characters/text rather than as a number.
Then I want to use the command separate to split the column in two.
You can read more about separate here.
I also will use something called the pipe operator: %>%, which we already saw in the first lab. Basically, this is an operator that you can use with most commands within the tidyverse in order to apply several consecutive commands from left to right.
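As a minimal illustration with a toy vector, the pipe takes the result on the left and feeds it into the function on the right:

```r
library(dplyr)  # the pipe %>% is exported by dplyr (via magrittr)

x = c(4, 9, 16)

# Without the pipe: nested calls read inside-out
sqrt(sum(x))

# With the pipe: the same computation reads left to right
x %>% sum() %>% sqrt()
```

Both lines compute the square root of the sum; the piped version is easier to read once you chain three or four commands.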
adf["NACE_code"] = as.character(adf$NACE_code)
adf = adf %>% separate(NACE_code, into=c("NACE1", "NACE2"), sep="\\.", remove=FALSE)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 278 rows [134,
## 135, 136, 137, 138, 139, 140, 1191, 1192, 1193, 1194, 1195, 1196, 1197, 5482,
## 5483, 5484, 5485, 5486, 5487, ...].
adf[c("NACE1", "NACE2")]
## # A tibble: 129,895 × 2
## NACE1 NACE2
## <chr> <chr>
## 1 6 1
## 2 6 1
## 3 6 1
## 4 6 1
## 5 6 1
## 6 6 1
## 7 6 1
## 8 70 1
## 9 70 1
## 10 70 1
## # … with 129,885 more rows
Let's go step-by-step through what we did.
In the first line, I told R to recognize the data in column NACE_code as character/text rather than as numbers with decimals.
In the second line, I use the pipe operator to say that we start with the adf data set and then apply the separate() function to the column NACE_code within it.
The parameter into=c("NACE1", "NACE2") tells R the names of the two new columns that will be created.
The parameter sep="\\." tells R to look for a "." as the symbol that separates the code (the backslashes are needed because "." is a special character in regular expressions).
The parameter remove=FALSE tells R to retain the original NACE_code column.
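The warning above about missing pieces comes from codes that contain no ".". A toy example (with made-up codes, not our data) shows how separate() fills the missing piece with NA:

```r
library(tidyr)

toy = tibble::tibble(code = c("6.1", "70.22", "99"))

# "99" has no ".", so its NACE2 becomes NA --
# this is exactly what the warning reports
separate(toy, code, into = c("NACE1", "NACE2"), sep = "\\.", remove = FALSE)
```

So the 278 rows flagged in the warning are simply firms whose NACE code has no subsector part.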
Some inspection of the data reveals a few problems. For example, some NACE1 values are equal to zero, but the NACE system starts at 1, so we want to delete these rows.
We also see that data is missing for the year 2016, so we will just delete everything from 2016.
We use the function filter from dplyr to update our data. You can read more here
adf = adf %>% filter(year!=2016) %>% filter(NACE1!="0")
We use the pipe operator to again carry out our commands compactly and intuitively. We start with the data frame, adf, that we want to use.
We apply the first filter command and say we do not want to include rows where year is 2016 (the operator "!=" means "is not equal to").
Then we apply another filter command, where we tell R to only include rows where NACE1 is not equal to "0".
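filter() accepts any logical condition; besides != you will often use ==, >, < and %in%. A sketch with toy data (invented for illustration):

```r
library(dplyr)

toy = tibble(
  NACE1 = c("0", "6", "70", "6"),
  year  = c(2015, 2015, 2016, 2014)
)

# Chained filters act like a logical AND:
# keep rows where year is not 2016 AND NACE1 is not "0"
toy %>% filter(year != 2016) %>% filter(NACE1 != "0")

# %in% is handy for keeping a whole set of values at once
toy %>% filter(NACE1 %in% c("6", "70"))
```

We will use %in% further down, when we restrict the data to the top four sectors.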
Now we are ready to do some descriptive analysis.
What we often want to do with a large data set like this is to split it up into groups, and then calculate some summarizing statistic about that group - like a sum, average, variance, etc.
For example, perhaps we want to calculate the sum of revenues, profits, costs and wage costs for all the main sectors (NACE1) and for each year.
To do this, we use commands that come under the theme of split-apply-combine, with most coming from the dplyr package, which is a part of tidyverse.
We will use the commands group_by and summarise that you can read more about here.
sum_adf = adf %>% group_by(NACE1, year) %>%
summarise(
total_income = sum(operating_income, na.rm=TRUE),
total_result = sum(operating_result, na.rm=TRUE),
total_costs = sum(operating_costs, na.rm=TRUE),
total_wages = sum(wage_costs, na.rm=TRUE)
)
## `summarise()` has grouped output by 'NACE1'. You can override using the
## `.groups` argument.
Let’s go through this step-by-step.
Again we use the pipe operator, so we can write our commands compactly and intuitively from left to right.
We start with the data frame, adf, and then we tell R that we want to group the data by the NACE main sector (NACE1) and by year.
With this grouped data, we then use the summarise command to apply several functions to each group.
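The message about grouped output means the result is still grouped by NACE1 (the last grouping variable, year, is dropped after summarising). If you want an ordinary ungrouped data frame back, you can set the .groups argument explicitly; a sketch with toy data:

```r
library(dplyr)

toy = tibble(
  g1 = c("a", "a", "b", "b"),
  g2 = c(1, 2, 1, 2),
  x  = c(10, 20, 30, 40)
)

# By default the result keeps the g1 grouping;
# .groups = "drop" removes all residual grouping
res = toy %>%
  group_by(g1, g2) %>%
  summarise(total = sum(x), .groups = "drop")

is_grouped_df(res)  # FALSE: no residual grouping
```

Residual grouping matters because later dplyr commands on the result will operate group-by-group unless you drop or ungroup() it.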
We have now created a new data frame of this aggregated data:
sum_adf
## # A tibble: 574 × 6
## # Groups: NACE1 [82]
## NACE1 year total_income total_result total_costs total_wages
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2009 1862945 20103 1842846 1209569
## 2 1 2010 2048207 5557 2042654 1332220
## 3 1 2011 2296387 -2771 2299162 1540094
## 4 1 2012 2473189 74522 2398665 1618666
## 5 1 2013 2614102 31576 2582531 1758487
## 6 1 2014 2706554 51567 2654988 1776993
## 7 1 2015 2874254 40825 2833426 1883250
## 8 10 2009 105911476 4696117 101215367 15539230
## 9 10 2010 109992232 6070385 103921844 15452894
## 10 10 2011 116709503 5814757 110894746 15972804
## # … with 564 more rows
We now have data on total operating revenue, profit, cost and wage cost within each main industry for each year.
Let's use filter to look at just the data from 2015:
sum_adf15 = filter(sum_adf, year==2015)
This data is now in wide format where the different variables are separate columns. We want to put it in long format, where the variables are represented as a column, and all values are in another column.
We will use the command pivot_longer() to make our data into long format. Read more here.
(R for Data Science discusses why and how to put data into long or tidy format. The book uses the function gather, which works fine, but pivot_longer() is meant to replace it as a more intuitive and powerful alternative.)
sum_adf15_long = sum_adf15 %>% pivot_longer(cols = total_income:total_wages, names_to="variabel", values_to="value")
Here we have told R to take the data frame sum_adf15 and convert the columns from total_income through total_wages into long format.
It is easier to understand if you see the end result:
sum_adf15_long
## # A tibble: 328 × 4
## # Groups: NACE1 [82]
## NACE1 year variabel value
## <chr> <dbl> <chr> <dbl>
## 1 1 2015 total_income 2874254
## 2 1 2015 total_result 40825
## 3 1 2015 total_costs 2833426
## 4 1 2015 total_wages 1883250
## 5 10 2015 total_income 151318372
## 6 10 2015 total_result 7663734
## 7 10 2015 total_costs 143631547
## 8 10 2015 total_wages 20149212
## 9 11 2015 total_income 7321582
## 10 11 2015 total_result 408186
## # … with 318 more rows
The column names are now aggregated in a single column called variabel, while all the values are in a column called value.
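pivot_wider() is the inverse operation: it spreads a long data set back out to wide format, with one column per value of the name column. A toy round trip (invented numbers):

```r
library(tidyr)

long = tibble::tibble(
  sector   = c("a", "a", "b", "b"),
  variabel = c("income", "costs", "income", "costs"),
  value    = c(100, 80, 50, 45)
)

# Back to wide: one column per value of variabel
pivot_wider(long, names_from = variabel, values_from = value)
```

Knowing both directions is useful: long format suits ggplot and grouped summaries, while wide format suits tables and computing new columns from several variables.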
Let’s look at income by sector first:
income15 = filter(sum_adf15_long, variabel == "total_income" )
We can use ggplot to plot the total income of all the main sectors. We can see what sectors generate the most revenue in Norway. Perhaps you guess that oil and gas will be number one…
ggplot(income15, aes(x= reorder(NACE1, -value), y=value)) +
geom_col()
We see that a few sectors make up a large portion of income, though it is hard to see the labels.
Now we can also create a set of sub-plots (called facets) where we also look at profits, costs, and wage costs. You can read more about facets here
sum_adf15_long %>%
ggplot(aes(x=reorder(NACE1,-value), y=value)) +
geom_col() +
facet_wrap(~variabel, ncol=2, scales="free") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
xlab("")
In this plot, I have used NACE1 as the x-variable (reordered in descending order of value) and the value column as the y-variable.
With the command facet_wrap, I tell R to create individual sub-plots based on the categories in the column variabel.
The parameter ncol=2 tells R that I want to create two columns of subplots.
These are all ordered by income, and what we see is that the industry with the highest income is not the one with the highest profit.
Let’s look a little closer at the 10 largest industrial sectors by income. Here I use the command arrange.
arrange(income15, desc(value))$NACE1[1:10]
## [1] "70" "6" "46" "65" "47" "64" "10" "84" "86" "45"
From our list of codes we can see that the largest sector is in fact not oil and gas (which has code "6" and is in second place).
Instead, the largest sector is #70: Activities of head offices; management consultancy activities. The biggest sector in our economy is office work!
Now let's see who makes the most profit:
result15 = filter(sum_adf15_long, variabel == "total_result" )
top4_res = arrange(result15, desc(value))$NACE1[1:4]
print(top4_res)
## [1] "64" "70" "35" "6"
Oil and gas is not the most profitable either (at least not in 2015.) Here it was #64 Finance. Then head-office work/consulting, and then comes the utility sector (electricity, gas and heat supply). And then at #4 came oil and gas.
Now let's look at profits in these four industries over time:
topp4_data = filter(sum_adf, NACE1 %in% top4_res)
ggplot(topp4_data, aes(x=year, y=total_result)) +
geom_line() +
facet_wrap(~NACE1, scales="free")
In the 6 years between 2009 and 2015, it looks like finance, consulting and head-office operations did quite well. Can you find an explanation for this?
Oil and gas extraction, on the other hand, had a peak of profitability in 2011, and declined to a low in 2015. What is behind this?
Sometimes we don’t want to aggregate data, but instead we want to transform the variable in some way or create a new variable. Then we use the function mutate.
Let’s say we want to create a ranking of the most profitable companies within each main sector. Then we can use mutate together with the function min_rank.
adf_ranked = adf %>% group_by(NACE1, year) %>%
mutate(
ranking = min_rank(desc(operating_result))
)
Here we have again started with our original data set, and again grouped by main sector and year. Within each group, we then create a new variable called ranking, which gives the rank in descending order (desc).
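min_rank() gives tied values the same (lowest) rank and then skips the next rank; compare with dense_rank(), which does not skip. A small sketch with made-up profits:

```r
library(dplyr)

profits = c(500, 300, 300, 100)

# Rank in descending order: the two tied firms both get rank 2,
# and min_rank then skips straight to 4
min_rank(desc(profits))    # 1 2 2 4
dense_rank(desc(profits))  # 1 2 2 3
```

Either ranking works for picking out the single top firm per group, since rank 1 is the same in both; they only differ below a tie.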
Then we can filter out the company with the most profits in each category in 2015:
most_profit = filter(adf_ranked, ranking==1 & year == 2015)
print(most_profit[c("name","NACE_desc")])
## # A tibble: 81 × 2
## name NACE_desc
## <chr> <chr>
## 1 STATOIL ASA Utvinning av råolje
## 2 TELENOR ASA Hovedkontortjenester
## 3 HELSE SØR-ØST RHF Offentlig administr…
## 4 KOMMUNAL LANDSPENSJONSKASSE GJENSIDIG FORSIKRINGSSELSKAP Livsforsikring
## 5 DNB BANK ASA Bankvirksomhet elle…
## 6 HYDRO ALUMINIUM AS Produksjon av primæ…
## 7 NORSK TIPPING AS Lotteri og totalisa…
## 8 POSTEN NORGE AS Landsdekkende postt…
## 9 TELENOR NORGE AS Trådløs telekommuni…
## 10 NORWEGIAN AIR SHUTTLE ASA Lufttransport med p…
## # … with 71 more rows
Of course, this is total profits. A better measure, depending on what you are concerned with, might be profits as a share of revenues (profit margin).
Filter out all the companies in the oil and gas sector.
Consider the field [“org_type”]. Restrict the data to companies of type “AS” and “ASA” (AS: Stock companies, ASA: listed stock companies).
Calculate total revenue, profit and liquidity for each year.
Use ggplot to plot these results.
Now create a measure of profit margin (profit/income). Make a bar graph showing which companies had the highest profit margin in 2015. Is there correlation with firm size?
Calculate mean profit margin of oil and gas firms over time. Make a chart.
Open ended question:
Choose 2-3 financial metrics of oil and gas firms (or create your own).
Show in a chart how these metrics vary over firms (by charting against another variable, for example).
Show how these metrics vary over time, aggregated in some way (sum, mean, variance(?)).
Interpret your results. You may have to search for extra information. What was the state of the oil and gas industry during this period?
adf_OG = adf %>% filter(NACE1=="6")
adf_OG = adf_OG %>% filter(org_type=="AS" | org_type=="ASA")
adf_sum = adf_OG %>% group_by(year) %>% summarize(
total_revenue = sum(operating_income, na.rm=TRUE),
total_profit = sum(profit, na.rm=TRUE),
liquidity = mean(liquidity, na.rm=TRUE)
)
adf_sum_long = adf_sum %>% pivot_longer(cols=c(2:4), names_to="indicator", values_to="values")
adf_sum_long %>% ggplot(aes(x=year, y=values)) +
geom_col() +
facet_wrap(~indicator, ncol=2, scales="free_y")
#first filter out observations with 0 in operating income so we are not dividing by 0
adf_OG = adf_OG %>% filter(operating_income!=0)
adf_OG = adf_OG %>% mutate(
pmargin = profit/operating_income
)
#delete NA data
adf_OG = adf_OG %>% filter(!is.na(name))
adf_OG %>% filter(year==2015) %>% ggplot(aes(x=reorder(name, -desc(pmargin)), y=pmargin)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90))
adf_OG %>%filter(year==2015) %>% ggplot(aes(x=total_assets, y=pmargin)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It is hard to tell with so few data points, plus Statoil is clearly a large outlier, but generally speaking, the mid-sized firms seem to have done marginally better during this year.
pmargin_mean = adf_OG %>% group_by(year) %>% summarize(
mean_pmargin = mean(pmargin, na.rm=TRUE) #removes NA data if found
)
pmargin_mean %>% filter(year>2009) %>% ggplot(aes(x=year, y=mean_pmargin)) +
geom_line() +
geom_point()
# We only have a single data point for 2009, so start from 2010
Choose 2-3 financial metrics of oil and gas firms (or create your own).
Show in a chart how these metrics vary over firms (by charting against another variable, for example).
Show how these metrics vary over time, aggregated in some way (sum, mean, variance(?)).
Interpret your results. You may have to search for extra information. What was the state of the oil and gas industry during this period?
There are many possible interesting things to look at here. Because of the complex tax rules in place for the Norwegian continental shelf, one interesting area to look at is the various profit metrics: profit, operating profit and pre-tax profit, all of which exist in the data set.
I start by limiting my dataset to a subset of columns
profit_data = adf_OG %>% select(name, year, operating_result, profit, pretax_profit, profitability, depreciation, operating_costs, operating_income)
Let's make a simple scatter plot comparing profit with pretax profit:
profit_data %>% ggplot(aes(x=profit, y=pretax_profit, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0))
profit_data %>% ggplot(aes(x=profit, y=operating_result, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0))
Both figures above show how dramatically tax and other factors affect profit.
Let's normalize these a bit by dividing by operating income:
profit_data = profit_data %>% filter(operating_income!=0)
profit_data = profit_data %>% mutate(
profit_margin = profit/operating_income,
pretax_profit_margin = pretax_profit/operating_income,
operating_margin = operating_result/operating_income
)
profit_data %>% ggplot(aes(x=profit_margin, y=pretax_profit_margin, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0)) +
geom_smooth() +
xlim(-1,1) +
ylim(-1,1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
profit_data %>% ggplot(aes(x=profit_margin, y=operating_margin, color=year)) +
geom_point() +
geom_abline(aes(slope=1, intercept=0)) +
geom_smooth() +
xlim(-1,1) +
ylim(-1,1)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 24 rows containing non-finite values (stat_smooth).
## Warning: Removed 24 rows containing missing values (geom_point).
The S-shape of this relationship is interesting, but it could just be an artefact of a couple of outliers. In general, though, we can say that the tax system has a two-fold effect. It of course reduces the positive profit margin, but it also lessens the negative profit margin. That is, the government provides an effective subsidy to loss-making firms. This is in some ways intentional and in line with the aims of tax neutrality.
Let's aggregate by year and look at the results:
profit_agg = profit_data %>% group_by(year) %>% summarize(
mean_pmargin = mean(profit_margin, na.rm=TRUE),
mean_pretax_margin = mean(pretax_profit_margin, na.rm=TRUE),
mean_operating_margin = mean(operating_margin, na.rm=TRUE)
)
profit_agg_long = profit_agg %>% pivot_longer(cols=c(2:4), values_to="Margins", names_to="Indicator")
profit_agg_long %>% filter(year>2009) %>% ggplot(aes(x=year, y=Margins, color=Indicator)) +
geom_line(position=position_jitter())
#Notice that I use the position argument to "jitter" the lines in order to differentiate pretax_margin from operating_margin, which are almost equal in aggregate.