R for Data Science: Ch 13, 14, 16
In the majority of real-life analysis situations, you won’t be presented with a complete data set. Instead, you will need to create a data set by combining multiple series and sets from different sources. Combining multiple data sets is called Relational Data.
This process can be both time consuming and challenging, but the R-universe has many tools to help you. You can read more fully about using R for relational data here.
In this lab we will practice using relational data by looking at petroleum exploration activity on the Norwegian Continental Shelf. We want to combine data on exploration with firm-level financial data. The question we want to try to answer is how the firm financial situation affects drilling decisions. What type of firms are doing the risky “wildcat” exploratory drilling on the Norwegian Continental Shelf? How does a firm’s profitability affect how much drilling they do?
We start by loading in the packages in tidyverse
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Then we load in data on all exploratory wells drilled on the Norwegian Continental Shelf for a 10 year period. The data is originally from the Norwegian Petroleum Directorate, but we download it directly from my website in a slightly cleaned format.
exploration = read_csv("https://jmaurit.github.io/analytics/labs/data/wellbore_exploration_last_10_years.csv")
## Rows: 491 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): wlbWellboreName, wlbEntryDate, wlbCompletionDate, wlbDrillingOpera...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We have a couple of columns that are dates. We want to tell r to put these in date format.
exploration["wlbCompletionDate"] =as.Date(exploration$wlbCompletionDate, format = "%d.%m.%Y")
exploration["wlbEntryDate"] =as.Date(exploration$wlbEntryDate, format = "%d.%m.%Y")
Definitions of the columns in our data can be found here.
If we look at our data, we see a column named wlbPurpose. Here we find two types of wells:
Wildcat - or exploratory drilling. To find if there is discovery.
Appraisal - after discovery, to find how large/composition of a field
You can read more on well definitions and well classification from the Norwegian Petroleum Directorate.
The column wlbContent provides information on what was found in the well, if anything.
We can create a frequency table over the types of wells:
exploration %>%group_by(wlbPurpose, wlbContent) %>% summarize(n())
## `summarise()` has grouped output by 'wlbPurpose'. You can override using the
## `.groups` argument.
## # A tibble: 20 × 3
## # Groups: wlbPurpose [2]
## wlbPurpose wlbContent `n()`
## <chr> <chr> <int>
## 1 APPRAISAL DRY 19
## 2 APPRAISAL GAS 10
## 3 APPRAISAL GAS/CONDENSATE 3
## 4 APPRAISAL NOT APPLICABLE 5
## 5 APPRAISAL NOT AVAILABLE 1
## 6 APPRAISAL OIL 75
## 7 APPRAISAL OIL SHOWS 2
## 8 APPRAISAL OIL/GAS 32
## 9 APPRAISAL SHOWS 3
## 10 APPRAISAL <NA> 5
## 11 WILDCAT DRY 158
## 12 WILDCAT GAS 41
## 13 WILDCAT GAS/CONDENSATE 19
## 14 WILDCAT NOT APPLICABLE 7
## 15 WILDCAT OIL 53
## 16 WILDCAT OIL SHOWS 2
## 17 WILDCAT OIL/GAS 33
## 18 WILDCAT OIL/GAS/CONDENSATE 6
## 19 WILDCAT SHOWS 10
## 20 WILDCAT <NA> 7
A few of the categories for content that are not obvious:
Condensate: natural gas condensate - natural gas in liquid form.
Shows: trace amount of hydrocarbons
From these numbers we get a sense of the riskiness of wildcat drilling. A considerable majority have been dry wells. Another handful, designated as Shows find only trace amounts of oil and gas.
We can do a quick visualization of the categories of wildcat drills:
exploration %>% filter(wlbPurpose == "WILDCAT") %>% ggplot +
geom_bar(mapping = aes(x=wlbContent))
Notice here that the figure is not a direct plot of the data, but instead a plot of a calculated statistic (a count). You can read more about this here
Lets look at exploration drills per year
First, we want to add a variable called year. We extract this from the date variable. We use the year() function from the package lubridate.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
exploration["year"] = year(exploration$wlbEntryDate)
Now I’d like to calculate out how many wells were drilled each year.
exp_per_year = exploration %>% group_by(year) %>% summarize(n())
colnames(exp_per_year)[2] = "number_wells"
exp_per_year
## # A tibble: 11 × 2
## year number_wells
## <dbl> <int>
## 1 2009 5
## 2 2010 46
## 3 2011 52
## 4 2012 43
## 5 2013 59
## 6 2014 57
## 7 2015 56
## 8 2016 38
## 9 2017 36
## 10 2018 53
## 11 2019 46
Why do you think there were so few wells in 2009?
We can make a quick graph of this, with one modification: Let us only look at wildcat wells:
exploration %>% filter(wlbPurpose=="WILDCAT") %>% ggplot + geom_bar(mapping = aes(x=year))
What trends do you see here?
Why do you think the number of wildcat wells went down after about 2014?
Now lets hypothesize (with good reason) that the decision to drill an exploratory well is related to the financial condition of a drilling company. We want to look closer at this question, so we aim at combining our exploration drilling data with financial data.
We read in Norwegian register data on oil and gas companies. This data, like in lab 2, is originally from the database Proff Forvalt, which you have access to as NHH students.
AS_data = read_csv("https://jmaurit.github.io/analytics/labs/data/accounting_cleanData.csv")
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 14311 Columns: 55
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): status, founded_date, municipality, NACE_desc, principality, city,...
## dbl (30): comp_id, year, employees, zip, comp_num, NACE_id, share_capital, t...
## lgl (1): auditor_notes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
AS_data["year"] = as.numeric(AS_data$year)
One small note here: Notice that I use read.csv and not read_csv. read.csv is the base command - meaning it comes with r, not requiring an extra package. read_csv is a command that is part of the tidyverse (in the readr package.) read_csv can usually be easier to use and a bit smarter. But sometimes it breaks down. That happened with this data (for reasons I don’t quite understand.) But read.csv worked just fine.
The challenge we have with trying to join our data, is that in the exploration data we are told the drilling operator (wlbDrillingOperator) name, as well as the production license name.
The AS data also includes company name, but it might be risky to try to combine the data sets based on a written name: Company names can be misspelled or have inconsistent cases, etc. Ideally, we would like to match on some sort of common id. For example, the AS data includes a company id that identifies the company in the Norwegian company register.
A starting point would be to match this company id with the Norwegian Petroleum directorate ID for companies active on the NCS. This data I can get from the NPD webpages. We download it in slightly cleaned format from my website:
company = read_csv("https://jmaurit.github.io/analytics/labs/data/company.csv")
## Rows: 777 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): LongName, Orgnr, ShortName, NationCode, SurveyPrefix, LicenceOperC...
## dbl (1): NPD_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In this data set, the column Orgnr refers to the company id in the Norwegian register. NPD_id is the NPD company id.
Since the company id variable is in numeric format in the AS data, I will also format the id as a variable in the company data set (this is actually a little data science no-no: Formatting a non-number as a number. Why?)
company["Orgnr"] = as.numeric(company$Orgnr)
Then we want to do two things.
We are only interested in a handful of the columns in company, so I want to use the comamnd select to select only those columns that I want.
Then we use the function inner_join to combine my data set AS_data with my selected columns from the company data. I use the common columns Orgnr and comp_id to match the data.
I use the pipe (%>%) to tie together these commands.
AS_merged = company %>% select(Orgnr, NPD_id, LongName, ShortName) %>% inner_join(AS_data, by = c("Orgnr"="comp_id"))
How is the command inner_join different from the similar commands left_join, and outer_join?
We still have a problem matching company data to the exploration data, since the exploration data does not include the NPD_id of the company. But it does include information on the license. If we can match the license to the NPD_id, then we can merge the two data sets.
License data is available from the NPD
licence = read_csv("http://jmaurit.github.io/analytics/labs/data/licence.csv")
## Rows: 26342 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): prlName, prlLicenseeDateValidFrom, prlLicenseeDateValidTo, cmpLongN...
## dbl (4): prlLicenseeInterest, prlLicenseeSdfi, prlNpdidLicence, cmpNpdidCompany
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
So first we want to merge our exploration data with our license data. Then we can use that information to merge exploration data with company financials.
First, let’s give our license data columns some simpler names:
colnames(licence) = c("prlName", "ValidFrom", "ValidTo", "LongName",
"licenseeInterest", "licenseeSdfi",
"operValidFrom", "operDateValidTo",
"NpdidLicence", "cmpNpdidCompany",
"LicenseeDateUpdated", "DatesyncNPD")
Here we first format the date variables that tell us from when and to an exploration license was valid.
licence["ValidFrom"] = as.Date(licence$ValidFrom, format="%d.%m.%Y")
licence["ValidTo"] = as.Date(licence$ValidTo, format="%d.%m.%Y")
For licenses that are still valid, I want to replace null values (NA) in the ValidTo column to the date 1.1.2020. In a second it should become clear why.
licence[is.na(licence$ValidTo),"ValidTo"] = as.Date("2020-01-01")
When we join the exploration and license data set, we want to match them not just on the company id, but also on date. The problem is that the license data has two date columns: ValidFrom and ValidTo.
We solve this matching problem in two steps.
We match (or join) the exploration and license data with left_join, which keeps all the exploration data and matches it with all relevant license data.
We then take the newly joined data and filter the entries that match our data criteria. That is where the well-bore entry date lies within the valid license period:
expl_lic = exploration %>%
left_join(licence, by = c("wlbProductionLicence" = "prlName")) %>%
filter(wlbEntryDate >= ValidFrom, wlbEntryDate < ValidTo)
If we explore the new data set we have created, we notice that for each exploration well, we can have multiple companies attached. This is because the licenses can have shared ownership.
One potential solution, is to only include the observations where the licensee is also the drilling operator. This is an imperfect solution, but we try it for now:
expl_oper = expl_lic %>% filter(wlbDrillingOperator==LongName)
Now just a little cleaning - we change the column with the NPD company id to have the name NPD_id to match the AS data.
colnames(expl_oper)[22] = "NPD_id"
Finally, we are ready to match (join) our exploration and company financial data!
expl_AS = expl_oper %>% inner_join(AS_merged, by=c("NPD_id", "year"))
One problem with the data set we have put together is that it is at the unit of exploration. That is, for each year we can have several observations per company per year. If we are mainly interested in the company as the unit of observation, then this is not ideal.
Therefor, let us aggregate the exploration data to the company-year level. Then we can again proceed to merge it with company financials (which are already at the company-year level.)
Now lets calculate number of exploration wells per year per company
compWells = expl_oper %>% group_by(NPD_id, year) %>% summarise(n())
## `summarise()` has grouped output by 'NPD_id'. You can override using the
## `.groups` argument.
colnames(compWells)[3] = "numbWells"
explAgg_AS = compWells %>% inner_join(AS_merged, by=c("NPD_id", "year"))
Finally, we can start to do some simple analysis of exploration and company financials.
I might have a hypothesis, for example, that bigger firms will tend to drill more wells. Let us quickly check this.
explAgg_AS %>% ggplot(aes(x=total_assets,y=numbWells)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
We see some correlation here, but with lots of variation. Some small companies drill a lot. Some big companies drill very little.
We might also wonder which of the data points above come from the same company. We can try to display information about that in the graph as well.
explAgg_AS %>% ggplot(aes(x=total_assets,y=numbWells, color=factor(NPD_id))) +
geom_point(show.legend=FALSE)
Here we see that the points on the far right side all come from the same company. Let us see what that company is:
explAgg_AS %>% filter(total_assets > 3e8) %>% select(name) %>%unique()
## Adding missing grouping variables: `NPD_id`
## # A tibble: 1 × 2
## # Groups: NPD_id [1]
## NPD_id name
## <dbl> <chr>
## 1 17237817 EQUINOR ENERGY AS
No big surprise: Equinor.
Instead of size, let us look at profitability.
explAgg_AS %>% ggplot(aes(x=profitability,y=numbWells, color=factor(NPD_id))) +
geom_point(show.legend=FALSE)
## Warning: Removed 3 rows containing missing values (geom_point).
Here we tend to see a positive correlation as well. Firms with poor profitability tend to drill less.
Let us try something else with our data. Let us try to see some time trends for each company:
explAgg_AS %>% ggplot(aes(x=year, y=numbWells, color=factor(NPD_id))) +
geom_line(show.legend=FALSE)
Here we seem to see several trends.
A couple of companies are responsible for large parts of drilling. These companies generally ramped up their drilling towards about 2012-2013, and since then have reduced their drilling.
Other, smaller companies, were responsible for more drilling up to about 2014, and then again after 2016.
Small companies seem to be taking over much of the drilling activity after about 2016.
When we merged the company data (AS_merged) with the exploration data expl_oper, we lost approximately 90 observations. Create a data frame of these 90 observations. Why were they dropped? What do they have in common? Are the loss of these observations problematic for our further analysis? How so?
The price of brent crude oil is something that might well be pertinent to this data. Find a series for the average yearly (brent) oil price. Here for example. Import the data, and merge it with the data set. Compare the oil price with the number of exploration drills, preferably in the form of a plot.
Norway has special CO2 taxes for the petroleum sector. The rules can be a bit complicated but here is a quick history of the CO2-tax on burning of natural gas based on this document from the tax authority and information on current (2020) taxes and fees from the government:
CO2_tax = tibble(
year=2009:2020,
CO2_tax_NG_krSM3=c(0.49, 0.51, 0.44, 0.45, 0.46, 0.66,.82, 0.84, 0.90, 1, 1, 1.08)
)
CO2_tax %>% ggplot(aes(x=year, y=CO2_tax_NG_krSM3)) +
geom_line()
Do you think that this may effect exploratory drilling in the period studied? Do you think a correlation could be estimated with the given data (and with the addition of CO2-tax data)? Why or why not? You can read more about the CO2 tax here.
reg1 = lm(numbWells ~ total_assets + profitability, data=explAgg_AS)
summary(reg1)
How do you interpret the results?
What are the potential problems with this regression?
Are there other variables that would be interesting/important to include? If so, include them. Interpret the results.