R for Data Science: Ch 1 through 4, 7
In this lab, we will get an introduction in how to import, organize and visualize data. There is a steep learning curve if you haven’t used R or similar type of program before, so don’t worry if you feel a bit frustrated in the beginning.
In this lab, and throughout the course I will be referring to the online book, R for Data Science. The web version of this book is free to use.
We will be using a group of packages, called the tidyverse. These generally help make working with data in R easier.
If we haven’t installed tidyverse on our computer before, we run:
install.packages("tidyverse")
And then we need to open up the packages before we can use them:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
tidyverse consists of the following packages:
Significant amounts of oil and gas was first found on the Norwegian Continental Shelf (NCS) on December 24th of 1969, when a “wild cat” (or exploratory) drilling by Phillips petroleum in effect discovered what became known as the Ekofisk field in the North Sea. Ekofisk is, to this day, the largest oil field discovery on the NCS. Subsequent large fields were found, notably the second largest field, Statfjord.
In the early days of the Norwegian industry, foreign and especially American firms tended to dominate. The American firms, such as Exxon, had the most know-how about drilling and extraction in difficult deep water environments that came from their experience in the Gulf of Mexico. But the stormy Norwegian portion of the North Sea provided new challenges to these firms.
From the beginning, the Norwegian government had plans to make sure that the benefits of the oil and gas finds were widely distributed, and led to a long term economic development in Norway.
A Norwegian state oil company was established, called Statoil (now Equinor), of which all foreign oil companies needed to partner with and transfer technology and know how. Initially, Statoil was wholly state-owned, but shares in the company were sold in 2001. Currently the state owns 67% and the Norwegian pension fund (folketrygdfondet) owns 3.45%. The rest of the company ownership is through publicly traded shares. Officially, the government has oversight over the company, but it is allowed a great deal of freedom to operate as an independent company. It has generally been considered to be one of the best-run oil companies, though it has recently made several poor investments in other countries, notably a loss-making venture in Canadian oil-sands.
The state also owns (passive) shares of every oil field on the Norwegian continental shelf through the SDFI (“Statens Direkt Finansiell Interesse” - The State’s direct financial interest). This is managed by the state-owned holding company Petoro, also located in Stavanger.
The south-western coastal city of Stavanger was chosen as the center of the petroleum industry, and Statoil opened up a headquarters in the city, together with the Norwegian Petroleum Directorate (established in 1972)
Rules were put in place establishing a minimum number of local employees to be used in the search and extraction process. Degrees in petroleum engineering and geology were established in local universities in order to build local competence.
A high resource tax was established under the idea that the petroleum resources that are being extracted are collectively owned by the people of Norway. This resource tax has been incredibly stable throughout its existence. It was established at 50% of extraction profits and stayed at 50% until 2014, when it was gradually raised to 56%, where it stands now. However, this increase was purely to compensate for a decrease in tax on profits which fell from 28% to 22% in this period.
Oil companies face a marginal tax rate on profits of 78% (22% tax on profits and 56% resource tax.) However, the tax law is designed to be investment “neutral.” That is, it is designed so that investments that would have taken place without the imposition of taxes take place after the imposition of taxes. Among the tax rules put in place include generous write-down provisions for search-costs.
In 1990, the oil fund (officially called the Norwegian Pension Fund Global) was established, though because of low oil prices, it was not until around 1996 that significant revenues started flowing into the fund. According to law, all direct government revenues from the oil industry (taxes + dividends from ownership stakes) is placed as principle in the fund. The fund invests exclusively in foreign stocks, bonds and property. This is meant as a way of avoiding an overly strong Norwegian krone, that could weaken the position of other export industries in Norway. It has also served as a form of currency hedge, as the the relative size of the fund grows during periods of financial upheaval, when the Norwegian kroner tends to fall in value. The oil fund currently owns around 1.5% of all the world’s listed companies (plus a sizeable amount of bonds and private property.) The Norwegian population of around 5 million makes up approximately .05% of the world population.
According to the “handlingsregel”, or “transaction rule”, the government can use an over-the-cycle average of 4% (subsequently lowered to 3%) of the fund in the yearly budget. Assuming a long-term rate of return of 3% in the fund, this means that the principle in the fund will not sink, even after revenues stop flowing into the fund, and that therefor the benefits of Norway’s petroleum will flow to future generations.
Drilling and extraction of oil and gas is subject to a formal and at times lengthy planning and regulatory process.
Opening. First, before an area on the NCS is opened to exploration and drilling, the Energy and Petroleum Department in the government needs to do an evaluation of the area that takes into account economic, social and environmental factors in opening the area to drilling.
Concessions/Licensing. Once an area is opened to drilling, the government divides the area into blocks, where a license/concession to explore and develop the block is auctioned off (generally, such auctions occur every other year.) Usually, the license is given to a group of companies, but the government also confirms an operator who will be responsible for actual drilling and exploration. The auctioning of concessions rounds are generally broken up into two categories: Numbered concessions and TFO concessions.
Numbered concessions are generally of unexplored areas, and are auctioned off sequentially. This is meant to make the exploration of unexplored areas as efficient as possible
TFO concessions are in mature areas, where there already exists a good amount of knowledge about geology, and a lot of infrastructure is already in place. Companies are free to bid on all available blocks in these areas during licensing rounds.
3.) Exploration. Once a license is obtained, then the licensee is given an initial 2 year period where they are required to explore the area (this to avoid, for example, a company strategically buying up and sitting on licenses in order exclude competitors.)
4.) Build-out and operation planning. If a licensee makes an exploitable find, then they can apply to to get an extension of their license in order to build-out and extract oil and/or gas. Extensions are usually for 30 years. In addition, the licensee must submit a detailed and lengthy plan for how they will build out and operate the field. This includes in-depth environmental impact report.
5.) Build-out and operation. Once a plan has been approved by the NPD, build-out can commence. In general, licensee-holders have an incentive to build-out as fast as they can, as developing an offshore oil-field is extremely capital-intensive. Thus the licensee-holders wish to start generating revenue from the field as soon as possible.
6.) Field closing. Licensees are required to submit a plan for field closing and clean up after extraction is done along with their plan for build-out and operation. The licensee is bound by law to these plans and must also set aside financing.
A nice feature in r is that you can load in data directly from the web, if it is in a machine readable format (.csv, .txt, etc).
We use the function read_csv from tidyverse to import data on oil fields on the Norwegian Continental Shelf. This data is freely available and regularly updated at the website of the Norwegian Petroleum Directorate, but for simplicity we will work with a dataset that I have already cleaned.
Note that there is also a read.csv function in base-r. These work in similar manners, but I think read_csv is a bit easier to use. You can read more about reading in data here
oil_fields = data.frame(read_csv("http://jmaurit.github.io/analytics/labs/data/oil_fields_cross.csv"))
## New names:
## Rows: 101 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): name dbl (6): ...1, lon, lat, recoverable_oil, remaining_oil, total.invest
## date (1): producing_from
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Now that we have imported the data and put it in the object called oil_fields, we can look at it in a few different ways. One is to click on the dataframe in the upper right-hand side box under environment and data. This brings up a window where you can scroll through the data.
You could also do this by typing in the command:
View(oil_fields)
Another option is to just display the first 5 rows of the data by typing:
head(oil_fields, 5)
## ...1 name lon lat recoverable_oil remaining_oil
## 1 1 33/9-6 DELTA NA NA NA NA
## 2 6 ALBUSKJELL 2.993019 56.63079 7.4 0.0
## 3 46 ALVE 7.827906 65.96924 1.9 0.8
## 4 54 ALVHEIM 2.010224 59.57465 37.2 17.5
## 5 64 ÅSGARD 6.728537 65.11635 100.4 18.6
## producing_from total.invest
## 1 <NA> 0
## 2 1979-05-26 2752
## 3 2009-03-19 3864
## 4 2008-06-08 17708
## 5 1999-05-19 65301
Or just write the object name:
oil_fields
## ...1 name lon lat recoverable_oil remaining_oil
## 1 1 33/9-6 DELTA NA NA NA NA
## 2 6 ALBUSKJELL 2.993019 56.63079 7.4 0.0
## 3 46 ALVE 7.827906 65.96924 1.9 0.8
## 4 54 ALVHEIM 2.010224 59.57465 37.2 17.5
## 5 64 ÅSGARD 6.728537 65.11635 100.4 18.6
## 6 82 ATLA 2.577873 59.67057 0.3 0.3
## 7 85 BALDER 2.446895 59.30631 72.1 16.2
## 8 109 BLANE 2.473427 56.88505 0.8 0.3
## 9 118 BØYLA 1.848627 59.32033 3.4 3.4
## 10 121 BRAGE 3.014216 60.58058 59.3 4.0
## 11 146 BRYNHILD 2.297317 57.49330 3.6 3.6
## 12 149 COD 2.422911 57.06647 2.9 0.0
## 13 190 DRAUGEN 7.756043 64.33243 146.7 14.9
## 14 216 EDDA 3.103594 56.47110 4.8 0.0
## 15 257 EKOFISK 3.218287 56.54663 569.2 129.8
## 16 301 ELDFISK 3.242870 56.39870 137.9 37.3
## 17 341 EMBLA 3.249919 56.32710 11.9 1.5
## 18 365 ENOCH 1.523260 58.64759 0.4 0.1
## 19 374 FRAM 3.528545 61.01547 30.7 5.9
## 20 391 FRIGG 2.094430 59.91108 0.0 0.0
## 21 428 FRØY 2.556764 59.73203 5.6 0.0
## 22 450 GAUPE 1.931471 57.96129 0.2 0.1
## 23 454 GIMLE 2.317010 61.24770 3.0 0.4
## 24 464 GINA KROG NA NA NA NA
## 25 468 GJØA 3.945822 61.33418 11.6 6.4
## 26 485 GLITNE 1.647140 58.71153 8.9 0.0
## 27 499 GOLIAT 22.296546 71.27406 30.2 30.2
## 28 504 GRANE 2.506074 59.18574 124.6 36.1
## 29 520 GUDRUN 1.751611 58.84938 11.7 11.7
## 30 524 GULLFAKS 2.130364 61.20896 365.5 11.6
## 31 557 GULLFAKS SØR 2.215286 61.10673 58.8 16.6
## 32 575 GUNGNE 1.917072 58.30961 0.0 0.0
## 33 594 GYDA 3.089733 56.85435 36.5 0.9
## 34 620 HEIDRUN 7.306980 65.32944 182.1 40.1
## 35 648 HEIMDAL 2.226203 59.58547 8.2 1.6
## 36 681 HOD 3.472108 56.19843 10.4 0.9
## 37 707 HULDRA 2.641797 60.85444 5.1 0.0
## 38 723 HYME 7.536641 64.34594 3.2 3.2
## 39 726 ISLAY NA NA 0.0 0.0
## 40 730 JETTE 2.359870 59.39335 1.5 1.5
## 41 733 JOTUN 2.325348 59.42954 23.4 0.7
## 42 752 KNARR 2.781421 61.80114 11.9 11.9
## 43 755 KRISTIN 6.433011 64.97139 22.9 6.0
## 44 768 KVITEBJØRN 2.509959 61.09226 27.3 9.8
## 45 782 LILLE-FRIGG 2.382012 59.95617 1.3 0.0
## 46 802 MARTIN LINGE 2.054545 60.50824 6.0 6.0
## 47 808 MARULK 7.586522 65.95250 0.7 0.6
## 48 813 MIKKEL 7.669056 64.68017 6.6 3.3
## 49 826 MIME 2.494998 57.12693 0.4 0.0
## 50 832 MORVIN 6.460434 65.15322 9.3 5.6
## 51 839 MURCHISON 1.744179 61.40133 13.9 0.0
## 52 879 NJORD 7.085968 64.28369 28.5 3.0
## 53 898 NORDØST FRIGG 2.273026 60.00514 0.0 0.0
## 54 917 NORNE 8.147307 66.03421 90.8 3.8
## 55 937 ODIN 2.172505 60.06075 0.0 0.0
## 56 960 ORMEN LANGE 5.349480 63.49258 0.0 0.0
## 57 971 OSEBERG 2.722425 60.54786 384.6 22.7
## 58 1002 OSEBERG ØST 2.978614 60.61807 26.7 7.9
## 59 1020 OSEBERG SØR 2.760004 60.25226 61.0 17.2
## 60 1039 OSELVAR 2.717955 56.93144 4.6 4.5
## 61 1044 ØST FRIGG 2.354428 59.91862 0.0 0.0
## 62 1068 REV 1.932832 58.02975 0.7 0.0
## 63 1081 RINGHORNE ØST 2.494802 59.29083 15.5 6.3
## 64 1090 SIGYN 2.018044 58.28605 0.0 0.0
## 65 1104 SKARV 7.627655 65.72231 15.5 15.5
## 66 1116 SKIRNE 2.491161 59.61747 2.2 0.5
## 67 1129 SKULD 8.314253 66.23375 13.4 13.4
## 68 1132 SLEIPNER ØST 1.967779 58.43631 0.0 0.0
## 69 1159 SLEIPNER VEST 1.730038 58.44112 0.0 0.0
## 70 1182 SNØHVIT 20.974280 71.49443 0.0 0.0
## 71 1194 SNORRE 2.193464 61.49258 249.9 64.1
## 72 1220 STATFJORD 1.866338 61.26222 570.4 4.3
## 73 1260 STATFJORD NORD 1.940215 61.44457 39.5 3.0
## 74 1284 STATFJORD ØST 2.003245 61.33880 36.8 0.7
## 75 1308 SYGNA 2.003040 61.47588 11.0 1.1
## 76 1323 TAMBAR 2.963593 56.97054 9.5 0.6
## 77 1337 TAMBAR ØST 3.002542 56.97172 0.3 0.0
## 78 1344 TOMMELITEN GAMMA 2.927906 56.48807 3.9 0.0
## 79 1369 TOR 3.329413 56.64148 24.3 0.4
## 80 1409 TORDIS 2.117935 61.28220 61.2 6.0
## 81 1432 TROLL 3.552057 60.82293 263.8 36.0
## 82 1460 TRYM 4.236097 56.40414 1.5 0.8
## 83 1465 TUNE 2.653474 60.43847 3.3 0.0
## 84 1481 TYRIHANS 7.000546 64.79362 32.4 16.5
## 85 1491 ULA 2.860582 57.10273 87.9 15.7
## 86 1522 URD 8.181120 66.05629 7.0 2.0
## 87 1533 VALE 2.302589 59.70584 2.4 1.1
## 88 1549 VALEMON 2.355820 61.04258 4.9 4.9
## 89 1553 VALHALL 3.393803 56.27577 147.4 41.5
## 90 1590 VARG 1.908065 58.07247 16.4 1.4
## 91 1608 VEGA 3.353838 61.35584 6.6 5.1
## 92 1616 VESLEFRIKK 2.891658 60.78827 54.1 1.8
## 93 1643 VEST EKOFISK 3.085798 56.56262 12.2 0.0
## 94 1684 VIGDIS 2.147171 61.40055 66.6 15.0
## 95 1704 VILJE 2.275522 59.66342 13.6 6.2
## 96 1713 VISUND 2.558184 61.41580 33.9 11.5
## 97 1732 VISUND SØR 2.345196 61.29676 2.7 2.7
## 98 1735 VOLUND 1.946607 59.48411 8.6 4.0
## 99 1745 VOLVE 1.890830 58.43905 8.7 1.0
## 100 1754 YME 4.362773 57.74885 22.0 14.1
## 101 1796 YTTERGRYTA 7.515016 65.11059 0.3 0.0
## producing_from total.invest
## 1 <NA> 0
## 2 1979-05-26 2752
## 3 2009-03-19 3864
## 4 2008-06-08 17708
## 5 1999-05-19 65301
## 6 2012-10-07 324
## 7 1999-10-02 23198
## 8 2007-09-12 544
## 9 <NA> 85
## 10 1993-09-23 17062
## 11 <NA> 210
## 12 1977-12-26 1003
## 13 1993-10-19 23291
## 14 1979-12-02 2157
## 15 1971-06-15 94172
## 16 1979-08-08 34963
## 17 1993-05-12 3129
## 18 2007-05-31 208
## 19 2003-10-02 9875
## 20 1977-09-13 8614
## 21 1995-05-15 5551
## 22 2012-03-31 2085
## 23 2006-05-19 780
## 24 <NA> 0
## 25 2010-11-07 28043
## 26 2001-08-29 2536
## 27 <NA> 10182
## 28 2003-09-23 19746
## 29 <NA> 5857
## 30 1986-12-22 74940
## 31 1998-10-10 28340
## 32 1996-04-21 1930
## 33 1990-06-21 12435
## 34 1995-10-18 52935
## 35 1985-12-13 10064
## 36 1990-09-30 2254
## 37 2001-11-21 7460
## 38 2013-03-02 897
## 39 2012-04-10 578
## 40 2013-05-20 263
## 41 1999-10-25 9484
## 42 <NA> 534
## 43 2005-11-03 24467
## 44 2004-09-26 13707
## 45 1994-05-13 3923
## 46 <NA> 402
## 47 2012-04-02 2173
## 48 2003-08-01 2182
## 49 1993-01-01 324
## 50 2010-08-01 7994
## 51 1980-09-28 2597
## 52 1997-09-30 14888
## 53 1983-12-01 1816
## 54 1997-11-06 23806
## 55 1984-04-01 2614
## 56 2007-09-13 35967
## 57 1988-12-01 66622
## 58 1999-05-03 8827
## 59 2000-02-05 20108
## 60 2012-04-14 3535
## 61 1988-10-01 2186
## 62 2009-01-24 3831
## 63 2006-03-19 738
## 64 2002-12-22 2066
## 65 2013-01-01 37119
## 66 2004-03-03 2685
## 67 2013-03-19 1434
## 68 1993-08-24 26756
## 69 1996-08-29 23176
## 70 2007-08-21 8235
## 71 1992-08-03 65475
## 72 1979-11-24 65264
## 73 1995-01-23 5719
## 74 1994-09-24 6015
## 75 2000-08-01 1984
## 76 2001-07-15 2212
## 77 2007-10-02 1034
## 78 1988-10-03 2406
## 79 1978-06-28 3934
## 80 1994-06-03 12201
## 81 1995-09-19 130620
## 82 2011-02-12 2956
## 83 2002-11-28 4746
## 84 2009-07-08 15203
## 85 1986-10-06 14700
## 86 2005-11-08 5073
## 87 2002-05-31 2495
## 88 <NA> 1484
## 89 1982-10-02 61511
## 90 1998-12-22 7880
## 91 2010-12-02 8503
## 92 1989-12-26 16660
## 93 1977-05-31 943
## 94 1997-01-28 14829
## 95 2008-08-01 1947
## 96 1999-04-21 22136
## 97 2012-11-22 826
## 98 2009-09-10 3185
## 99 2008-02-12 3044
## 100 1996-02-27 23438
## 101 2009-01-05 1460
Oil_fields is an object of a type called data frame. We can think of this as R’s representation of a typical spread-sheet (like Excel) with columns of different categories with different types of data (integer, real numbers (float), and text). Tibble is a special type of data.frame that comes from tidyverse. It has some minor advantages over the built-in r data frame format. But In large part they function similarly. You can read more about tibbles here.
If we want to look at the last few rows of our data frame, we can use the command tail:
tail(oil_fields, n=10)
If we want to get an overview of our data frame, we can use the str function.
str(oil_fields)
If we want to access an individual variable, we can use the $ character:
head(oil_fields$recoverable_oil)
Using the class command, we can see what format a certain variable is in (which we could also see from str for all the variables):
class(oil_fields$producing_from)
This is one of the nice things about the read_csv command from tidyverse, here we see that it automatically recognized the data as a date and converted to date format. Date’s can sometimes be tricky to work with in R. Here is an overview.
Now we can start working with our data.
Lets make a plot of the total amount of recoverable oil in a field and the date it started producing from:
We’ll use the function, ggplot
oil_fields %>% ggplot(aes(x=producing_from, y=recoverable_oil)) +
geom_point()
Let’s go carefully through this command.
First, within the ggplot command I specify which dataset (data frame) I want to use. In this case it is oil_fields.
Within ggplot there is a sub-field calles aes, which stands for aesthetics. Basically, this specifies how we are to translate the numerical data in our data set into a visual format. Here we say that we will represent producing_from as horizontal distance (x-axis) and recoverable_oil as vertical distance (y-axis). As we’ll see, there are other ways we can represent data as well, such as color, and marker size.
When this data is set, we want to add a layer. So we type + and then geom_point. This says we want to represent the data in the form of points. There are many different geoms to choose from, and often you will choose several. You can read more here or from the data-visualisation chapter of R for data science..
We can tidy up our chart a little by making our labels a bit more understandable.
ggplot(oil_fields, aes(x=producing_from, y=recoverable_oil)) +
geom_point() +
labs(x="", y="Recoverable Oil, Million SM3")
Now we state the variable and unit on the y-axis. SM3 stands for standard cubic meter at 15 degrees celsius and under normal atmospheric pressure.
We can also look at a histogram to more clearly see the distribution of finds
ggplot(oil_fields, aes(x=recoverable_oil)) +
geom_histogram()
Here we see clearly that most finds tend to be small, but we have a “fat tail”, with a few super large fields. The outliers you see here are the Statfjord and Ekofisk fields, which were also the first two fields to be found.
Now lets say we want to create a new variable: extracted, which is the difference between the total amount of recoverable oil in each field and the remaining oil in the field (this data is a few years old so the amounts have changed a bit.)
oil_fields$extracted=oil_fields$recoverable_oil-oil_fields$remaining_oil
Now lets create another variable which gives us amount of investment per extracted SM3 of oil:
oil_fields$invest_per_millsm3 = oil_fields$total.invest/oil_fields$extracted
#or
#oil_fields["invest_per_millsm3"]=oil_fields$total.invest/oil_fields$extracted
In the field above, notice that I have written two ways of creating a new variable (corresponding to two ways of accessing a certain variable/column). You can either use the $ marker or use block quotes with the column name in quotes: [“var name”]. The latter method works for any type of variable name, where the former only works if the variable is one word.
Lets try a plot
ggplot(oil_fields, aes(x=recoverable_oil, y=invest_per_millsm3)) +
geom_point()
What does this plot tell us? Are there problems with the variables we have chosen to plot and the way we have chosen to plot them?
Lets try plotting investment per extracted SM3 of oil against the start of production date.
ggplot(oil_fields, aes(x=producing_from, y=log(invest_per_millsm3))) +
geom_point()
Notice here that I transform the y-variable to log. This makes the data a bit more linear and easier to interpret. Basically you can judge the slope of a line as a rate of growth.
What we notice here is that investment per production is lowest in the oldest fields. This has a few different causes.
The most important is that investments (and costs) are front-loaded. A large portion of the costs of extracting oil and gas off-shore is in the initial build-out of the field and related infrastructure. The actual running costs once the field is up-and-running is relatively low. Thus, we see, that costs per extracted oil are exponentially higher for newer fields.
This fact is very important in understanding Norwegian taxation policy. The marginal taxation rate on profits from oil and gas extraction is very high: nearly 80 percent: 22% corporation tax + 56% resource tax (plus other environmental fees, auction rights, etc).
However, petroleum extraction companies are given special taxation rules aimed at making investment in the petroleum industry “neutral.” That means, taxation is designed such that investments that would have been undertaken before tax, would also be taken pre-tax. A few hallmarks of the “neutral” tax system include:
You can read more about taxation from the Norwegian Petroleum Institute
What does the special tax treatment imply about the initial investment costs of Norwegian offshore oil and gas investments?
Who is in effect paying for most of the upfront investment costs? Is this warranted/problematic?
If we know that there was a high risk that oil and gas production would not be profitable (or allowed) in the near future (10-30 years), would the tax system be neutral?
Lets take a look at the largest of the oil fields, the ones that have more than 50 million SM3 of recoverable reserves.
We use the function filter from tidyverse
big_oil_fields = oil_fields %>% filter(recoverable_oil>50)
#or
#big_oil_fields = oil_fields[oil_fields$recoverable_oil>50, ]
The filter command says take the data frame oil_fields and take only the rows where the variable recoverable_oil is more than 50.
There is a similar way to do this in base r, where you use the block quotes. Here the format is mydataframe[rows, columns], thus the code above says to include all rows where the given variable is above 50, and by leaving the columns portion blank, we tell r to include all columns.
Now I can get the names of all the big oil fields:
big_oil_fields$name
## [1] "ÅSGARD" "BALDER" "BRAGE" "DRAUGEN" "EKOFISK"
## [6] "ELDFISK" "GRANE" "GULLFAKS" "GULLFAKS SØR" "HEIDRUN"
## [11] "NORNE" "OSEBERG" "OSEBERG SØR" "SNORRE" "STATFJORD"
## [16] "TORDIS" "TROLL" "ULA" "VALHALL" "VESLEFRIKK"
## [21] "VIGDIS"
But this is in alphabetical order, which isn’t helpful. Let’s rearrange by total recoverable oil. We will use the arrange function from tidyverse
big_oil_fields = arrange(big_oil_fields, desc(recoverable_oil))
big_oil_fields
## ...1 name lon lat recoverable_oil remaining_oil
## 1 1220 STATFJORD 1.866338 61.26222 570.4 4.3
## 2 257 EKOFISK 3.218287 56.54663 569.2 129.8
## 3 971 OSEBERG 2.722425 60.54786 384.6 22.7
## 4 524 GULLFAKS 2.130364 61.20896 365.5 11.6
## 5 1432 TROLL 3.552057 60.82293 263.8 36.0
## 6 1194 SNORRE 2.193464 61.49258 249.9 64.1
## 7 620 HEIDRUN 7.306980 65.32944 182.1 40.1
## 8 1553 VALHALL 3.393803 56.27577 147.4 41.5
## 9 190 DRAUGEN 7.756043 64.33243 146.7 14.9
## 10 301 ELDFISK 3.242870 56.39870 137.9 37.3
## 11 504 GRANE 2.506074 59.18574 124.6 36.1
## 12 64 ÅSGARD 6.728537 65.11635 100.4 18.6
## 13 917 NORNE 8.147307 66.03421 90.8 3.8
## 14 1491 ULA 2.860582 57.10273 87.9 15.7
## 15 85 BALDER 2.446895 59.30631 72.1 16.2
## 16 1684 VIGDIS 2.147171 61.40055 66.6 15.0
## 17 1409 TORDIS 2.117935 61.28220 61.2 6.0
## 18 1020 OSEBERG SØR 2.760004 60.25226 61.0 17.2
## 19 121 BRAGE 3.014216 60.58058 59.3 4.0
## 20 557 GULLFAKS SØR 2.215286 61.10673 58.8 16.6
## 21 1616 VESLEFRIKK 2.891658 60.78827 54.1 1.8
## producing_from total.invest
## 1 1979-11-24 65264
## 2 1971-06-15 94172
## 3 1988-12-01 66622
## 4 1986-12-22 74940
## 5 1995-09-19 130620
## 6 1992-08-03 65475
## 7 1995-10-18 52935
## 8 1982-10-02 61511
## 9 1993-10-19 23291
## 10 1979-08-08 34963
## 11 2003-09-23 19746
## 12 1999-05-19 65301
## 13 1997-11-06 23806
## 14 1986-10-06 14700
## 15 1999-10-02 23198
## 16 1997-01-28 14829
## 17 1994-06-03 12201
## 18 2000-02-05 20108
## 19 1993-09-23 17062
## 20 1998-10-10 28340
## 21 1989-12-26 16660
ggplot(big_oil_fields, aes(x=reorder(name, -recoverable_oil), y=recoverable_oil)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x="", y="Recoverable Oil, SM3")
Notice a few things.
First I have used the reorder command within ggplot to tell ggplot to order from highest to lowest. Otherwise it would revert to alphabetical (try it!).
Second, I use the theme() function to rotate the text on the x-axis so it would be readable. theme() lets you change all sorts of details about your graph if you wish. I always end up having to search for the correct parameters to use, but the answer is usually not hard to find.
In this section I want to show some of the things you can do with R, but which would be hard or impossible with other software like Excel, Stata, etc. I won’t explain all the code, but feel free to go through carefully and see if you can understand the commands.
In our data set we have latitude and longitude of all our oil fields, so we can go ahead and try to make a map of our fields.
First, we will install a package that easily allows us to create maps, called ggmaps
We install the package.
install.packages("ggmap")
#if necessary run: install.packages("devtools")
#library(devtools)
#devtools::install_github("dkahle/ggmap")
Then we also need to load the package.
library(ggmap)
## ℹ Google's Terms of Service: <]8;;https://mapsplatform.google.comhttps://mapsplatform.google.com]8;;>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
ggmap uses map data from stamen maps. The command get_stamenmap() imports a map of a certain area.
northseaCoord= c(left = -5, bottom = 55, right = 10, top = 67)
northseaMap = get_stamenmap(northseaCoord, zoom = 5, maptype = "toner-lite")
Then we can combine our map with the data in our data frame:
ggmap(northseaMap) +
geom_point(aes(x = lon, y = lat, size=recoverable_oil, color=extracted),alpha=.7, data = oil_fields)+
scale_color_continuous(low="red", high="black")
In this course you will be regularily working with assignments as well as a term project. I want you to write these in R markdown format. This is a format that allows you to combine text, equations, code, and results all in one document. In fact, it is R markdown that I used to write these labs.
You can read a bit more about R Markdown here.
And download a cheat sheet here
You already have all the tools you need to get started with R-markdown. In R-studio, go to File -> New File -> R Markdown….
In the pop-up window you can give it an informative title
Choose pdf as desired output, since you will be handing in the output
Creating markdown is pretty straight-forward:
You write text more-or-less as you would in a normal word processor.
Instead of formatting from a menu however, you use characters in the markdown file.
A “*” with a space following creates a bullet-point
Putting text between asteriks: “*text *” makes the text italics
You can refer to the cheat sheet for more styling
To insert code to be run, you can go to Code -> Insert Chunk, then write the code within the box that has been created. It is even better to learn the hot key for inserting code. On a Mac this is Command + Option + i
print("here is some code")
## [1] "here is some code"
There are many possible workflows for doing an analysis in R. But I have a preference for doing an analysis within a markdown text. I run chunks of code, then make notes about the results within the same document.
To render the document as a pdf click on the “knit” button in r-studio (or go to file -> Knit Document).
R-studio requires a version of tex/latex (a technical document rendering program) to create the pdf. If you don’t have it on your computer, the easiest way to install a minimal tex version is to run the following commands in r:
install.packages('tinytex')
tinytex::install_tinytex()
Create a variable invest_per_rec which is investment per million sm3 in recoverable oil (recoverable_oil). Plot these variables against producing_from variable and the original recoverable_oil variable. How do you interpret the results?
Create a list of the 5 “cheapest” oil fields, that is where the investment is lowest per recoverable oil. What do these tend to have in common?
I have a hypothesis that oil fields farther north are more expensive to exploit. Explore this hypothesis. Do you think it has merit?
Open-ended question: Accessing and importing data
Go to the data portion of the Norwegian Petrioleum Directorate
The tabs at the top indicate the different types of data that is available by level/theme. Try to find some interesting dataset and download it as a .csv file. (Hint, on the left-hand panel, go down to “table view”, then you get a table of data, which you can export by clicking on “Export CSV”).
Once you have downloaded the data, import the data into r using the read_csv() command.
If there is a date variable, format that variable as a date (if read_csv() hasn’t automatically done so already)
Plot the data in a meaningful way. Interpret the plot. Is there anything puzzling about the data.
(This question is vague and open-ended. I know, that is the point. It reflects the real-life messiness of starting an analysis with a blank page. Don’t get too frustrated, and try to have some fun.)
p.s. The field or wellbore tabs might be the most interesting tabs to explore. For the data sets in the field tab, the date is provided as separate columns for month and year. You probably want to combine these into a single date variable. Here is some code to help (I told you working with dates can be tricky):
#Install package lubridate if you need to:
#install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#DF: the name of my dataframe
#month: name of column with month data
#year: name of column with year data
#create a data frame with fake data
DF = tibble(
"month" = 1:6,
"year" = 2012:2017
)
#convert to strings
DF["month"] = as.character(DF$month)
DF["year"] = as.character(DF$year)
DF$date = str_c("1", DF$month, DF$year, sep="-") %>% dmy() %>% as.Date()
#DF
DF
## # A tibble: 6 × 3
## month year date
## <chr> <chr> <date>
## 1 1 2012 2012-01-01
## 2 2 2013 2013-02-01
## 3 3 2014 2014-03-01
## 4 4 2015 2015-04-01
## 5 5 2016 2016-05-01
## 6 6 2017 2017-06-01