Background reading:

R for Data Science: Ch 1 through 4, 7

R and the Tidyverse

In this lab, we will get an introduction in how to import, organize and visualize data. There is a steep learning curve if you haven’t used R or similar type of program before, so don’t worry if you feel a bit frustrated in the beginning.

In this lab, and throughout the course I will be referring to the online book, R for Data Science. The web version of this book is free to use.

We will be using a group of packages, called the tidyverse. These generally help make working with data in R easier.

If we haven’t installed tidyverse on our computer before, we run:

install.packages("tidyverse")

And then we need to open up the packages before we can use them:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

tidyverse consists of the following packages:

Norwegian Petroleum Industry and the Norwegian Continental Shelf: A primer

Significant amounts of oil and gas was first found on the Norwegian Continental Shelf (NCS) on December 24th of 1969, when a “wild cat” (or exploratory) drilling by Phillips petroleum in effect discovered what became known as the Ekofisk field in the North Sea. Ekofisk is, to this day, the largest oil field discovery on the NCS. Subsequent large fields were found, notably the second largest field, Statfjord.

In the early days of the Norwegian industry, foreign and especially American firms tended to dominate. The American firms, such as Exxon, had the most know-how about drilling and extraction in difficult deep water environments that came from their experience in the Gulf of Mexico. But the stormy Norwegian portion of the North Sea provided new challenges to these firms.

From the beginning, the Norwegian government had plans to make sure that the benefits of the oil and gas finds were widely distributed, and led to a long term economic development in Norway.

Oil companies face a marginal tax rate on profits of 78% (22% tax on profits and 56% resource tax.) However, the tax law is designed to be investment “neutral.” That is, it is designed so that investments that would have taken place without the imposition of taxes take place after the imposition of taxes. Among the tax rules put in place include generous write-down provisions for search-costs.

The oil fund

In 1990, the oil fund (officially called the Norwegian Pension Fund Global) was established, though because of low oil prices, it was not until around 1996 that significant revenues started flowing into the fund. According to law, all direct government revenues from the oil industry (taxes + dividends from ownership stakes) is placed as principle in the fund. The fund invests exclusively in foreign stocks, bonds and property. This is meant as a way of avoiding an overly strong Norwegian krone, that could weaken the position of other export industries in Norway. It has also served as a form of currency hedge, as the the relative size of the fund grows during periods of financial upheaval, when the Norwegian kroner tends to fall in value. The oil fund currently owns around 1.5% of all the world’s listed companies (plus a sizeable amount of bonds and private property.) The Norwegian population of around 5 million makes up approximately .05% of the world population.

According to the “handlingsregel”, or “transaction rule”, the government can use an over-the-cycle average of 4% (subsequently lowered to 3%) of the fund in the yearly budget. Assuming a long-term rate of return of 3% in the fund, this means that the principle in the fund will not sink, even after revenues stop flowing into the fund, and that therefor the benefits of Norway’s petroleum will flow to future generations.

Licences, auctions, and regulations

Drilling and extraction of oil and gas is subject to a formal and at times lengthy planning and regulatory process.

  1. Opening. First, before an area on the NCS is opened to exploration and drilling, the Energy and Petroleum Department in the government needs to do an evaluation of the area that takes into account economic, social and environmental factors in opening the area to drilling.

  2. Concessions/Licensing. Once an area is opened to drilling, the government divides the area into blocks, where a license/concession to explore and develop the block is auctioned off (generally, such auctions occur every other year.) Usually, the license is given to a group of companies, but the government also confirms an operator who will be responsible for actual drilling and exploration. The auctioning of concessions rounds are generally broken up into two categories: Numbered concessions and TFO concessions.

  • Numbered concessions are generally of unexplored areas, and are auctioned off sequentially. This is meant to make the exploration of unexplored areas as efficient as possible

  • TFO concessions are in mature areas, where there already exists a good amount of knowledge about geology, and a lot of infrastructure is already in place. Companies are free to bid on all available blocks in these areas during licensing rounds.

3.) Exploration. Once a license is obtained, then the licensee is given an initial 2 year period where they are required to explore the area (this to avoid, for example, a company strategically buying up and sitting on licenses in order exclude competitors.)

4.) Build-out and operation planning. If a licensee makes an exploitable find, then they can apply to to get an extension of their license in order to build-out and extract oil and/or gas. Extensions are usually for 30 years. In addition, the licensee must submit a detailed and lengthy plan for how they will build out and operate the field. This includes in-depth environmental impact report.

5.) Build-out and operation. Once a plan has been approved by the NPD, build-out can commence. In general, licensee-holders have an incentive to build-out as fast as they can, as developing an offshore oil-field is extremely capital-intensive. Thus the licensee-holders wish to start generating revenue from the field as soon as possible.

6.) Field closing. Licensees are required to submit a plan for field closing and clean up after extraction is done along with their plan for build-out and operation. The licensee is bound by law to these plans and must also set aside financing.

Importing data on Petroleum in the Norwegian Continental Shelf

A nice feature in r is that you can load in data directly from the web, if it is in a machine readable format (.csv, .txt, etc).

We use the function read_csv from tidyverse to import data on oil fields on the Norwegian Continental Shelf. This data is freely available and regularly updated at the website of the Norwegian Petroleum Directorate, but for simplicity we will work with a dataset that I have already cleaned.

Note that there is also a read.csv function in base-r. These work in similar manners, but I think read_csv is a bit easier to use. You can read more about reading in data here

oil_fields =  data.frame(read_csv("http://jmaurit.github.io/analytics/labs/data/oil_fields_cross.csv"))
## New names:
## Rows: 101 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): name dbl (6): ...1, lon, lat, recoverable_oil, remaining_oil, total.invest
## date (1): producing_from
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

Now that we have imported the data and put it in the object called oil_fields, we can look at it in a few different ways. One is to click on the dataframe in the upper right-hand side box under environment and data. This brings up a window where you can scroll through the data.

You could also do this by typing in the command:

View(oil_fields)

Another option is to just display the first 5 rows of the data by typing:

head(oil_fields, 5)
##   ...1         name      lon      lat recoverable_oil remaining_oil
## 1    1 33/9-6 DELTA       NA       NA              NA            NA
## 2    6   ALBUSKJELL 2.993019 56.63079             7.4           0.0
## 3   46         ALVE 7.827906 65.96924             1.9           0.8
## 4   54      ALVHEIM 2.010224 59.57465            37.2          17.5
## 5   64       ÅSGARD 6.728537 65.11635           100.4          18.6
##   producing_from total.invest
## 1           <NA>            0
## 2     1979-05-26         2752
## 3     2009-03-19         3864
## 4     2008-06-08        17708
## 5     1999-05-19        65301

Or just write the object name:

oil_fields
##     ...1             name       lon      lat recoverable_oil remaining_oil
## 1      1     33/9-6 DELTA        NA       NA              NA            NA
## 2      6       ALBUSKJELL  2.993019 56.63079             7.4           0.0
## 3     46             ALVE  7.827906 65.96924             1.9           0.8
## 4     54          ALVHEIM  2.010224 59.57465            37.2          17.5
## 5     64           ÅSGARD  6.728537 65.11635           100.4          18.6
## 6     82             ATLA  2.577873 59.67057             0.3           0.3
## 7     85           BALDER  2.446895 59.30631            72.1          16.2
## 8    109            BLANE  2.473427 56.88505             0.8           0.3
## 9    118            BØYLA  1.848627 59.32033             3.4           3.4
## 10   121            BRAGE  3.014216 60.58058            59.3           4.0
## 11   146         BRYNHILD  2.297317 57.49330             3.6           3.6
## 12   149              COD  2.422911 57.06647             2.9           0.0
## 13   190          DRAUGEN  7.756043 64.33243           146.7          14.9
## 14   216             EDDA  3.103594 56.47110             4.8           0.0
## 15   257          EKOFISK  3.218287 56.54663           569.2         129.8
## 16   301          ELDFISK  3.242870 56.39870           137.9          37.3
## 17   341            EMBLA  3.249919 56.32710            11.9           1.5
## 18   365            ENOCH  1.523260 58.64759             0.4           0.1
## 19   374             FRAM  3.528545 61.01547            30.7           5.9
## 20   391            FRIGG  2.094430 59.91108             0.0           0.0
## 21   428             FRØY  2.556764 59.73203             5.6           0.0
## 22   450            GAUPE  1.931471 57.96129             0.2           0.1
## 23   454            GIMLE  2.317010 61.24770             3.0           0.4
## 24   464        GINA KROG        NA       NA              NA            NA
## 25   468             GJØA  3.945822 61.33418            11.6           6.4
## 26   485           GLITNE  1.647140 58.71153             8.9           0.0
## 27   499           GOLIAT 22.296546 71.27406            30.2          30.2
## 28   504            GRANE  2.506074 59.18574           124.6          36.1
## 29   520           GUDRUN  1.751611 58.84938            11.7          11.7
## 30   524         GULLFAKS  2.130364 61.20896           365.5          11.6
## 31   557     GULLFAKS SØR  2.215286 61.10673            58.8          16.6
## 32   575           GUNGNE  1.917072 58.30961             0.0           0.0
## 33   594             GYDA  3.089733 56.85435            36.5           0.9
## 34   620          HEIDRUN  7.306980 65.32944           182.1          40.1
## 35   648          HEIMDAL  2.226203 59.58547             8.2           1.6
## 36   681              HOD  3.472108 56.19843            10.4           0.9
## 37   707           HULDRA  2.641797 60.85444             5.1           0.0
## 38   723             HYME  7.536641 64.34594             3.2           3.2
## 39   726            ISLAY        NA       NA             0.0           0.0
## 40   730            JETTE  2.359870 59.39335             1.5           1.5
## 41   733            JOTUN  2.325348 59.42954            23.4           0.7
## 42   752            KNARR  2.781421 61.80114            11.9          11.9
## 43   755          KRISTIN  6.433011 64.97139            22.9           6.0
## 44   768       KVITEBJØRN  2.509959 61.09226            27.3           9.8
## 45   782      LILLE-FRIGG  2.382012 59.95617             1.3           0.0
## 46   802     MARTIN LINGE  2.054545 60.50824             6.0           6.0
## 47   808           MARULK  7.586522 65.95250             0.7           0.6
## 48   813           MIKKEL  7.669056 64.68017             6.6           3.3
## 49   826             MIME  2.494998 57.12693             0.4           0.0
## 50   832           MORVIN  6.460434 65.15322             9.3           5.6
## 51   839        MURCHISON  1.744179 61.40133            13.9           0.0
## 52   879            NJORD  7.085968 64.28369            28.5           3.0
## 53   898    NORDØST FRIGG  2.273026 60.00514             0.0           0.0
## 54   917            NORNE  8.147307 66.03421            90.8           3.8
## 55   937             ODIN  2.172505 60.06075             0.0           0.0
## 56   960      ORMEN LANGE  5.349480 63.49258             0.0           0.0
## 57   971          OSEBERG  2.722425 60.54786           384.6          22.7
## 58  1002      OSEBERG ØST  2.978614 60.61807            26.7           7.9
## 59  1020      OSEBERG SØR  2.760004 60.25226            61.0          17.2
## 60  1039          OSELVAR  2.717955 56.93144             4.6           4.5
## 61  1044        ØST FRIGG  2.354428 59.91862             0.0           0.0
## 62  1068              REV  1.932832 58.02975             0.7           0.0
## 63  1081    RINGHORNE ØST  2.494802 59.29083            15.5           6.3
## 64  1090            SIGYN  2.018044 58.28605             0.0           0.0
## 65  1104            SKARV  7.627655 65.72231            15.5          15.5
## 66  1116           SKIRNE  2.491161 59.61747             2.2           0.5
## 67  1129            SKULD  8.314253 66.23375            13.4          13.4
## 68  1132     SLEIPNER ØST  1.967779 58.43631             0.0           0.0
## 69  1159    SLEIPNER VEST  1.730038 58.44112             0.0           0.0
## 70  1182          SNØHVIT 20.974280 71.49443             0.0           0.0
## 71  1194           SNORRE  2.193464 61.49258           249.9          64.1
## 72  1220        STATFJORD  1.866338 61.26222           570.4           4.3
## 73  1260   STATFJORD NORD  1.940215 61.44457            39.5           3.0
## 74  1284    STATFJORD ØST  2.003245 61.33880            36.8           0.7
## 75  1308            SYGNA  2.003040 61.47588            11.0           1.1
## 76  1323           TAMBAR  2.963593 56.97054             9.5           0.6
## 77  1337       TAMBAR ØST  3.002542 56.97172             0.3           0.0
## 78  1344 TOMMELITEN GAMMA  2.927906 56.48807             3.9           0.0
## 79  1369              TOR  3.329413 56.64148            24.3           0.4
## 80  1409           TORDIS  2.117935 61.28220            61.2           6.0
## 81  1432            TROLL  3.552057 60.82293           263.8          36.0
## 82  1460             TRYM  4.236097 56.40414             1.5           0.8
## 83  1465             TUNE  2.653474 60.43847             3.3           0.0
## 84  1481         TYRIHANS  7.000546 64.79362            32.4          16.5
## 85  1491              ULA  2.860582 57.10273            87.9          15.7
## 86  1522              URD  8.181120 66.05629             7.0           2.0
## 87  1533             VALE  2.302589 59.70584             2.4           1.1
## 88  1549          VALEMON  2.355820 61.04258             4.9           4.9
## 89  1553          VALHALL  3.393803 56.27577           147.4          41.5
## 90  1590             VARG  1.908065 58.07247            16.4           1.4
## 91  1608             VEGA  3.353838 61.35584             6.6           5.1
## 92  1616       VESLEFRIKK  2.891658 60.78827            54.1           1.8
## 93  1643     VEST EKOFISK  3.085798 56.56262            12.2           0.0
## 94  1684           VIGDIS  2.147171 61.40055            66.6          15.0
## 95  1704            VILJE  2.275522 59.66342            13.6           6.2
## 96  1713           VISUND  2.558184 61.41580            33.9          11.5
## 97  1732       VISUND SØR  2.345196 61.29676             2.7           2.7
## 98  1735           VOLUND  1.946607 59.48411             8.6           4.0
## 99  1745            VOLVE  1.890830 58.43905             8.7           1.0
## 100 1754              YME  4.362773 57.74885            22.0          14.1
## 101 1796       YTTERGRYTA  7.515016 65.11059             0.3           0.0
##     producing_from total.invest
## 1             <NA>            0
## 2       1979-05-26         2752
## 3       2009-03-19         3864
## 4       2008-06-08        17708
## 5       1999-05-19        65301
## 6       2012-10-07          324
## 7       1999-10-02        23198
## 8       2007-09-12          544
## 9             <NA>           85
## 10      1993-09-23        17062
## 11            <NA>          210
## 12      1977-12-26         1003
## 13      1993-10-19        23291
## 14      1979-12-02         2157
## 15      1971-06-15        94172
## 16      1979-08-08        34963
## 17      1993-05-12         3129
## 18      2007-05-31          208
## 19      2003-10-02         9875
## 20      1977-09-13         8614
## 21      1995-05-15         5551
## 22      2012-03-31         2085
## 23      2006-05-19          780
## 24            <NA>            0
## 25      2010-11-07        28043
## 26      2001-08-29         2536
## 27            <NA>        10182
## 28      2003-09-23        19746
## 29            <NA>         5857
## 30      1986-12-22        74940
## 31      1998-10-10        28340
## 32      1996-04-21         1930
## 33      1990-06-21        12435
## 34      1995-10-18        52935
## 35      1985-12-13        10064
## 36      1990-09-30         2254
## 37      2001-11-21         7460
## 38      2013-03-02          897
## 39      2012-04-10          578
## 40      2013-05-20          263
## 41      1999-10-25         9484
## 42            <NA>          534
## 43      2005-11-03        24467
## 44      2004-09-26        13707
## 45      1994-05-13         3923
## 46            <NA>          402
## 47      2012-04-02         2173
## 48      2003-08-01         2182
## 49      1993-01-01          324
## 50      2010-08-01         7994
## 51      1980-09-28         2597
## 52      1997-09-30        14888
## 53      1983-12-01         1816
## 54      1997-11-06        23806
## 55      1984-04-01         2614
## 56      2007-09-13        35967
## 57      1988-12-01        66622
## 58      1999-05-03         8827
## 59      2000-02-05        20108
## 60      2012-04-14         3535
## 61      1988-10-01         2186
## 62      2009-01-24         3831
## 63      2006-03-19          738
## 64      2002-12-22         2066
## 65      2013-01-01        37119
## 66      2004-03-03         2685
## 67      2013-03-19         1434
## 68      1993-08-24        26756
## 69      1996-08-29        23176
## 70      2007-08-21         8235
## 71      1992-08-03        65475
## 72      1979-11-24        65264
## 73      1995-01-23         5719
## 74      1994-09-24         6015
## 75      2000-08-01         1984
## 76      2001-07-15         2212
## 77      2007-10-02         1034
## 78      1988-10-03         2406
## 79      1978-06-28         3934
## 80      1994-06-03        12201
## 81      1995-09-19       130620
## 82      2011-02-12         2956
## 83      2002-11-28         4746
## 84      2009-07-08        15203
## 85      1986-10-06        14700
## 86      2005-11-08         5073
## 87      2002-05-31         2495
## 88            <NA>         1484
## 89      1982-10-02        61511
## 90      1998-12-22         7880
## 91      2010-12-02         8503
## 92      1989-12-26        16660
## 93      1977-05-31          943
## 94      1997-01-28        14829
## 95      2008-08-01         1947
## 96      1999-04-21        22136
## 97      2012-11-22          826
## 98      2009-09-10         3185
## 99      2008-02-12         3044
## 100     1996-02-27        23438
## 101     2009-01-05         1460

Oil_fields is an object of a type called data frame. We can think of this as R’s representation of a typical spread-sheet (like Excel) with columns of different categories with different types of data (integer, real numbers (float), and text). Tibble is a special type of data.frame that comes from tidyverse. It has some minor advantages over the built-in r data frame format. But In large part they function similarly. You can read more about tibbles here.

If we want to look at the last few rows of our data frame, we can use the command tail:

tail(oil_fields, n=10)

If we want to get an overview of our data frame, we can use the str function.

str(oil_fields)

If we want to access an individual variable, we can use the $ character:

head(oil_fields$recoverable_oil)

Using the class command, we can see what format a certain variable is in (which we could also see from str for all the variables):

class(oil_fields$producing_from)

This is one of the nice things about the read_csv command from tidyverse, here we see that it automatically recognized the data as a date and converted to date format. Date’s can sometimes be tricky to work with in R. Here is an overview.

Now we can start working with our data.

Lets make a plot of the total amount of recoverable oil in a field and the date it started producing from:

We’ll use the function, ggplot

oil_fields %>% ggplot(aes(x=producing_from, y=recoverable_oil)) +
geom_point()

Let’s go carefully through this command.

First, within the ggplot command I specify which dataset (data frame) I want to use. In this case it is oil_fields.

Within ggplot there is a sub-field calles aes, which stands for aesthetics. Basically, this specifies how we are to translate the numerical data in our data set into a visual format. Here we say that we will represent producing_from as horizontal distance (x-axis) and recoverable_oil as vertical distance (y-axis). As we’ll see, there are other ways we can represent data as well, such as color, and marker size.

When this data is set, we want to add a layer. So we type + and then geom_point. This says we want to represent the data in the form of points. There are many different geoms to choose from, and often you will choose several. You can read more here or from the data-visualisation chapter of R for data science..

Exercises:

  • What does this figure say about the distribution of oil finds through time?
  • What we see on the graph is something that is called creaming in the industry. Can you explain this?

We can tidy up our chart a little by making our labels a bit more understandable.

ggplot(oil_fields, aes(x=producing_from, y=recoverable_oil)) +
geom_point() +
labs(x="", y="Recoverable Oil, Million SM3")

Now we state the variable and unit on the y-axis. SM3 stands for standard cubic meter at 15 degrees celsius and under normal atmospheric pressure.

We can also look at a histogram to more clearly see the distribution of finds

ggplot(oil_fields, aes(x=recoverable_oil)) +
geom_histogram()

Here we see clearly that most finds tend to be small, but we have a “fat tail”, with a few super large fields. The outliers you see here are the Statfjord and Ekofisk fields, which were also the first two fields to be found.

Now lets say we want to create a new variable: extracted, which is the difference between the total amount of recoverable oil in each field and the remaining oil in the field (this data is a few years old so the amounts have changed a bit.)

oil_fields$extracted=oil_fields$recoverable_oil-oil_fields$remaining_oil

Now lets create another variable which gives us amount of investment per extracted SM3 of oil:

oil_fields$invest_per_millsm3 = oil_fields$total.invest/oil_fields$extracted
#or
#oil_fields["invest_per_millsm3"]=oil_fields$total.invest/oil_fields$extracted

In the field above, notice that I have written two ways of creating a new variable (corresponding to two ways of accessing a certain variable/column). You can either use the $ marker or use block quotes with the column name in quotes: [“var name”]. The latter method works for any type of variable name, where the former only works if the variable is one word.

Lets try a plot

ggplot(oil_fields, aes(x=recoverable_oil, y=invest_per_millsm3)) +
  geom_point()

What does this plot tell us? Are there problems with the variables we have chosen to plot and the way we have chosen to plot them?

Lets try plotting investment per extracted SM3 of oil against the start of production date.

ggplot(oil_fields, aes(x=producing_from, y=log(invest_per_millsm3))) +
  geom_point()

Notice here that I transform the y-variable to log. This makes the data a bit more linear and easier to interpret. Basically you can judge the slope of a line as a rate of growth.

What we notice here is that investment per production is lowest in the oldest fields. This has a few different causes.

The most important is that investments (and costs) are front-loaded. A large portion of the costs of extracting oil and gas off-shore is in the initial build-out of the field and related infrastructure. The actual running costs once the field is up-and-running is relatively low. Thus, we see, that costs per extracted oil are exponentially higher for newer fields.

Neutrality of Norwegian Oil and Gas

This fact is very important in understanding Norwegian taxation policy. The marginal taxation rate on profits from oil and gas extraction is very high: nearly 80 percent: 22% corporation tax + 56% resource tax (plus other environmental fees, auction rights, etc).

However, petroleum extraction companies are given special taxation rules aimed at making investment in the petroleum industry “neutral.” That means, taxation is designed such that investments that would have been undertaken before tax, would also be taken pre-tax. A few hallmarks of the “neutral” tax system include:

  • Only the companies net profit (after costs, fees, etc) is taxable
  • Losses may be carried forward with interest
  • A company that has no taxable income, can carry forward losses to subsequent years.
  • Companies can also apply for a refund of the tax value of exploration costs.
  • Investments are subject to accelerated depreciation (“straight line” depreciation over 6 years.)

You can read more about taxation from the Norwegian Petroleum Institute

Exercises:

  • What does the special tax treatment imply about the initial investment costs of Norwegian offshore oil and gas investments?

  • Who is in effect paying for most of the upfront investment costs? Is this warranted/problematic?

  • If we know that there was a high risk that oil and gas production would not be profitable (or allowed) in the near future (10-30 years), would the tax system be neutral?

Large oil and gas fields

Lets take a look at the largest of the oil fields, the ones that have more than 50 million SM3 of recoverable reserves.

We use the function filter from tidyverse

big_oil_fields = oil_fields %>% filter(recoverable_oil>50)

#or
#big_oil_fields = oil_fields[oil_fields$recoverable_oil>50, ]

The filter command says take the data frame oil_fields and take only the rows where the variable recoverable_oil is more than 50.

There is a similar way to do this in base r, where you use the block quotes. Here the format is mydataframe[rows, columns], thus the code above says to include all rows where the given variable is above 50, and by leaving the columns portion blank, we tell r to include all columns.

Now I can get the names of all the big oil fields:

big_oil_fields$name
##  [1] "ÅSGARD"       "BALDER"       "BRAGE"        "DRAUGEN"      "EKOFISK"     
##  [6] "ELDFISK"      "GRANE"        "GULLFAKS"     "GULLFAKS SØR" "HEIDRUN"     
## [11] "NORNE"        "OSEBERG"      "OSEBERG SØR"  "SNORRE"       "STATFJORD"   
## [16] "TORDIS"       "TROLL"        "ULA"          "VALHALL"      "VESLEFRIKK"  
## [21] "VIGDIS"

But this is in alphabetical order, which isn’t helpful. Let’s rearrange by total recoverable oil. We will use the arrange function from tidyverse

big_oil_fields = arrange(big_oil_fields, desc(recoverable_oil))
big_oil_fields
##    ...1         name      lon      lat recoverable_oil remaining_oil
## 1  1220    STATFJORD 1.866338 61.26222           570.4           4.3
## 2   257      EKOFISK 3.218287 56.54663           569.2         129.8
## 3   971      OSEBERG 2.722425 60.54786           384.6          22.7
## 4   524     GULLFAKS 2.130364 61.20896           365.5          11.6
## 5  1432        TROLL 3.552057 60.82293           263.8          36.0
## 6  1194       SNORRE 2.193464 61.49258           249.9          64.1
## 7   620      HEIDRUN 7.306980 65.32944           182.1          40.1
## 8  1553      VALHALL 3.393803 56.27577           147.4          41.5
## 9   190      DRAUGEN 7.756043 64.33243           146.7          14.9
## 10  301      ELDFISK 3.242870 56.39870           137.9          37.3
## 11  504        GRANE 2.506074 59.18574           124.6          36.1
## 12   64       ÅSGARD 6.728537 65.11635           100.4          18.6
## 13  917        NORNE 8.147307 66.03421            90.8           3.8
## 14 1491          ULA 2.860582 57.10273            87.9          15.7
## 15   85       BALDER 2.446895 59.30631            72.1          16.2
## 16 1684       VIGDIS 2.147171 61.40055            66.6          15.0
## 17 1409       TORDIS 2.117935 61.28220            61.2           6.0
## 18 1020  OSEBERG SØR 2.760004 60.25226            61.0          17.2
## 19  121        BRAGE 3.014216 60.58058            59.3           4.0
## 20  557 GULLFAKS SØR 2.215286 61.10673            58.8          16.6
## 21 1616   VESLEFRIKK 2.891658 60.78827            54.1           1.8
##    producing_from total.invest
## 1      1979-11-24        65264
## 2      1971-06-15        94172
## 3      1988-12-01        66622
## 4      1986-12-22        74940
## 5      1995-09-19       130620
## 6      1992-08-03        65475
## 7      1995-10-18        52935
## 8      1982-10-02        61511
## 9      1993-10-19        23291
## 10     1979-08-08        34963
## 11     2003-09-23        19746
## 12     1999-05-19        65301
## 13     1997-11-06        23806
## 14     1986-10-06        14700
## 15     1999-10-02        23198
## 16     1997-01-28        14829
## 17     1994-06-03        12201
## 18     2000-02-05        20108
## 19     1993-09-23        17062
## 20     1998-10-10        28340
## 21     1989-12-26        16660
ggplot(big_oil_fields, aes(x=reorder(name, -recoverable_oil), y=recoverable_oil)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x="", y="Recoverable Oil, SM3")

Notice a few things.

First I have used the reorder command within ggplot to tell ggplot to order from highest to lowest. Otherwise it would revert to alphabetical (try it!).

Second, I use the theme() function to rotate the text on the x-axis so it would be readable. theme() lets you change all sorts of details about your graph if you wish. I always end up having to search for the correct parameters to use, but the answer is usually not hard to find.

Now something a little bit more advanced

In this section I want to show some of the things you can do with R, but which would be hard or impossible with other software like Excel, Stata, etc. I won’t explain all the code, but feel free to go through carefully and see if you can understand the commands.

In our data set we have latitude and longitude of all our oil fields, so we can go ahead and try to make a map of our fields.

First, we will install a package that easily allows us to create maps, called ggmaps

We install the package.

install.packages("ggmap") 
#if necessary run: install.packages("devtools")
#library(devtools)
#devtools::install_github("dkahle/ggmap")

Then we also need to load the package.

library(ggmap)
## ℹ Google's Terms of Service: <]8;;https://mapsplatform.google.comhttps://mapsplatform.google.com]8;;>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.

ggmap uses map data from stamen maps. The command get_stamenmap() imports a map of a certain area.

northseaCoord= c(left = -5, bottom = 55, right = 10, top = 67)
northseaMap = get_stamenmap(northseaCoord, zoom = 5, maptype = "toner-lite")

Then we can combine our map with the data in our data frame:

ggmap(northseaMap) +
geom_point(aes(x = lon, y = lat, size=recoverable_oil, color=extracted),alpha=.7, data = oil_fields)+
scale_color_continuous(low="red", high="black")

Assignments and R markdown

In this course you will be regularily working with assignments as well as a term project. I want you to write these in R markdown format. This is a format that allows you to combine text, equations, code, and results all in one document. In fact, it is R markdown that I used to write these labs.

You can read a bit more about R Markdown here.

And download a cheat sheet here

You already have all the tools you need to get started with R-markdown. In R-studio, go to File -> New File -> R Markdown….

In the pop-up window you can give it an informative title

Choose pdf as desired output, since you will be handing in the output

Creating markdown is pretty straight-forward:

print("here is some code")
## [1] "here is some code"

There are many possible workflows for doing an analysis in R. But I have a preference for doing an analysis within a markdown text. I run chunks of code, then make notes about the results within the same document.

To render the document as a pdf click on the “knit” button in r-studio (or go to file -> Knit Document).

R-studio requires a version of tex/latex (a technical document rendering program) to create the pdf. If you don’t have it on your computer, the easiest way to install a minimal tex version is to run the following commands in r:

install.packages('tinytex')
tinytex::install_tinytex()

Assignment 1

  1. Create a variable invest_per_rec which is investment per million sm3 in recoverable oil (recoverable_oil). Plot these variables against producing_from variable and the original recoverable_oil variable. How do you interpret the results?

  2. Create a list of the 5 “cheapest” oil fields, that is where the investment is lowest per recoverable oil. What do these tend to have in common?

  3. I have a hypothesis that oil fields farther north are more expensive to exploit. Explore this hypothesis. Do you think it has merit?

  4. Open-ended question: Accessing and importing data

  1. Go to the data portion of the Norwegian Petrioleum Directorate

  2. The tabs at the top indicate the different types of data that is available by level/theme. Try to find some interesting dataset and download it as a .csv file. (Hint, on the left-hand panel, go down to “table view”, then you get a table of data, which you can export by clicking on “Export CSV”).

  3. Once you have downloaded the data, import the data into r using the read_csv() command.

  4. If there is a date variable, format that variable as a date (if read_csv() hasn’t automatically done so already)

  5. Plot the data in a meaningful way. Interpret the plot. Is there anything puzzling about the data.

(This question is vague and open-ended. I know, that is the point. It reflects the real-life messiness of starting an analysis with a blank page. Don’t get too frustrated, and try to have some fun.)

p.s. The field or wellbore tabs might be the most interesting tabs to explore. For the data sets in the field tab, the date is provided as separate columns for month and year. You probably want to combine these into a single date variable. Here is some code to help (I told you working with dates can be tricky):

#Install package lubridate if you need to: 
#install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
#DF: the name of my dataframe
#month: name of column with month data
#year: name of column with year data

#create a data frame with fake data
DF = tibble(
  "month" = 1:6,
  "year" = 2012:2017
)

#convert to strings

DF["month"] = as.character(DF$month)
DF["year"] = as.character(DF$year)

DF$date = str_c("1", DF$month, DF$year, sep="-") %>% dmy() %>% as.Date()
#DF
DF
## # A tibble: 6 × 3
##   month year  date      
##   <chr> <chr> <date>    
## 1 1     2012  2012-01-01
## 2 2     2013  2013-02-01
## 3 3     2014  2014-03-01
## 4 4     2015  2015-04-01
## 5 5     2016  2016-05-01
## 6 6     2017  2017-06-01

Solutions

1.)

From our measure of cost per recoverable oil, it looks like costs have increased over time. Cost escalation was a central problem in the mini-crisis that happened after oil prices collapsed 2013-2015. Oil companies had to work hard to bring down their costs and become more disciplined.

Our measure of costs however is not without problems or bias. Older fields tend to be larger and will naturally have some economies of scale to extraction. We can see this in the plot below of costs per recoverable oil to total amount of recoverable oil.

plot(oil_fields$recoverable_oil, oil_fields$invest_per_rec)

2.) Now we will sort by the five cheapest oil fields using the arrange() command and list the five cheapest

oil_fields = oil_fields %>% arrange(invest_per_rec)

print(oil_fields[1:5,])
##   ...1          name      lon      lat recoverable_oil remaining_oil
## 1  118         BØYLA 1.848627 59.32033             3.4           3.4
## 2  752         KNARR 2.781421 61.80114            11.9          11.9
## 3 1081 RINGHORNE ØST 2.494802 59.29083            15.5           6.3
## 4  146      BRYNHILD 2.297317 57.49330             3.6           3.6
## 5  802  MARTIN LINGE 2.054545 60.50824             6.0           6.0
##   producing_from total.invest invest_per_rec
## 1           <NA>           85       25.00000
## 2           <NA>          534       44.87395
## 3     2006-03-19          738       47.61290
## 4           <NA>          210       58.33333
## 5           <NA>          402       67.00000

%One thing that should pop out is that most of these fields were not producing at the time the data was collected. This leads to an artificially low investment cost. In an analysis, you might therefor wish to either remove these data points or at least account for this.

3.)

To look at our hypothesis that the latitude of a field might play a role in expense, we can start with a simple plot

plot(oil_fields$lat, oil_fields$invest_per_rec)

There is no clear pattern here, but a more detailed analysis should probably be used here. A multivariate regression might, for example, be able to isolate any effect that latitude has.

4.) Obviously, answers to this question will vary. But here is some simple cleaning of exploration/drilling data, followed by graphing the exploration by the operator:

library(lubridate)

exploration = read_csv("https://jmaurit.github.io/analytics/labs/data/wellbore_exploration_last_10_years.csv")
exploration["wlbCompletionDate"] =as.Date(exploration$wlbCompletionDate, format = "%d.%m.%Y")

exploration["wlbEntryDate"] =as.Date(exploration$wlbEntryDate, format = "%d.%m.%Y")
exploration["year"] = year(exploration$wlbEntryDate)
exploration %>% filter(wlbPurpose=="WILDCAT") %>% 
  ggplot(aes(x=wlbEntryDate, y=wlbDrillingOperator)) +
  geom_point()

One thing to notice is that a few companies consistently drill the most. Statoil/Equinor is of course notable (and notice that the name change hasn’t been corrected in the data.) But also smaller firms like Lundin are consistently active. Lundin is a small/mid-sized firm with its roots in Sweden. It discovered the major Sverdrup field in 2010 - the largest discover in several decades on the Norwegian Continental Shelf.