Probability, bayesian statistics and decision analysis¶

MET 430¶

NTNU Business School¶

Johannes Mauritzen¶

Learning goals¶

  • Understand bayes formula and how to combine prior information into an analysis
  • Understand the key differences between a frequentist and bayesian perspective on uncertainty and inference
  • Understand the practical advantages and disadvantages of bayesian statistics
  • Understand how to create a simple decision analysis model

Literature¶

  • Think Bayes Ch. 1 for a review of probability
  • Think Bayes Ch. 2 for Bayes Theorem
  • Optional Quan Econ: Exchangeability and Bayesian Updating

Bayes Theorem¶

Bayesian statistics and Bayesian analysis has becoming increasingly popular tool for both machine learning and more traditional regression applications. But it is not new. In fact, the mathematics and philosophy of Bayesian analysis goes back 300 years to the minister and amateur mathematician Thomas Bayes, who gives his name to the powerful Bayes' Theorem.

Though 300 years old, Bayes' theorem is still a powerful, if sometimes tricky tool for analysis and even daily decision making.

An application of Bayes' Theorem: When is a positive corona test really positive?¶

Here is a probability problem from recent real life:

Around Christmas 2021 my son and I met my son's cousins together with my brother-in-law. The next day, I get a phone call from my brother-in-law, who tells me that his oldest daughter had awoken with a cough and had tested positive for corona with a home quick-test ("Antigen test"). They would take her to get a more reliable PCR test, but it would take a day or two for the results to come in. If the results were confirmed, then we would be considered to be close contacts, and must spend 7 days in quarantine over Christmas. Bummer.

But I knew that these tests were not perfect, and people regularly test positive who do not have corona and some test negative when they actually do have corona. So, the probability question I then had was, what was the probability that my niece actually had corona given that she had tested positive for Corona with a quick test. Let's write this mathematically:

$$P(Corona=1|Test=1)$$

Above, we have written a conditional probability, which we can read The probability of corona given that the test was positive.

So first, I need to find some basic information. I searched a bit to find studies of how accurate these quick tests are. I found some information on the reliable Cochrane Reports, which is a non-profit organisation with the goal of providing well researched and financially independent medical and health information.

From the Cochrane reports, a study finds that with people who are confirmed to have corona, 72% of tests were positive (given that the person had symptoms). So is this our answer? Not quite, we can write this probability as:

$$P(Test=1|Corona=1)=.72$$

That is, the probability of a test being positive, given that a person has Corona is .72. It is the mirror image of what we want to find out.

But we can still estimate our wished-for probability, but first we need some more information.

We first need information on the general prevalence of corona in the population. The probability that I am looking for is, before the test results, what is the overall chance that my niece had corona, rather than just a common cold, flu or a light cough?

At the time, the number of registered corona cases was relatively low - around 4-5000 registered new cases per day. So it seems fairly conservative to say that at any given time, 1/100 people had the disease in Norway at that time. But perhaps if you consider that there are more children in that period who were getting sick, and the fact that my niece had been showing some symptoms, perhaps the probability should be closer to 1/10. Since we are a bit uncertain, we could then create a range of values and say that the prior probability of corona is between:

$$P(Corona=1) = .01$$

and

$$P(Corona=1)=.1$$

That is, before the information we have on the test, we might guess that her chance of having corona is between 1/100 and 1/10.

A final piece of information we need is the rate of false-positives. That is, the probability of someone that does not have corona still testing positive. This is referred to what is called the "specificity" of the test, which measures the probability of someone who is negative actually testing negative. The data for this seems to indicate a very high specificity - close to a 100 percent for most tests. To be conservative, we can assume a specificity of .99:

$$P(Test=0|Corona=0) = .99$$

Thus, we can find the rate of false positives as:

$$P(Test=1 | Corona=0) = 1-P(Test=0|Corona=0) = .01$$

Thus, the total probability of testing positive is just the weighted sum of the probability of testing positive if a person actually is positive (weighed by the proportion of people who actually have corona), and the probability of testing positive if a person is actually negative (weighted by the proportion of people who do not have corona). So in the scenario where we think the prior probability of having corona is 1/10:

$$P(Test=1) = P(Test=1 | Corona=0)*P(Corona=0) + P(Test=1|Corona=1)*P(Corona=1) = .01*(9/10) + .72*(1/10)$$

We can calculate both high and low scenarios

In [ ]:
PT_low =  .01*(99/100) + .72*(1/100)
PT_high = .01*(9/10) + .72*(1/10)
print(PT_low, PT_high)
0.0171 0.08099999999999999

Inverse probability and Bayes' theorem.¶

We now have enough information to calculate our desired probability: $P(Corona=1|Test=1)$. Bayesian analysis has sometimes been called inverse probability, since we are deducing the inverse of the information we have, $P(Test=1|Corona=1)$. To find the inverse probability, we use the formula from Thomas Bayes, now called Bayes' Formula or Bayes' Theorem:

$$P(A|B) = \frac{P(A)*P(B|A)}{P(B)}$$

Or specifically, in our case:

$$P(Corona=1|Test=1) = \frac{P(Corona=1)*P(Test=1|Corona=1)}{P(Test=1)}$$

We can then create a low and high scenario:

$$P(Corona=1|Test=1) = \frac{.01*.72}{.017}$$$$P(Corona=1|Test=1) = \frac{.1*.72}{.08}$$
In [ ]:
PCgT_low = (.01*.72)/PT_low
PCgT_high = (.1*.72)/PT_high

print(PCgT_low, PCgT_high)
0.42105263157894735 0.888888888888889

So depending on what prior information we include, our probability of actually having corona could be as low as 42% or as high as 88%

Bayesian analysis is just counting¶

One of the key features and advantages of Bayesian analysis is the ability to include prior information, in this case, that was the information on the probability of having Corona before the information from the test. The result, especially under the low scenario may seem surprising. That a positive result may give us less-than a 50% probability that we actually have corona. But we could have come to this probability simply by counting out the possible scenarios.

  • If we think that 1000 people take a test that day and 990/1000 (99%) people don't actually have corona and take the test then, knowing the information about the false positive results then 990*.01 = 9.9 of those people will test positive (falsely).

  • Of the 10/1000 people who have corona, .72*10 = 7.2 will test positive.

  • So out of about 17 positive tests, only 7.2/17 = .42 actually come from someone that has corona.

This shows the power of being able to include prior information into an analysis (the chance of having corona prior to taking the test), as well as the intuitive power of Bayes.

Did my niece have Corona?¶

Yes, the PCR test came out positive and I spent 7 days in quarantine, but I did not get Corona (then).

Diachronic Bayes and bayesian statistics¶

One way we could interpret the scenario above, is that we start with our initial information: that my niece wakes up with a cough, and an initial probability of between 1/10 and 1/100 that she has Corona.

We then get in some new data: the test results, which will update our probability, which we then can refer to as our posterior probability

We could continue in this way. In our discussion above, my niece took a PCR test, and of course this also is subject to uncertainty. So we could have treated our previous posterior probability as our new prior information, and then updated our probability again with the PCR test.

The Monty Hall problem¶

We'll consider one more simple, yet devilishly unintuitive problem called the Monty Hall problem. Here is a description of the problem from Downey:

The Monty Hall problem is based on a game show called Let's Make a Deal. If you are a contestant on the show, here's how the game works: The host, Monty Hall, shows you three closed doors -- numbered 1, 2, and 3 -- and tells you that there is a prize behind each door. One prize is valuable (traditionally a car), the other two are less valuable (traditionally goats). The object of the game is to guess which door has the car. If you guess right, you get to keep the car. Suppose you pick Door 1. Before opening the door you chose, Monty opens Door 3 and reveals a goat. Then Monty offers you the option to stick with your original choice or switch to the remaining unopened door.

To maximize your chance of winning the car, should you stick with Door 1 or switch to Door 2?

To answer this question, we have to make some assumptions about the behavior of the host:

  1. Monty always opens a door and offers you the option to switch.

  2. He never opens the door you picked or the door with the car.

  3. If you choose the door with the car, he chooses one of the other doors at random.

The intuitive answer for many people is to say that you still have a 50/50 chance and there is no advantage to switching. But this is wrong! Enumerating the possibilities, and making sure to include the information that the host uses in choosing a door to open will convince you that you should switch!

You could try a decision tree to convince yourself of this, or, as below we will follow downey in creating a Bayes table.

Below we create a table in the form of a Pandas data frame:

In [ ]:
import pandas as pd
from fractions import Fraction 
#so that we can use actual fractions (1/3) rather than floating point approximations (.33...)

We start by looking at our three options and the prior probability that the prize is in each door.

In [ ]:
MHtable = pd.DataFrame(index=['Door 1', 'Door 2', 'Door 3'])
MHtable['prior'] = Fraction(1, 3)
MHtable
Out[ ]:
prior
Door 1 1/3
Door 2 1/3
Door 3 1/3

Now we get the new "data", which is that monty opens door 3 to reveal a goat.

So now we consider the likelihood of our scenario that is: the probability of the data given our hypothesis:

  • If the car is behind Door 1 (the door you picked), Monty chooses Door 2 or 3 at random, so the probability he opens Door 3 is 1/2 .
  • If the car is behind Door 2, Monty has to open Door 3 (because you picked door 1, and 2 has the car), so the probability of the data under this hypothesis is 1.
  • If the car is behind Door 3, Monty does not open it, so the probability of the data under this hypothesis is 0 (since monty did in fact open door 3).
In [ ]:
MHtable['likelihood'] = Fraction(1, 2), 1, 0
MHtable
Out[ ]:
prior likelihood
Door 1 1/3 1/2
Door 2 1/3 1
Door 3 1/3 0

So now the last step is to update our Bayes table according to Bayes rule:

$$P(A|B) = \frac{P(A)*P(B|A)}{P(B)}$$

So we could write the probability that the car is behind door 1 given that door 3 is opened as:

$$P(Door1|Door3opened) = \frac{P(Door1)*P(Door3|Door1)}{P(Door3)}$$$$P(Door1|Door3) = \frac{(1/3)*(1/2)}{(1/3)*(1/2) + (1/3)*1} = 1/3$$$$P(Door1|Door3) = \frac{(1/3)*(1/2)}{(1/3)*(1/2) + (1/3)*1} = 1/3$$

The following function will do the Bayesian updating in our table automatically

In [ ]:
def update(table):
    """Compute the posterior probabilities."""
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return prob_data
In [ ]:
update(MHtable)
MHtable
Out[ ]:
prior likelihood unnorm posterior
Door 1 1/3 1/2 1/6 1/3
Door 2 1/3 1 1/3 2/3
Door 3 1/3 0 0 0

Bayesian analysis and statistical analysis¶

Bayesian statistics and Bayesian regression is, in the simplist sense, the further application of the probabilistic thinking and enumeration we went through in the above examples and applying them to statistical analysis and regression.

This means:

  • We can include prior information in our estimation

  • We treat parameters as uncertain, which in most modern applications we estimate through simulation

  • We can easily update our estimates when new data become available

  • We can propogate uncertainty to derivative values - like predictions or, for example, measures of costs or benefits.

Compare this to traditional frequentist methodology, which is what is mostly taught in introduction courses

  • It is not possible to make use of outside, or prior information - all relevant information is assumed to be in the data

  • Estimated coefficients are assumed to be fixed - these are treated simply as calculated values from the sample. They are compared to a null hypothesis and/or null model , from which we judge whether these values are "statistically significant".

  • It can be difficult to calculate uncertainty in derivative values, like predictions, and we must often use forms of simulation here as well, such as bootstrapping.

Other sources for learning Bayesian analysis¶

Bayesian analysis and statistics is a big and growing field. Several excellent introductory texts exists to learn more for those interested:

  • Downey, Think Bayes (python)
  • Kruschke, Doing Bayesian Analysis (R, Stan)
  • McElreath, Statistical Rethinking (R, Stan)
  • Gelman et al, Bayesian Data Analysis (Technical, R, Stan)

Propogation of uncertainy and decision analysis¶

One important application of Bayesian analysis is within what is called Decision Analysis. The basic idea of decision analysis is to propogate the uncertainty from an estimation (like a regression) into some meaningful function that affects a decision, strategy or operation.

Here we will show an example using electricity data and a made-up decision on whether an electricity monopolist should invest in more wind power.

We won't use actual Bayesian estimation, instead relying on OLS. But we will make use of the simulated posterior uncertainty.

The data is real, but the model is very simplified.

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as spt

import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns

from cycler import cycler

plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams["axes.labelsize"]= 12
plt.rcParams["figure.facecolor"] = "#f2f2f2"
#plt.rcParams['figure.savefig.dpi'] = 100
plt.rcParams['savefig.edgecolor'] = "#f2f2f2"
plt.rcParams['savefig.facecolor'] ="#f2f2f2"
plt.rcParams["figure.figsize"] = [16,10]
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['font.size'] = 14
greens = ['#66c2a4','#41ae76','#238b45','#006d2c','#00441b']
multi =['#66c2a4','#1f78b4','#a6cee3','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f']
plt.rcParams["axes.prop_cycle"] = cycler(color=multi)

we'll load in data on the Nordic electricity market at an hourly frequency

In [ ]:
elDF = pd.read_csv("http://jmaurit.github.io/analytics/labs/data/wt_data2.csv")
In [ ]:
elDF
Out[ ]:
date time hour wind_SE1 wind_SE2 wind_SE3 wind_SE4 wind_DK1 wind_DK2 SE_nx ... PLAtoPL_cap PLtoPLA_cap SE4toLT_cap LTtoSE4_cap SYSEurMW SE4EurMW DK1EurMW DK2EurMW hour_ind month
0 2016-01-01 00:00:00 2016-01-01 00:00:00 00 - 01 420.0 1247.0 432.0 208.0 1314.0 113.0 -2621.0 ... 0.0 3600.0 0.0 0.0 1639.0 1639.0 1639.0 1639.0 0 m1
1 2016-01-01 00:00:00 2016-01-01 01:00:00 01 - 02 416.0 1214.0 419.0 187.0 1267.0 98.0 -2274.0 ... 0.0 3900.0 0.0 0.0 1604.0 1604.0 1604.0 1604.0 1 m1
2 2016-01-01 00:00:00 2016-01-01 02:00:00 02 - 03 417.0 1227.0 378.0 177.0 1159.0 74.0 -2370.0 ... 0.0 3900.0 0.0 0.0 1574.0 1574.0 1574.0 1574.0 2 m1
3 2016-01-01 00:00:00 2016-01-01 03:00:00 03 - 04 424.0 1232.0 357.0 173.0 1160.0 61.0 -2541.0 ... 0.0 4000.0 0.0 0.0 1557.0 1557.0 1557.0 1557.0 3 m1
4 2016-01-01 00:00:00 2016-01-01 04:00:00 04 - 05 412.0 1245.0 326.0 161.0 1069.0 47.0 -2616.0 ... 0.0 4100.0 0.0 0.0 1547.0 1547.0 1547.0 1547.0 4 m1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26299 2018-12-31 00:00:00 2018-12-31 19:00:00 19 - 20 NaN NaN NaN NaN 3335.0 800.0 -5452.0 ... 2019.0 1011.0 700.0 281.0 4883.0 4751.0 4751.0 4751.0 19 m12
26300 2018-12-31 00:00:00 2018-12-31 20:00:00 20 - 21 NaN NaN NaN NaN 3147.0 865.0 -4919.0 ... 976.0 1551.0 700.0 220.0 4723.0 4582.0 4582.0 4582.0 20 m12
26301 2018-12-31 00:00:00 2018-12-31 21:00:00 21 - 22 NaN NaN NaN NaN 2859.0 902.0 -4493.0 ... 208.0 2229.0 700.0 204.0 4602.0 4303.0 4303.0 4303.0 21 m12
26302 2018-12-31 00:00:00 2018-12-31 22:00:00 22 - 23 NaN NaN NaN NaN 2708.0 792.0 -4165.0 ... 0.0 2800.0 700.0 148.0 4555.0 3854.0 3854.0 3854.0 22 m12
26303 2018-12-31 00:00:00 2018-12-31 23:00:00 23 - 00 NaN NaN NaN NaN 2719.0 785.0 -3915.0 ... 0.0 3127.0 700.0 119.0 4269.0 2570.0 2570.0 2570.0 23 m12

26304 rows × 90 columns

Let's consider the simplist of regressions between the price of wholesale electricity in one of the Danish price areas (DK1) and the amount of wind power production.

(By the way, I would not recommend the below as a good regression model for this data.

In [ ]:
mod1wind = smf.ols("DK1EurMW ~ wind_DK1", data=elDF).fit()
In [ ]:
mod1wind.summary()
Out[ ]:
OLS Regression Results
Dep. Variable: DK1EurMW R-squared: 0.101
Model: OLS Adj. R-squared: 0.101
Method: Least Squares F-statistic: 2949.
Date: Thu, 11 May 2023 Prob (F-statistic): 0.00
Time: 13:52:57 Log-Likelihood: -2.2683e+05
No. Observations: 26301 AIC: 4.537e+05
Df Residuals: 26299 BIC: 4.537e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3958.6249 13.804 286.771 0.000 3931.568 3985.682
wind_DK1 -0.5077 0.009 -54.305 0.000 -0.526 -0.489
Omnibus: 2763.096 Durbin-Watson: 0.096
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6128.193
Skew: 0.652 Prob(JB): 0.00
Kurtosis: 4.973 Cond. No. 2.45e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

So from the the regression model we see a negative coefficient on the wind power variable. On average, an extra megawatt-hour (MWH) (a standard unit of electricity production) of wind power wil lower prices by .51 Euro/MWH.

Defining the business problem¶

Let us say that a Danish monopolist electricity utility is deciding whether to invest in a new large off-shore wind park.

  • The size of the wind park is 400 megawatts
  • The average capacity factor is 50%. Thus in a typical hour, the park would be producing 400*.5 = 200 MWH of electricity.
  • The investment cost of the park is 300,000,000 (300 million euro)
  • We assume no running costs (marginal costs) for the wind power plant
  • The utility currently operates wind power of 500 megawatts (with the same capacity factor as above, and with no operating costs).
  • The utility also owns 1000 MW of thermal capacity (gas, coal, nuclear), which we assume has a 100% capacity factor
  • We consider a period of 1000 hours, without any NPV discount

We want to run an analysis to determine whether the monopoly should make the investment (we assume they are profit maximizing).

(all these numbers are completly unrealistic)

Posterior simulations from the regression model¶

We'll start by simulating from the "posterior" of our regression model (posterior is in quotes, because formally when we refer to the posterior of a model, then we should have a bayesian model, which we don't.)

What the below block of code does is take the point estimates and variance-covariance matrix from the regression, and produce a simulation of a posterior distribution assuming a normal distribution. You can look through the code, but don't worry if you don't understand all the steps.

In [ ]:
def regSim(regMod): 
    #extract values from regression model
    nmk = regMod.df_resid #n-k
    sigma_hat = np.sqrt(regMod.mse_resid)
    bs_vcov = regMod.cov_params(scale=1)
    bs = regMod.params
    
    #create simulated values
    sigma_sim = sigma_hat*np.sqrt((nmk/np.random.chisquare(nmk,1)))
    V_sim = np.array(bs_vcov) * sigma_sim**2
    bs_sim = np.random.multivariate_normal(bs, V_sim, 1)
    
    return([bs_sim.flatten(), sigma_sim])

Now we simulate parameters from our regression models 1000 times, also similar to what we have done in earlier labs

In [ ]:
sigma_hat = np.sqrt(mod1wind.mse_resid) #estimate of sigma, sigma_hat

nsim = 1000

bs_sims = []
sigma_sims = []

for s in range(nsim):
    param_sim = regSim(mod1wind)
    bs_sims.append(param_sim[0])
    sigma_sims.append(param_sim[1])
In [ ]:
simsDF = pd.DataFrame(bs_sims)
In [ ]:
simsDF
Out[ ]:
0 1
0 3971.254744 -0.514325
1 3970.117303 -0.506004
2 3956.200387 -0.510918
3 3950.656899 -0.505535
4 3941.609104 -0.493350
... ... ...
995 3975.594169 -0.514547
996 3951.050708 -0.492996
997 3965.325493 -0.505766
998 3966.999980 -0.512246
999 3962.317763 -0.510511

1000 rows × 2 columns

We can visualise the predictive distribution of our model as we have also done previously

In [ ]:
x_line = np.arange(0,4000,1)

fig, ax = plt.subplots()
elDF.plot.scatter(x="wind_DK1", y="DK1EurMW", ax=ax, alpha=.2)
for i, row in simsDF.iloc[:100,:].iterrows():
    y_line = row[0] + row[1]*x_line + np.random.normal(0,sigma_hat)
    ax.plot(x_line, y_line, color="grey", alpha=.5)

Now we'll use our simulations to create distributions for our expected prices with and without the investment.

  • For the no-investment scenario, we will fix the wind power variable at its mean value for simplicity.

  • For the investment scenario, we add an expected .5*400=200 MWH of wind production

In [ ]:
meanWind = np.mean(elDF.wind_DK1)
meanWind #mean amount of wind 

newWind = .5*400
expectedWind = meanWind + newWind
In [ ]:
simsDF["simPrices"] = simsDF.iloc[:,0] + simsDF.iloc[:,1]*meanWind + np.random.normal(0,sigma_hat, nsim)
simsDF["simPricesHighWind"] =  simsDF.iloc[:,0] + simsDF.iloc[:,1]*expectedWind + np.random.normal(0,sigma_hat, nsim)

We can now look at the histogram of the expected prices for the two scenarios

In [ ]:
fig, ax = plt.subplots()
simsDF.simPrices.hist(bins=100, ax=ax, label="No investment", alpha=.5)
simsDF.simPricesHighWind.hist(bins=100, ax=ax, label="investment", alpha=.5)
ax.set_xlabel("Predicted prices")
ax.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x7fc0c2125e50>

Profit function¶

Now we define our profit function, which will take the simulated prices and output an expected profit for the two scenarios.

First we define some parameters

In [ ]:
InvestmentCost = 300000000 #investment cost

ExistingWindCap = 500 #How much existing wind capacity the company owns
NewWind = 400 #Size in capacity of new investment
cfWind = .5 #capacity factor of wind power
#no running/marginal cost for wind

thermalCap = 1000 #How much thermal capacity the company owns
mcThermal = 2000 #The marginal cost of the thermal capacity owned by the company
n = 1000 #1000 periods, relevant economic period to evaluate profitability. 

#only run thermal if price>mc

Now we are ready to create a profit function.

  • The function takes as input a row of the simulation data frame.

  • We calculate a variable for operating profit for both scenarios (operProfit, operProfitHighWind). The operating profit is market price times the expected amount of wind power produced. In addition, if the price is above the marginal cost of the thermal generation, then the company also gets operating profit from the difference between the market price and the marginal cost.

  • The total profit for the no-invesment scenario is simply a multiple of the operating profit times the number of periods that we consider (n=1000).

  • For the investment scenario, we must subtract the Investment cost variable.

  • The function returns a row with two new columns representing the expected profit for both scenarios

In [ ]:
def profitFunction(row):
    #calculate profit without new investment
    operProfit = row.simPrices*ExistingWindCap*cfWind
    if row.simPrices > mcThermal:
        operProfit += (row.simPrices-mcThermal)*thermalCap
    row["profit"] = operProfit*n #multiply predicted operating profit by number of periods
    
    #calculate  profit with new investment
    operProfitHighWind = row.simPricesHighWind*(ExistingWindCap+newWind)*cfWind
    if row.simPricesHighWind > mcThermal:
        operProfitHighWind += (row.simPricesHighWind-mcThermal)*thermalCap
    row["profitHighWind"] = operProfitHighWind*n - InvestmentCost
    return(row)
In [ ]:
newSimsDF = simsDF.apply(profitFunction, axis=1)
In [ ]:
newSimsDF
Out[ ]:
0 1 simPrices simPricesHighWind profit profitHighWind
0 3952.788759 -0.493896 5012.876829 3142.103835 4.266096e+09 1.941840e+09
1 3957.635557 -0.503402 3282.681386 712.747128 2.103352e+09 -5.053851e+07
2 3957.711510 -0.503402 1146.672129 2941.187571 2.866680e+08 1.670603e+09
3 3933.209222 -0.491248 3878.110936 2215.606619 2.847639e+09 6.910689e+08
4 3969.419491 -0.504066 1409.517923 4814.874590 3.523795e+08 4.200081e+09
... ... ... ... ... ... ...
995 3956.120951 -0.507384 3659.349398 3047.835291 2.574187e+09 1.814578e+09
996 3966.880672 -0.515162 4529.915873 3013.219574 3.662395e+09 1.767846e+09
997 3989.846916 -0.523470 3797.144854 3008.258164 2.746431e+09 1.761149e+09
998 3979.523098 -0.521184 3466.670350 3837.460831 2.333338e+09 2.880572e+09
999 3959.795794 -0.511533 2441.090415 2737.520674 1.051363e+09 1.395653e+09

1000 rows × 6 columns

We can now look at the distributions of potential profit scenarios

In [ ]:
fig, ax = plt.subplots()
newSimsDF.profit.hist(bins=100, ax=ax, label="No investment", alpha=.5)
newSimsDF.profitHighWind.hist(bins=100, ax=ax, label="investment",alpha=.5)
ax.set_xlabel("Predicted profit")
ax.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x7fc0c1c27e50>

Decision rule¶

Now that we have a distribution of possible profits for the two scenarios, we need to devise a criteria for making our decision.

A straight-forward criteria would be to compare the expectation (aka mean) profit, and see which is higher.

In [ ]:
expectedProfit = newSimsDF.profit.mean()
expectedProfitHighWind = newSimsDF.profitHighWind.mean()
In [ ]:
print("Expected Profit, no investment", expectedProfit)
print("Proift, investment",expectedProfitHighWind)
Expected Profit, no investment 2436272596.907039
Proift, investment 2286089024.467252
In [ ]:
expectedProfitHighWind>expectedProfit
Out[ ]:
False

Under this rule we would not make the investment. But this is by no means the only viable decision rule. We could, for example, also look at which decision will lead to smallest probability of loss.

In [ ]:
nLoss  = np.sum(newSimsDF.profit<0)/nsim
nLossHighWind = np.sum(newSimsDF.profitHighWind<0)/nsim
print("Probability of loss, no investment", nLoss*100)
print("Probability of loss, investment", nLossHighWind*100)
Probability of loss, no investment 0.2
Probability of loss, investment 3.4000000000000004

Here the probability of a loss is higher if we invest

Is the wind park in itself profitable?¶

Now imagine a scenario where an independent power producer wanted to come into the market and invest in the wind park. Then we could analyse the profitability of the wind park itself, without taking into account the effect on the other generators.

We do this below

In [ ]:
windParkNetProfits = newSimsDF.simPricesHighWind*NewWind*cfWind*n-InvestmentCost
fig, ax = plt.subplots()
windParkNetProfits.hist(bins=50, ax=ax)
ax.set_ylabel("Net profit of wind park independently")
Out[ ]:
Text(0, 0.5, 'Net profit of wind park independently')

PYMC (optional)¶

In the above example, we have run a regression using OLS, and then simulated from the results, giving a Bayesian interpretation to a frequentist estimation. This is a limited way of doing a Bayesian-(like) analysis. In order to estimate truly Bayesian models where we can incorporate prior information and propogate uncertainty we need to make use of a simulation methodology called Markov-Chain Monte Carlo (MCMC). The details are beyond the scope of this course. But we call still make use of tools for Bayesian analysis using MCMC in Python.

The main package for Bayesian analysis and regression using MCMC in Python is called PYMC.

A technical warning¶

PyMC, like other MCMC simulation tools, makes use of a C++ interpreter to run code and models. This means there are system requirements to have a C++ interpreter installed on your PC/MAC and correctly set-up. My experience is that this can be a source of many technical problems, which can be difficult to resolve.

We'll go through some installation instructions, but if you can't get PYMC installed and working, I might suggest simply skipping this portion of the lab. There is no expectation to include Bayesian analysis with MCMC simulation in a final project.

Installing PYMC¶

We'll follow the instructions on the PyMC website for installing.

The above assumes you have installed anaconda, which we should have if you have been following the course.

The environment is basically a workspace with different configurations of python. This means that every time you want to use pymc, you first need to activate the environment by writing in conda activate pymc3_env, or whatever you called it in your terminal. And when you want to quit that environment (and go back to your standard environment) you would write in conda deactivate

If you have a mac, you may also need to install Xcode (you can find it in the mac app store). Once you have installed Xcode, you may also need to install Command Line tools. This you can do by the following command in the terminal:

xcode-select --install 
In [ ]:
 

Creating a Bayesian model in PyMC¶

We'll first go through the Bayesian regression model as discussed in this tutorial to get a handle on using PyMC. You should read through the entire documentation.

Before we start, you may need to close your jupyter notebook and terminal. Then do the following steps:

Open a new terminal window

Activate your pymc environment by writing

conda activate pymc_env


You may also need to re-install jupyter notebooks for your pymc environment

conda install jupyter

Navigate to the file where you have your notebook, then open your notebook.

Then we should be ready to go

(Except if you get an error message about not being able to run iPython kernel in the pymc environment. You may then need to run your code directly in the terminal.)

In [ ]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import xarray as xr

from pymc import HalfCauchy, Model, Normal, sample

print(f"Running on PyMC v{pm.__version__}")
In [ ]:
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)

%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")

Toy data generation¶

Below a simple "fake" dataset is generated

In [ ]:
size = 200
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + rng.normal(scale=0.5, size=size)

data = pd.DataFrame(dict(x=x, y=y))

Here is a visualization of the data together with the "true" regression line

In [ ]:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x, y, "x", label="sampled data")
ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
plt.legend(loc=0);

Estimating the model¶

Below is the standard pymc3 setup of a bayesian model.

Priors¶

It starts by defining prior distributions for the model: for the model uncertainty, sigma, the intercept term and the coefficient on the x term. These priors are designed to be fairly "weakly-informative" or "semi-informative", that is to say, we are not imposing much prior knowledge into the distributions, and only imposing some "regularization" - that is automatically discounting outlier data points. The bayesian estimation will probably be similar to maximum likelihood or OLS.

We then define the likelihood of the equation.¶

Here we are writing that we consider the y-variable to be normally distributed with mean values equal to the regression equation:

$$Y \sim N(\alpha + \beta x, \sigma)$$

Notice what this means: we are modelling each Y-value as having its own probability distribution, where with our typical modelling we give each Y-value a single "fitted value".

Sampler¶

Finally, we tell our MCMC algorithm (called a "sampler" in the sense that we are randomly sampling data from the probability model we have defined), to run 3000 simulations. Hopefully the below code will run for you:

In [ ]:
with Model() as model:  # model specifications in PyMC are wrapped in a with-statement
    # Define priors
    sigma = HalfCauchy("sigma", beta=10)
    intercept = Normal("Intercept", 0, sigma=20)
    slope = Normal("slope", 0, sigma=20)

    # Define likelihood
    likelihood = Normal("y", mu=intercept + slope * x, sigma=sigma, observed=y)

    # Inference!
    # draw 3000 posterior samples using NUTS sampling
    idata = sample(3000)

Output from a bayesian model.¶

At a practical level, a big difference between bayesian analysis and the tools we have been using is that we are not doing a form of optimization: that is finding parameters that maximize likelihood (Maximum likelihood) or minimize squared erros (OLS).

Instead, we are setting up a probability model (prior, likelihood/data etc) and then "sampling" from the resulting posterior distribution.

So instead of getting as output single point estimates for our coefficients, we get sampled values for each coefficient. Basically, we get as output distributions for each parameter.

We can use the package arviz (az) to plot one of the distributions of our three parameters (intercept, parameter on our x value and our sigma):

In [ ]:
az.plot_trace(idata, var_names="slope")
Out[ ]:
array([[<AxesSubplot: title={'center': 'slope'}>,
        <AxesSubplot: title={'center': 'slope'}>]], dtype=object)

If we wanted to generate a point estimate, we could, for example take the mean value of our distribution for a given paramaeter - for example our coefficient on our slope variable:

In [ ]:
idata.posterior.slope
Out[ ]:
<xarray.DataArray 'slope' (chain: 4, draw: 3000)>
array([[2.07844938, 2.00946818, 1.98094884, ..., 1.92221807, 2.09151181,
        2.222756  ],
       [1.96990623, 2.05698143, 2.0651126 , ..., 2.09183843, 2.08851394,
        2.1722263 ],
       [2.25853188, 2.16465264, 1.72929357, ..., 1.82192198, 1.97196397,
        2.13588937],
       [2.13479654, 2.02128419, 2.03796066, ..., 2.0216993 , 1.86120875,
        2.24262356]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 2993 2994 2995 2996 2997 2998 2999
xarray.DataArray
'slope'
  • chain: 4
  • draw: 3000
  • 2.078 2.009 1.981 2.147 1.979 1.962 ... 2.083 2.038 2.022 1.861 2.243
    array([[2.07844938, 2.00946818, 1.98094884, ..., 1.92221807, 2.09151181,
            2.222756  ],
           [1.96990623, 2.05698143, 2.0651126 , ..., 2.09183843, 2.08851394,
            2.1722263 ],
           [2.25853188, 2.16465264, 1.72929357, ..., 1.82192198, 1.97196397,
            2.13588937],
           [2.13479654, 2.02128419, 2.03796066, ..., 2.0216993 , 1.86120875,
            2.24262356]])
    • chain
      (chain)
      int64
      0 1 2 3
      array([0, 1, 2, 3])
    • draw
      (draw)
      int64
      0 1 2 3 4 ... 2996 2997 2998 2999
      array([   0,    1,    2, ..., 2997, 2998, 2999])
    • chain
      PandasIndex
      PandasIndex(Int64Index([0, 1, 2, 3], dtype='int64', name='chain'))
    • draw
      PandasIndex
      PandasIndex(Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
                  ...
                  2990, 2991, 2992, 2993, 2994, 2995, 2996, 2997, 2998, 2999],
                 dtype='int64', name='draw', length=3000))
In [ ]:
np.mean(idata.posterior.slope)
Out[ ]:
<xarray.DataArray 'slope' ()>
array(2.05917285)
xarray.DataArray
'slope'
  • 2.059
    array(2.05917285)

      And we could generate a standard error by simply calculating the standard deviation

      In [ ]:
      np.std(idata.posterior.slope)
      
      Out[ ]:
      <xarray.DataArray 'slope' ()>
      array(0.12981255)
      xarray.DataArray
      'slope'
      • 0.1298
        array(0.12981255)

          We can also generate a table of summary statistics

          In [ ]:
          az.summary(idata.posterior, kind="stats")
          
          Out[ ]:
          mean sd hdi_3% hdi_97%
          Intercept 1.046 0.075 0.910 1.187
          slope 2.059 0.130 1.820 2.310
          sigma 0.524 0.026 0.474 0.574

          We can then plot posterior predictive chart

          In [ ]:
          with model:
              pm.sample_posterior_predictive(idata, extend_inferencedata=True, random_seed=rng)
          
          az.plot_ppc(idata, num_pp_samples=100)
          
          Sampling: [y]
          
          100.00% [12000/12000 00:00<00:00]
          Out[ ]:
          <AxesSubplot: xlabel='y / y'>

          The posterior predictive check compares the distribution of the y-variable in the actual data with distributions of the data from the estimated data. If the two do not align well, then we probably have a big problem.

          Bayesian inference: Height and income¶

          Here we will run a regression comparing earnings and height, taken from Regression and Other Stories (ROS), using Bayesian MCMC.

          We start by downloading the dataset from the ROS website

          In [ ]:
          earnings = pd.read_csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Earnings/data/earnings.csv")
          
          earnings
          
          Out[ ]:
          height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age
          0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45
          1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58
          2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29
          3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57
          4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91
          ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
          1811 61 120.0 0 15000.0 15.0 White 18.0 18.0 18.0 6 1 2.0 0.0 0.0 82
          1812 64 130.0 0 8000.0 8.0 White 12.0 12.0 12.0 1 1 1.0 7.0 7.0 33
          1813 72 194.0 1 60000.0 60.0 White 12.0 12.0 12.0 2 1 2.0 0.0 0.0 50
          1814 63 155.0 0 15000.0 15.0 Other 14.0 14.0 14.0 6 1 2.0 2.0 2.0 69
          1815 68 150.0 1 6000.0 6.0 White 12.0 12.0 12.0 1 6 1.0 2.0 2.0 27

          1816 rows × 15 columns

          Let's start by transforming the variables in our analysis to be standardized - we subtract the mean and divide by the standard deviation

          In [ ]:
          earn_s = (earnings.earn - np.mean(earnings.earn))/np.std(earnings.earn)
          height_s =  (earnings.height - np.mean(earnings.height))/np.std(earnings.height)
          male = earnings.male
          

          Weakly informative analysis¶

          We start with an analysis with weakly informative priors

          In [ ]:
          with Model() as earnings_mod:  # model specifications in PyMC are wrapped in a with-statement
              # Define our data
              earnings = pm.MutableData("earnings", earn_s)
              isMale = pm.MutableData("isMale", male)
              height = pm.MutableData("height", height_s)
              # Define priors
              sigma = HalfCauchy("sigma", beta=10)
              intercept = Normal("Intercept", 0, sigma=10)
              beta_height = Normal("beta_height", 0, sigma=10)
              beta_male = Normal("beta_male", 0, sigma=10)
          
              # Define likelihood
              likelihood = Normal("y", mu=intercept + beta_height * height + beta_male*isMale, sigma=sigma, observed=earnings)
          
              # Inference!
              # draw 3000 posterior samples using NUTS sampling
              earnings_trace = sample(3000)
          
          Auto-assigning NUTS sampler...
          Initializing NUTS using jitter+adapt_diag...
          Multiprocess sampling (4 chains in 4 jobs)
          NUTS: [sigma, Intercept, beta_height, beta_male]
          
          100.00% [16000/16000 00:04<00:00 Sampling 4 chains, 0 divergences]
          Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 21 seconds.
          
          In [ ]:
          az.plot_trace(earnings_trace, var_names=["beta_height", "beta_male"])
          
          Out[ ]:
          array([[<AxesSubplot: title={'center': 'beta_height'}>,
                  <AxesSubplot: title={'center': 'beta_height'}>],
                 [<AxesSubplot: title={'center': 'beta_male'}>,
                  <AxesSubplot: title={'center': 'beta_male'}>]], dtype=object)

          Controlling for gender, the mean coefficient value can be estimated to be:

          In [ ]:
          np.mean(earnings_trace.posterior.beta_height)
          
          Out[ ]:
          <xarray.DataArray 'beta_height' ()>
          array(0.10952402)
          xarray.DataArray
          'beta_height'
          • 0.1095
            array(0.10952402)

              .11, which we can interpret to mean that a 1 std change in height leads to a .11 std increase in earnings.

              We can also calculate a 95% credible interval using the hdi (highest density interval) function in arViz:

              In [ ]:
              az.hdi(earnings_trace.posterior.beta_height,hdi_prob=.95)
              
              Out[ ]:
              <xarray.Dataset>
              Dimensions:      (hdi: 2)
              Coordinates:
                * hdi          (hdi) <U6 'lower' 'higher'
              Data variables:
                  beta_height  (hdi) float64 0.04898 0.1704
              xarray.Dataset
                • hdi: 2
                • hdi
                  (hdi)
                  <U6
                  'lower' 'higher'
                  hdi_prob :
                  0.95
                  array(['lower', 'higher'], dtype='<U6')
                • beta_height
                  (hdi)
                  float64
                  0.04898 0.1704
                  array([0.04898333, 0.17040501])
                • hdi
                  PandasIndex
                  PandasIndex(Index(['lower', 'higher'], dtype='object', name='hdi'))

              We can also easily generate a summary table:

              In [ ]:
              az.summary(earnings_trace)
              
              Out[ ]:
              mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
              Intercept -0.176 0.033 -0.237 -0.114 0.000 0.000 7191.0 8287.0 1.0
              beta_height 0.110 0.031 0.051 0.168 0.000 0.000 7159.0 7589.0 1.0
              beta_male 0.473 0.065 0.350 0.595 0.001 0.001 6665.0 7616.0 1.0
              sigma 0.950 0.016 0.921 0.980 0.000 0.000 7901.0 7768.0 1.0

              Where Bayes shines¶

              In practice, for many modelling tasks, Bayesian estimation will give similar results to maximum likelihood or OLS. But there are multiple applications where Bayesian analysis works particularly well and where it would be difficult or impossible to do the same analysis with maximum likelihood or OLS estimation. We go quickly through a few of these applications:

              • Integrate outside information through informative priors

              The most direct, and sometimes controversial, advantage of Bayesian analysis is the ability to integrate outside information through informative priors. As we have seen, it is a misconceptions that you need to have prior information. But if you have information, outside of the relevant dataset you are analysing, that you want to integrate into the analysis, you can do so through an informative prior. ROS discusses some examples of including informative priors in an analysis in 9.4 and 9.5.

              • Regularization and robustness

              One of the poorly understood advantages of Bayesian analysis is to use weakly informative priors in order to regularize and create robust inference. This essentially means that you have priors that do not impose any strong assumptions on the data, but which will automatically discount outlier values and keep inference stable. ROS discusses this in 9.5

              • Decision analysis

              Because of the ease from which we can propogate uncertainty in an estimation or regression, Bayesian statistics are often used in decision analysis. That is, we can take the inference from, for example, a regression and then use that to make some decision. Below I give a simple example with made-up data. A more fully worked example is provided by Thomas Wiecki

              • Hierarchical Models/Multilevel models

              This is one of the main modern applications of Bayesian methods. In fact, you have probably seen a Hierachcial model in practice. The election forecasting models used by the New York Times, The Economist and 538.com (among others) are Bayesian Hierachical models. Here you have multiple sources of information over different geographic and other groupings. You have many surveys of voter intentions - in the case of the US - for different states or even counties. You also have national surveys. How do you combine all that information to make a single forecast for who will win a presidential election? You need a model that takes state-level data for each state and aggregates upward. But you also want to use information from national surveys to inform the state-level inference. You can do this with Bayesian multilevel models. ROS CH. 22 gives an introduction to multilevel models, but we do not cover it further in this course.

              • Time Series Models and Forecasting

              Modern time series analysis are often formulated as "state-space" models - basically meaning that you analyse a time series as a process of transitions from various, unobserved states (like not pandemic to pandemic, or peace in europe to war in europe). The flexibility of Bayesian analysis is useful in estimating the parameters in such models. We discuss time series models at the end of this course, but we do not take an explicitly Bayesian approach.

              • Machine learning and Artificial intelligence applications

              Bayesian methods, and bayesian thinking are common within machine learning and artificial intelligence. You may for example encounter Naive Bayesian Classification or Bayesian Networks. We leave such methods for courses specializing in machine learning and AI.

              Exercises¶

              (You do not need to use PyMC in order to complete the following problems.)

              1 Two coins in a box¶

              Suppose you have two coins in a box. One is a normal coin with heads on one side and tails on the other, and one is a trick coin with heads on both sides. You choose a coin at random and see that one of the sides is heads. What is the probability that you chose the trick coin?

              (From Think Bayes By Allen Downey )

              2 Monty hall again¶

              There are many variations of the Monty Hall problem. For example, suppose Monty always chooses Door 2 if he can, and only chooses Door 3 if he has to (because the car is behind Door 2). If you choose Door 1 and Monty opens Door 2, what is the probability the car is behind Door 3? If you choose Door 1 and Monty opens Door 3, what is the probability the car is behind Door 2?

              (From Think Bayes By Allen Downey)

              3 Decision Analysis:¶

              Simulation for decision analysis: An experiment is performed to measure the efficacy of a television advertising program. The result is an estimate that each minute spent on a national advertising program will increase sales by € 500 000, and this estimate has a standard error of €200 000. Assume the uncertainty in the treatment effect can be approximated by a normal distribution. Suppose ads cost €300 000 per minute. What is the expected net gain for purchasing 20 minutes of ads? What is the probability that the net gain is negative?

              4 Decision Analysis Free Problem:¶

              Based on a regression you have completed earlier or a new regression, construct a decision analysis problem and find a solution.

              References¶

              Downey, Allen. Think Bayes 2. http://allendowney.github.io/ThinkBayes2/

              Salvatier J., Wiecki T.V., Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 DOI: 10.7717/peerj-cs.55.