Bayesian statistics and Bayesian analysis have become increasingly popular tools for both machine learning and more traditional regression applications. But they are not new. In fact, the mathematics and philosophy of Bayesian analysis go back 300 years to the minister and amateur mathematician Thomas Bayes, who gave his name to the powerful Bayes' Theorem.
Though 300 years old, Bayes' theorem is still a powerful, if sometimes tricky, tool for analysis and even daily decision making.
Here is a probability problem from recent real life:
Around Christmas 2021, my son and I met my son's cousins together with my brother-in-law. The next day, I got a phone call from my brother-in-law, who told me that his oldest daughter had woken up with a cough and had tested positive for corona with a home quick test ("antigen test"). They would take her to get a more reliable PCR test, but it would take a day or two for the results to come in. If the result was confirmed, then we would be considered close contacts and would have to spend 7 days in quarantine over Christmas. Bummer.
But I knew that these tests were not perfect: some people who do not have corona test positive, and some who actually do have corona test negative. So the probability question I had was: what is the probability that my niece actually had corona, given that she had tested positive with a quick test? Let's write this mathematically:
$$P(Corona=1|Test=1)$$

Above, we have written a conditional probability, which we can read as "the probability of corona, given that the test was positive."
So first, I needed some basic information. I searched a bit to find studies of how accurate these quick tests are, and found some information from Cochrane, a non-profit organisation with the goal of providing well-researched and financially independent medical and health information.
From the Cochrane review, a study finds that among people confirmed to have corona, 72% of quick tests were positive (given that the person had symptoms). So is this our answer? Not quite. We can write this probability as:
$$P(Test=1|Corona=1)=.72$$

That is, the probability of a test being positive, given that a person has corona, is .72. It is the mirror image of what we want to find out.
We can still estimate our wished-for probability, but first we need some more information.
We first need information on the general prevalence of corona in the population: before seeing the test result, what is the overall chance that my niece had corona, rather than just a common cold, the flu or a light cough?
At the time, the number of registered corona cases in Norway was relatively low - around 4,000-5,000 registered new cases per day. So it seems fairly conservative to say that at any given time, 1 in 100 people had the disease. But considering that more children were getting sick in that period, and the fact that my niece had been showing some symptoms, perhaps the probability should be closer to 1 in 10. Since we are a bit uncertain, we can create a range of values and say that the prior probability of corona is between:
$$P(Corona=1) = .01$$

and

$$P(Corona=1) = .1$$

That is, before the information we have on the test, we might guess that her chance of having corona is between 1/100 and 1/10.
A final piece of information we need is the rate of false positives: the probability that someone who does not have corona still tests positive. This is related to the "specificity" of the test, which measures the probability that someone who is negative actually tests negative. The data seem to indicate a very high specificity - close to 100 percent for most tests. To be conservative, we can assume a specificity of .99:
$$P(Test=0|Corona=0) = .99$$

Thus, we can find the rate of false positives as:

$$P(Test=1 | Corona=0) = 1-P(Test=0|Corona=0) = .01$$

The total probability of testing positive is then a weighted sum: the probability of testing positive if a person actually has corona (weighted by the proportion of people who have corona), plus the probability of testing positive if a person does not have corona (weighted by the proportion of people who do not). So in the scenario where we think the prior probability of having corona is 1/10:

$$P(Test=1) = P(Test=1 | Corona=0)*P(Corona=0) + P(Test=1|Corona=1)*P(Corona=1) = .01*(9/10) + .72*(1/10)$$

We can calculate both the high and low scenarios:
PT_low = .01*(99/100) + .72*(1/100) #P(Test=1) with prior P(Corona=1)=1/100
PT_high = .01*(9/10) + .72*(1/10) #P(Test=1) with prior P(Corona=1)=1/10
print(PT_low, PT_high)
0.0171 0.08099999999999999
We now have enough information to calculate our desired probability: $P(Corona=1|Test=1)$. Bayesian analysis has sometimes been called inverse probability, since we are deducing the inverse of the information we have, $P(Test=1|Corona=1)$. To find the inverse probability, we use the formula from Thomas Bayes, now called Bayes' Formula or Bayes' Theorem:
$$P(A|B) = \frac{P(A)*P(B|A)}{P(B)}$$

Or specifically, in our case:

$$P(Corona=1|Test=1) = \frac{P(Corona=1)*P(Test=1|Corona=1)}{P(Test=1)}$$

We can then create a low and high scenario:

$$P(Corona=1|Test=1) = \frac{.01*.72}{.0171}$$

$$P(Corona=1|Test=1) = \frac{.1*.72}{.081}$$

PCgT_low = (.01*.72)/PT_low
PCgT_high = (.1*.72)/PT_high
print(PCgT_low, PCgT_high)
0.42105263157894735 0.888888888888889
So depending on what prior information we include, our probability of actually having corona could be as low as 42% or as high as 89%.
One of the key features and advantages of Bayesian analysis is the ability to include prior information - in this case, the information on the probability of having corona before the test result. The result, especially under the low scenario, may seem surprising: a positive test may give us a less-than-50% probability of actually having corona. But we could have arrived at this probability simply by counting out the possible scenarios.
Suppose 1000 people take a test that day, and 990 of them (99%) don't actually have corona. Given the false-positive rate, 990*.01 = 9.9 of those people will test positive (falsely).
Of the 10/1000 people who have corona, .72*10 = 7.2 will test positive.
So out of about 17 positive tests, only 7.2/17 ≈ .42 actually come from someone who has corona.
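We can also check this counting logic with a quick simulation. Below is a minimal sketch (for the 1/100 prior scenario) that randomly assigns corona status and test results according to the probabilities above, and then computes the share of positive tests that come from people who actually have corona:
import numpy as np
rng = np.random.default_rng(1)
n = 1_000_000 #simulated test-takers
has_corona = rng.random(n) < 0.01 #prior P(Corona=1) = 1/100
test_pos = np.where(has_corona,
                    rng.random(n) < 0.72, #sensitivity: P(Test=1|Corona=1)
                    rng.random(n) < 0.01) #false-positive rate: P(Test=1|Corona=0)
print(has_corona[test_pos].mean()) #share of positive tests with corona, ~.42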
This shows the power of being able to include prior information into an analysis (the chance of having corona prior to taking the test), as well as the intuitive power of Bayes.
Yes, the PCR test came out positive and I spent 7 days in quarantine, but I did not get Corona (then).
One way we could interpret the scenario above is that we start with our initial information - my niece wakes up with a cough - and an initial probability of between 1/100 and 1/10 that she has corona.
We then get some new data - the test result - which updates our probability. The updated probability is what we refer to as our posterior probability.
We could continue in this way. In the discussion above, my niece took a PCR test, which of course is also subject to uncertainty. So we could have treated our previous posterior probability as our new prior, and then updated our probability again with the PCR result.
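To sketch what that sequential updating might look like in code: suppose, purely for illustration, that the PCR test has a sensitivity of .95 and a specificity of .99 (made-up numbers, not taken from any study). We reuse Bayes' formula, feeding the quick-test posterior in as the new prior:
def bayes_update(prior, sensitivity, specificity):
    #posterior probability of corona after one positive test result
    p_pos = sensitivity*prior + (1 - specificity)*(1 - prior)
    return sensitivity*prior/p_pos

post1 = bayes_update(0.01, 0.72, 0.99) #quick test, low-prior scenario
post2 = bayes_update(post1, 0.95, 0.99) #hypothetical PCR test
print(post1, post2) #roughly 0.42, then 0.99
So a second, independent positive test would push the probability from around 42% up toward near certainty.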
We'll consider one more simple, yet devilishly unintuitive problem called the Monty Hall problem. Here is a description of the problem from Downey:
The Monty Hall problem is based on a game show called Let's Make a Deal. If you are a contestant on the show, here's how the game works: The host, Monty Hall, shows you three closed doors -- numbered 1, 2, and 3 -- and tells you that there is a prize behind each door. One prize is valuable (traditionally a car), the other two are less valuable (traditionally goats). The object of the game is to guess which door has the car. If you guess right, you get to keep the car. Suppose you pick Door 1. Before opening the door you chose, Monty opens Door 3 and reveals a goat. Then Monty offers you the option to stick with your original choice or switch to the remaining unopened door.
To maximize your chance of winning the car, should you stick with Door 1 or switch to Door 2?
To answer this question, we have to make some assumptions about the behavior of the host:
Monty always opens a door and offers you the option to switch.
He never opens the door you picked or the door with the car.
If you choose the door with the car, he chooses one of the other doors at random.
The intuitive answer for many people is that you still have a 50/50 chance and there is no advantage to switching. But this is wrong! Enumerating the possibilities - making sure to include the information the host uses in choosing which door to open - will convince you that you should switch.
You could try a decision tree to convince yourself of this, or, as below, we will follow Downey in creating a Bayes table.
Below we create a table in the form of a Pandas data frame:
import pandas as pd
from fractions import Fraction
#so that we can use actual fractions (1/3) rather than floating point approximations (.33...)
We start by looking at our three options and the prior probability that the prize is behind each door.
MHtable = pd.DataFrame(index=['Door 1', 'Door 2', 'Door 3'])
MHtable['prior'] = Fraction(1, 3)
MHtable
| | prior |
|---|---|
Door 1 | 1/3 |
Door 2 | 1/3 |
Door 3 | 1/3 |
Now we get the new "data", which is that Monty opens Door 3 to reveal a goat.
So now we consider the likelihood of this data under each hypothesis - the probability that Monty opens Door 3, given where the car actually is. If the car is behind Door 1 (our pick), Monty chooses between Door 2 and Door 3 at random, so the likelihood is 1/2. If the car is behind Door 2, Monty must open Door 3, so the likelihood is 1. And if the car is behind Door 3, Monty will never open it, so the likelihood is 0:
MHtable['likelihood'] = Fraction(1, 2), 1, 0
MHtable
| | prior | likelihood |
|---|---|---|
Door 1 | 1/3 | 1/2 |
Door 2 | 1/3 | 1 |
Door 3 | 1/3 | 0 |
So now the last step is to update our Bayes table according to Bayes' rule:

$$P(A|B) = \frac{P(A)*P(B|A)}{P(B)}$$

So we could write the probability that the car is behind Door 1, given that Door 3 is opened, as:

$$P(Door1|Door3opened) = \frac{P(Door1)*P(Door3opened|Door1)}{P(Door3opened)}$$

$$P(Door1|Door3opened) = \frac{(1/3)*(1/2)}{(1/3)*(1/2) + (1/3)*1} = 1/3$$

The following function will do the Bayesian updating in our table automatically:
def update(table):
    """Compute the posterior probabilities."""
    table['unnorm'] = table['prior'] * table['likelihood']
    prob_data = table['unnorm'].sum()
    table['posterior'] = table['unnorm'] / prob_data
    return prob_data
update(MHtable)
MHtable
| | prior | likelihood | unnorm | posterior |
|---|---|---|---|---|
Door 1 | 1/3 | 1/2 | 1/6 | 1/3 |
Door 2 | 1/3 | 1 | 1/3 | 2/3 |
Door 3 | 1/3 | 0 | 0 | 0 |

The posterior probability that the car is behind Door 2 is 2/3 - twice the probability for Door 1 - so you should switch.
Bayesian statistics and Bayesian regression are, in the simplest sense, the further application of the probabilistic thinking and enumeration we went through in the above examples to statistical analysis and regression.
This means:
We can include prior information in our estimation
We treat parameters as uncertain, which in most modern applications we estimate through simulation
We can easily update our estimates when new data become available
We can propagate uncertainty to derivative values - like predictions or, for example, measures of costs or benefits.
Compare this to traditional frequentist methodology, which is what is mostly taught in introductory courses:
It is not possible to make use of outside, or prior information - all relevant information is assumed to be in the data
Estimated coefficients are assumed to be fixed - they are treated simply as calculated values from the sample. They are compared to a null hypothesis and/or null model, from which we judge whether these values are "statistically significant".
It can be difficult to calculate uncertainty in derivative values, like predictions, and we must often use forms of simulation here as well, such as bootstrapping (a small sketch follows below).
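As a small illustration of that last point, here is a minimal bootstrap sketch with made-up data: we resample the rows of the data with replacement, re-estimate the regression, and recompute the derived value (here, a prediction at x = 1.5) each time:
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1 + 2*x + rng.normal(size=200) #made-up data
preds = []
for _ in range(1000):
    idx = rng.integers(0, len(x), len(x)) #resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], 1) #re-estimate by least squares
    preds.append(intercept + slope*1.5) #derived value: prediction at x = 1.5
print(np.percentile(preds, [2.5, 97.5])) #bootstrap 95% interval for the prediction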
Bayesian analysis and statistics is a big and growing field, and several excellent introductory texts exist for those interested in learning more; see the references at the end of this lab.
One important application of Bayesian analysis is within what is called decision analysis. The basic idea of decision analysis is to propagate the uncertainty from an estimation (like a regression) into some meaningful function that affects a decision, strategy or operation.
Here we will show an example using electricity data and a made-up decision on whether an electricity monopolist should invest in more wind power.
We won't use actual Bayesian estimation, instead relying on OLS. But we will make use of the simulated posterior uncertainty.
The data is real, but the model is very simplified.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as spt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from cycler import cycler
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams["axes.labelsize"]= 12
plt.rcParams["figure.facecolor"] = "#f2f2f2"
#plt.rcParams['figure.savefig.dpi'] = 100
plt.rcParams['savefig.edgecolor'] = "#f2f2f2"
plt.rcParams['savefig.facecolor'] ="#f2f2f2"
plt.rcParams["figure.figsize"] = [16,10]
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['font.size'] = 14
greens = ['#66c2a4','#41ae76','#238b45','#006d2c','#00441b']
multi =['#66c2a4','#1f78b4','#a6cee3','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f']
plt.rcParams["axes.prop_cycle"] = cycler(color=multi)
We'll load in data on the Nordic electricity market at an hourly frequency:
elDF = pd.read_csv("http://jmaurit.github.io/analytics/labs/data/wt_data2.csv")
elDF
| | date | time | hour | wind_SE1 | wind_SE2 | wind_SE3 | wind_SE4 | wind_DK1 | wind_DK2 | SE_nx | ... | PLAtoPL_cap | PLtoPLA_cap | SE4toLT_cap | LTtoSE4_cap | SYSEurMW | SE4EurMW | DK1EurMW | DK2EurMW | hour_ind | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-01-01 00:00:00 | 2016-01-01 00:00:00 | 00 - 01 | 420.0 | 1247.0 | 432.0 | 208.0 | 1314.0 | 113.0 | -2621.0 | ... | 0.0 | 3600.0 | 0.0 | 0.0 | 1639.0 | 1639.0 | 1639.0 | 1639.0 | 0 | m1 |
1 | 2016-01-01 00:00:00 | 2016-01-01 01:00:00 | 01 - 02 | 416.0 | 1214.0 | 419.0 | 187.0 | 1267.0 | 98.0 | -2274.0 | ... | 0.0 | 3900.0 | 0.0 | 0.0 | 1604.0 | 1604.0 | 1604.0 | 1604.0 | 1 | m1 |
2 | 2016-01-01 00:00:00 | 2016-01-01 02:00:00 | 02 - 03 | 417.0 | 1227.0 | 378.0 | 177.0 | 1159.0 | 74.0 | -2370.0 | ... | 0.0 | 3900.0 | 0.0 | 0.0 | 1574.0 | 1574.0 | 1574.0 | 1574.0 | 2 | m1 |
3 | 2016-01-01 00:00:00 | 2016-01-01 03:00:00 | 03 - 04 | 424.0 | 1232.0 | 357.0 | 173.0 | 1160.0 | 61.0 | -2541.0 | ... | 0.0 | 4000.0 | 0.0 | 0.0 | 1557.0 | 1557.0 | 1557.0 | 1557.0 | 3 | m1 |
4 | 2016-01-01 00:00:00 | 2016-01-01 04:00:00 | 04 - 05 | 412.0 | 1245.0 | 326.0 | 161.0 | 1069.0 | 47.0 | -2616.0 | ... | 0.0 | 4100.0 | 0.0 | 0.0 | 1547.0 | 1547.0 | 1547.0 | 1547.0 | 4 | m1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
26299 | 2018-12-31 00:00:00 | 2018-12-31 19:00:00 | 19 - 20 | NaN | NaN | NaN | NaN | 3335.0 | 800.0 | -5452.0 | ... | 2019.0 | 1011.0 | 700.0 | 281.0 | 4883.0 | 4751.0 | 4751.0 | 4751.0 | 19 | m12 |
26300 | 2018-12-31 00:00:00 | 2018-12-31 20:00:00 | 20 - 21 | NaN | NaN | NaN | NaN | 3147.0 | 865.0 | -4919.0 | ... | 976.0 | 1551.0 | 700.0 | 220.0 | 4723.0 | 4582.0 | 4582.0 | 4582.0 | 20 | m12 |
26301 | 2018-12-31 00:00:00 | 2018-12-31 21:00:00 | 21 - 22 | NaN | NaN | NaN | NaN | 2859.0 | 902.0 | -4493.0 | ... | 208.0 | 2229.0 | 700.0 | 204.0 | 4602.0 | 4303.0 | 4303.0 | 4303.0 | 21 | m12 |
26302 | 2018-12-31 00:00:00 | 2018-12-31 22:00:00 | 22 - 23 | NaN | NaN | NaN | NaN | 2708.0 | 792.0 | -4165.0 | ... | 0.0 | 2800.0 | 700.0 | 148.0 | 4555.0 | 3854.0 | 3854.0 | 3854.0 | 22 | m12 |
26303 | 2018-12-31 00:00:00 | 2018-12-31 23:00:00 | 23 - 00 | NaN | NaN | NaN | NaN | 2719.0 | 785.0 | -3915.0 | ... | 0.0 | 3127.0 | 700.0 | 119.0 | 4269.0 | 2570.0 | 2570.0 | 2570.0 | 23 | m12 |
26304 rows × 90 columns
Let's consider the simplest of regressions between the price of wholesale electricity in one of the Danish price areas (DK1) and the amount of wind power production.
(By the way, I would not recommend the below as a good regression model for this data.)
mod1wind = smf.ols("DK1EurMW ~ wind_DK1", data=elDF).fit()
mod1wind.summary()
Dep. Variable: | DK1EurMW | R-squared: | 0.101 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.101 |
Method: | Least Squares | F-statistic: | 2949. |
Date: | Thu, 11 May 2023 | Prob (F-statistic): | 0.00 |
Time: | 13:52:57 | Log-Likelihood: | -2.2683e+05 |
No. Observations: | 26301 | AIC: | 4.537e+05 |
Df Residuals: | 26299 | BIC: | 4.537e+05 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
Intercept | 3958.6249 | 13.804 | 286.771 | 0.000 | 3931.568 | 3985.682 |
wind_DK1 | -0.5077 | 0.009 | -54.305 | 0.000 | -0.526 | -0.489 |
Omnibus: | 2763.096 | Durbin-Watson: | 0.096 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6128.193 |
Skew: | 0.652 | Prob(JB): | 0.00 |
Kurtosis: | 4.973 | Cond. No. | 2.45e+03 |
So from the regression model we see a negative coefficient on the wind power variable. On average, an extra megawatt-hour (MWh) (a standard unit of electricity production) of wind power will lower prices by .51 Euro/MWh.
Let us say that a Danish monopolist electricity utility is deciding whether to invest in a new large off-shore wind park.
We want to run an analysis to determine whether the monopoly should make the investment (we assume they are profit maximizing).
(All these numbers are completely unrealistic.)
We'll start by simulating from the "posterior" of our regression model ("posterior" is in quotes because, formally, when we refer to the posterior of a model we should have a Bayesian model, which we don't).
What the below block of code does is take the point estimates and variance-covariance matrix from the regression, and produce a simulation of a posterior distribution assuming a normal distribution. You can look through the code, but don't worry if you don't understand all the steps.
def regSim(regMod):
    #extract values from regression model
    nmk = regMod.df_resid #n-k
    sigma_hat = np.sqrt(regMod.mse_resid)
    bs_vcov = regMod.cov_params(scale=1)
    bs = regMod.params
    #create simulated values
    sigma_sim = sigma_hat*np.sqrt((nmk/np.random.chisquare(nmk,1)))
    V_sim = np.array(bs_vcov) * sigma_sim**2
    bs_sim = np.random.multivariate_normal(bs, V_sim, 1)
    return([bs_sim.flatten(), sigma_sim])
Now we simulate parameters from our regression model 1000 times, similar to what we have done in earlier labs:
sigma_hat = np.sqrt(mod1wind.mse_resid) #estimate of sigma, sigma_hat
nsim = 1000
bs_sims = []
sigma_sims = []
for s in range(nsim):
    param_sim = regSim(mod1wind)
    bs_sims.append(param_sim[0])
    sigma_sims.append(param_sim[1])
simsDF = pd.DataFrame(bs_sims)
simsDF
| | 0 | 1 |
|---|---|---|
0 | 3971.254744 | -0.514325 |
1 | 3970.117303 | -0.506004 |
2 | 3956.200387 | -0.510918 |
3 | 3950.656899 | -0.505535 |
4 | 3941.609104 | -0.493350 |
... | ... | ... |
995 | 3975.594169 | -0.514547 |
996 | 3951.050708 | -0.492996 |
997 | 3965.325493 | -0.505766 |
998 | 3966.999980 | -0.512246 |
999 | 3962.317763 | -0.510511 |
1000 rows × 2 columns
We can visualise the predictive distribution of our model, as we have also done previously:
x_line = np.arange(0,4000,1)
fig, ax = plt.subplots()
elDF.plot.scatter(x="wind_DK1", y="DK1EurMW", ax=ax, alpha=.2)
for i, row in simsDF.iloc[:100,:].iterrows():
    y_line = row[0] + row[1]*x_line + np.random.normal(0,sigma_hat)
    ax.plot(x_line, y_line, color="grey", alpha=.5)
Now we'll use our simulations to create distributions for our expected prices with and without the investment.
For the no-investment scenario, we will fix the wind power variable at its mean value for simplicity.
For the investment scenario, we add an expected .5*400 = 200 MWh of wind production (400 MW of new capacity at a 50% capacity factor).
meanWind = np.mean(elDF.wind_DK1)
meanWind #mean amount of wind
newWind = .5*400
expectedWind = meanWind + newWind
simsDF["simPrices"] = simsDF.iloc[:,0] + simsDF.iloc[:,1]*meanWind + np.random.normal(0,sigma_hat, nsim)
simsDF["simPricesHighWind"] = simsDF.iloc[:,0] + simsDF.iloc[:,1]*expectedWind + np.random.normal(0,sigma_hat, nsim)
We can now look at the histograms of the expected prices for the two scenarios:
fig, ax = plt.subplots()
simsDF.simPrices.hist(bins=100, ax=ax, label="No investment", alpha=.5)
simsDF.simPricesHighWind.hist(bins=100, ax=ax, label="investment", alpha=.5)
ax.set_xlabel("Predicted prices")
ax.legend()
<matplotlib.legend.Legend at 0x7fc0c2125e50>
Now we define our profit function, which will take the simulated prices and output an expected profit for the two scenarios.
First we define some parameters
InvestmentCost = 300000000 #investment cost
ExistingWindCap = 500 #How much existing wind capacity the company owns
NewWind = 400 #Size in capacity of new investment
cfWind = .5 #capacity factor of wind power
#no running/marginal cost for wind
thermalCap = 1000 #How much thermal capacity the company owns
mcThermal = 2000 #The marginal cost of the thermal capacity owned by the company
n = 1000 #1000 periods, relevant economic period to evaluate profitability.
#only run thermal if price>mc
Now we are ready to create a profit function.
The function takes as input a row of the simulation data frame.
We calculate a variable for operating profit for both scenarios (operProfit, operProfitHighWind). The operating profit is market price times the expected amount of wind power produced. In addition, if the price is above the marginal cost of the thermal generation, then the company also gets operating profit from the difference between the market price and the marginal cost.
The total profit for the no-investment scenario is simply the operating profit multiplied by the number of periods that we consider (n=1000).
For the investment scenario, we must also subtract the investment cost.
The function returns the row with two new columns representing the expected profit under each scenario:
def profitFunction(row):
    #calculate profit without new investment
    operProfit = row.simPrices*ExistingWindCap*cfWind
    if row.simPrices > mcThermal:
        operProfit += (row.simPrices-mcThermal)*thermalCap
    row["profit"] = operProfit*n #multiply predicted operating profit by number of periods
    #calculate profit with new investment
    operProfitHighWind = row.simPricesHighWind*(ExistingWindCap+newWind)*cfWind
    if row.simPricesHighWind > mcThermal:
        operProfitHighWind += (row.simPricesHighWind-mcThermal)*thermalCap
    row["profitHighWind"] = operProfitHighWind*n - InvestmentCost
    return(row)
newSimsDF = simsDF.apply(profitFunction, axis=1)
newSimsDF
| | 0 | 1 | simPrices | simPricesHighWind | profit | profitHighWind |
|---|---|---|---|---|---|---|
0 | 3952.788759 | -0.493896 | 5012.876829 | 3142.103835 | 4.266096e+09 | 1.941840e+09 |
1 | 3957.635557 | -0.503402 | 3282.681386 | 712.747128 | 2.103352e+09 | -5.053851e+07 |
2 | 3957.711510 | -0.503402 | 1146.672129 | 2941.187571 | 2.866680e+08 | 1.670603e+09 |
3 | 3933.209222 | -0.491248 | 3878.110936 | 2215.606619 | 2.847639e+09 | 6.910689e+08 |
4 | 3969.419491 | -0.504066 | 1409.517923 | 4814.874590 | 3.523795e+08 | 4.200081e+09 |
... | ... | ... | ... | ... | ... | ... |
995 | 3956.120951 | -0.507384 | 3659.349398 | 3047.835291 | 2.574187e+09 | 1.814578e+09 |
996 | 3966.880672 | -0.515162 | 4529.915873 | 3013.219574 | 3.662395e+09 | 1.767846e+09 |
997 | 3989.846916 | -0.523470 | 3797.144854 | 3008.258164 | 2.746431e+09 | 1.761149e+09 |
998 | 3979.523098 | -0.521184 | 3466.670350 | 3837.460831 | 2.333338e+09 | 2.880572e+09 |
999 | 3959.795794 | -0.511533 | 2441.090415 | 2737.520674 | 1.051363e+09 | 1.395653e+09 |
1000 rows × 6 columns
We can now look at the distributions of potential profit scenarios
fig, ax = plt.subplots()
newSimsDF.profit.hist(bins=100, ax=ax, label="No investment", alpha=.5)
newSimsDF.profitHighWind.hist(bins=100, ax=ax, label="investment",alpha=.5)
ax.set_xlabel("Predicted profit")
ax.legend()
<matplotlib.legend.Legend at 0x7fc0c1c27e50>
Now that we have a distribution of possible profits for the two scenarios, we need to devise a criterion for making our decision.
A straightforward criterion would be to compare the expected (mean) profit and see which is higher.
expectedProfit = newSimsDF.profit.mean()
expectedProfitHighWind = newSimsDF.profitHighWind.mean()
print("Expected Profit, no investment", expectedProfit)
print("Proift, investment",expectedProfitHighWind)
Expected Profit, no investment 2436272596.907039 Proift, investment 2286089024.467252
expectedProfitHighWind>expectedProfit
False
Under this rule, we would not make the investment. But this is by no means the only viable decision rule. We could, for example, also look at which decision leads to the smallest probability of loss.
nLoss = np.sum(newSimsDF.profit<0)/nsim
nLossHighWind = np.sum(newSimsDF.profitHighWind<0)/nsim
print("Probability of loss, no investment", nLoss*100)
print("Probability of loss, investment", nLossHighWind*100)
Probability of loss, no investment 0.2 Probability of loss, investment 3.4000000000000004
Here the probability of a loss is higher if we invest.
Now imagine a scenario where an independent power producer wanted to come into the market and invest in the wind park. Then we could analyse the profitability of the wind park itself, without taking into account the effect on the other generators.
We do this below
windParkNetProfits = newSimsDF.simPricesHighWind*NewWind*cfWind*n-InvestmentCost
fig, ax = plt.subplots()
windParkNetProfits.hist(bins=50, ax=ax)
ax.set_ylabel("Net profit of wind park independently")
Text(0, 0.5, 'Net profit of wind park independently')
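We could then apply the same decision criteria as before to the stand-alone wind park, reusing the windParkNetProfits series computed above:
print("Expected net profit of wind park:", windParkNetProfits.mean())
print("Probability of loss:", (windParkNetProfits<0).mean()*100, "%")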
In the above example, we ran a regression using OLS and then simulated from the results, giving a Bayesian interpretation to a frequentist estimation. This is a limited way of doing a Bayesian(-like) analysis. In order to estimate truly Bayesian models, where we can incorporate prior information and propagate uncertainty, we need to make use of a simulation methodology called Markov chain Monte Carlo (MCMC). The details are beyond the scope of this course, but we can still make use of tools for Bayesian analysis using MCMC in Python.
The main package for Bayesian analysis and regression using MCMC in Python is called PyMC.
PyMC, like other MCMC simulation tools, makes use of a C++ compiler to run code and models. This means there is a system requirement to have a C++ compiler installed and correctly set up on your PC/Mac. My experience is that this can be a source of many technical problems, which can be difficult to resolve.
We'll go through some installation instructions, but if you can't get PyMC installed and working, I suggest simply skipping this portion of the lab. There is no expectation to include Bayesian analysis with MCMC simulation in a final project.
We'll follow the instructions on the PyMC website for installing.
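At the time of writing, the PyMC website recommends creating a dedicated conda environment with commands roughly like the following (check the website for the current command and version number):
conda create -c conda-forge -n pymc_env "pymc>=5"
conda activate pymc_env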
The above assumes you have installed Anaconda, which you should have if you have been following the course.
The environment is basically a workspace with a different configuration of Python. This means that every time you want to use PyMC, you first need to activate the environment by typing conda activate pymc_env (or whatever you called it) in your terminal. And when you want to quit that environment (and go back to your standard environment), you would type conda deactivate.
If you have a mac, you may also need to install Xcode (you can find it in the mac app store). Once you have installed Xcode, you may also need to install Command Line tools. This you can do by the following command in the terminal:
xcode-select --install
We'll first go through the Bayesian regression model as discussed in this tutorial to get a handle on using PyMC. You should read through the entire documentation.
Before we start, you may need to close your jupyter notebook and terminal. Then do the following steps:
Open a new terminal window
Activate your pymc environment by writing
conda activate pymc_env
You may also need to re-install jupyter notebooks for your pymc environment
conda install jupyter
Navigate to the file where you have your notebook, then open your notebook.
Then we should be ready to go
(Except if you get an error message about not being able to run iPython kernel in the pymc environment. You may then need to run your code directly in the terminal.)
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import xarray as xr
from pymc import HalfCauchy, Model, Normal, sample
print(f"Running on PyMC v{pm.__version__}")
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
%config InlineBackend.figure_format = 'retina'
az.style.use("arviz-darkgrid")
Below, a simple "fake" dataset is generated:
size = 200
true_intercept = 1
true_slope = 2
x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + rng.normal(scale=0.5, size=size)
data = pd.DataFrame(dict(x=x, y=y))
Here is a visualization of the data together with the "true" regression line
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel="x", ylabel="y", title="Generated data and underlying model")
ax.plot(x, y, "x", label="sampled data")
ax.plot(x, true_regression_line, label="true regression line", lw=2.0)
plt.legend(loc=0);
Below is the standard PyMC setup of a Bayesian model.
It starts by defining prior distributions for the model: for the model uncertainty sigma, for the intercept term, and for the coefficient on the x term. These priors are designed to be fairly "weakly informative" or "semi-informative"; that is to say, we are not imposing much prior knowledge on the distributions, only some "regularization" that automatically discounts outlier data points. The Bayesian estimates will therefore probably be similar to maximum likelihood or OLS.
Here we are writing that we consider the y-variable to be normally distributed with mean values equal to the regression equation:
$$Y \sim N(\alpha + \beta x, \sigma)$$

Notice what this means: we are modelling each Y-value as having its own probability distribution, whereas with our typical modelling we give each Y-value a single "fitted value".
Finally, we tell our MCMC algorithm (called a "sampler" in the sense that we are randomly sampling data from the probability model we have defined), to run 3000 simulations. Hopefully the below code will run for you:
with Model() as model:  # model specifications in PyMC are wrapped in a with-statement
    # Define priors
    sigma = HalfCauchy("sigma", beta=10)
    intercept = Normal("Intercept", 0, sigma=20)
    slope = Normal("slope", 0, sigma=20)
    # Define likelihood
    likelihood = Normal("y", mu=intercept + slope * x, sigma=sigma, observed=y)
    # Inference!
    # draw 3000 posterior samples using NUTS sampling
    idata = sample(3000)
At a practical level, a big difference between Bayesian analysis and the tools we have been using is that we are not doing a form of optimization - that is, finding parameters that maximize the likelihood (maximum likelihood) or minimize squared errors (OLS).
Instead, we are setting up a probability model (prior, likelihood/data etc) and then "sampling" from the resulting posterior distribution.
So instead of getting as output single point estimates for our coefficients, we get sampled values for each coefficient. Basically, we get as output distributions for each parameter.
We can use the package arviz (az) to plot one of the distributions of our three parameters (the intercept, the coefficient on our x variable, and sigma):
az.plot_trace(idata, var_names="slope")
array([[<AxesSubplot: title={'center': 'slope'}>, <AxesSubplot: title={'center': 'slope'}>]], dtype=object)
If we wanted to generate a point estimate we could, for example, take the mean value of our distribution for a given parameter - for example, the coefficient on our slope variable:
idata.posterior.slope
<xarray.DataArray 'slope' (chain: 4, draw: 3000)> array([[2.07844938, 2.00946818, 1.98094884, ..., 1.92221807, 2.09151181, 2.222756 ], [1.96990623, 2.05698143, 2.0651126 , ..., 2.09183843, 2.08851394, 2.1722263 ], [2.25853188, 2.16465264, 1.72929357, ..., 1.82192198, 1.97196397, 2.13588937], [2.13479654, 2.02128419, 2.03796066, ..., 2.0216993 , 1.86120875, 2.24262356]]) Coordinates: * chain (chain) int64 0 1 2 3 * draw (draw) int64 0 1 2 3 4 5 6 7 ... 2993 2994 2995 2996 2997 2998 2999
np.mean(idata.posterior.slope)
<xarray.DataArray 'slope' ()> array(2.05917285)
And we could generate a standard error by simply calculating the standard deviation of the posterior samples:
np.std(idata.posterior.slope)
<xarray.DataArray 'slope' ()> array(0.12981255)
We can also generate a table of summary statistics
az.summary(idata.posterior, kind="stats")
| | mean | sd | hdi_3% | hdi_97% |
|---|---|---|---|---|
Intercept | 1.046 | 0.075 | 0.910 | 1.187 |
slope | 2.059 | 0.130 | 1.820 | 2.310 |
sigma | 0.524 | 0.026 | 0.474 | 0.574 |
We can then plot a posterior predictive chart:
with model:
    pm.sample_posterior_predictive(idata, extend_inferencedata=True, random_seed=rng)
az.plot_ppc(idata, num_pp_samples=100)
Sampling: [y]
<AxesSubplot: xlabel='y / y'>
The posterior predictive check compares the distribution of the y-variable in the actual data with distributions of simulated data generated from the estimated model. If the two do not align well, then we probably have a big problem.
Here we will run a regression comparing earnings and height, taken from Regression and Other Stories (ROS), using Bayesian MCMC.
We start by downloading the dataset from the ROS website
earnings = pd.read_csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Earnings/data/earnings.csv")
earnings
| | height | weight | male | earn | earnk | ethnicity | education | mother_education | father_education | walk | exercise | smokenow | tense | angry | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 74 | 210.0 | 1 | 50000.0 | 50.0 | White | 16.0 | 16.0 | 16.0 | 3 | 3 | 2.0 | 0.0 | 0.0 | 45 |
1 | 66 | 125.0 | 0 | 60000.0 | 60.0 | White | 16.0 | 16.0 | 16.0 | 6 | 5 | 1.0 | 0.0 | 0.0 | 58 |
2 | 64 | 126.0 | 0 | 30000.0 | 30.0 | White | 16.0 | 16.0 | 16.0 | 8 | 1 | 2.0 | 1.0 | 1.0 | 29 |
3 | 65 | 200.0 | 0 | 25000.0 | 25.0 | White | 17.0 | 17.0 | NaN | 8 | 1 | 2.0 | 0.0 | 0.0 | 57 |
4 | 63 | 110.0 | 0 | 50000.0 | 50.0 | Other | 16.0 | 16.0 | 16.0 | 5 | 6 | 2.0 | 0.0 | 0.0 | 91 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1811 | 61 | 120.0 | 0 | 15000.0 | 15.0 | White | 18.0 | 18.0 | 18.0 | 6 | 1 | 2.0 | 0.0 | 0.0 | 82 |
1812 | 64 | 130.0 | 0 | 8000.0 | 8.0 | White | 12.0 | 12.0 | 12.0 | 1 | 1 | 1.0 | 7.0 | 7.0 | 33 |
1813 | 72 | 194.0 | 1 | 60000.0 | 60.0 | White | 12.0 | 12.0 | 12.0 | 2 | 1 | 2.0 | 0.0 | 0.0 | 50 |
1814 | 63 | 155.0 | 0 | 15000.0 | 15.0 | Other | 14.0 | 14.0 | 14.0 | 6 | 1 | 2.0 | 2.0 | 2.0 | 69 |
1815 | 68 | 150.0 | 1 | 6000.0 | 6.0 | White | 12.0 | 12.0 | 12.0 | 1 | 6 | 1.0 | 2.0 | 2.0 | 27 |
1816 rows × 15 columns
Let's start by transforming the variables in our analysis to be standardized: we subtract the mean and divide by the standard deviation.
earn_s = (earnings.earn - np.mean(earnings.earn))/np.std(earnings.earn)
height_s = (earnings.height - np.mean(earnings.height))/np.std(earnings.height)
male = earnings.male
We start with an analysis with weakly informative priors:
with Model() as earnings_mod:  # model specifications in PyMC are wrapped in a with-statement
    # Define our data
    earnings = pm.MutableData("earnings", earn_s)
    isMale = pm.MutableData("isMale", male)
    height = pm.MutableData("height", height_s)
    # Define priors
    sigma = HalfCauchy("sigma", beta=10)
    intercept = Normal("Intercept", 0, sigma=10)
    beta_height = Normal("beta_height", 0, sigma=10)
    beta_male = Normal("beta_male", 0, sigma=10)
    # Define likelihood
    likelihood = Normal("y", mu=intercept + beta_height * height + beta_male*isMale, sigma=sigma, observed=earnings)
    # Inference!
    # draw 3000 posterior samples using NUTS sampling
    earnings_trace = sample(3000)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [sigma, Intercept, beta_height, beta_male]
Sampling 4 chains for 1_000 tune and 3_000 draw iterations (4_000 + 12_000 draws total) took 21 seconds.
az.plot_trace(earnings_trace, var_names=["beta_height", "beta_male"])
array([[<AxesSubplot: title={'center': 'beta_height'}>, <AxesSubplot: title={'center': 'beta_height'}>], [<AxesSubplot: title={'center': 'beta_male'}>, <AxesSubplot: title={'center': 'beta_male'}>]], dtype=object)
Controlling for gender, the mean value of the height coefficient can be estimated as:
np.mean(earnings_trace.posterior.beta_height)
<xarray.DataArray 'beta_height' ()> array(0.10952402)
That is, about .11, which we can interpret to mean that a 1 standard deviation change in height is associated with a .11 standard deviation increase in earnings.
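If we want the coefficient back on the original scale (dollars of earnings per inch of height), we can undo the standardization by multiplying by the ratio of the standard deviations. A quick sketch - note that we reload the raw data, since the name earnings was reused inside the model block above:
raw = pd.read_csv("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Earnings/data/earnings.csv")
beta_std = float(np.mean(earnings_trace.posterior.beta_height))
beta_orig = beta_std*np.std(raw.earn)/np.std(raw.height) #dollars of earnings per inch of height
print(beta_orig)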
We can also calculate a 95% credible interval using the hdi (highest density interval) function in arViz:
az.hdi(earnings_trace.posterior.beta_height,hdi_prob=.95)
<xarray.Dataset> Dimensions: (hdi: 2) Coordinates: * hdi (hdi) <U6 'lower' 'higher' Data variables: beta_height (hdi) float64 0.04898 0.1704
We can also easily generate a summary table:
az.summary(earnings_trace)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
Intercept | -0.176 | 0.033 | -0.237 | -0.114 | 0.000 | 0.000 | 7191.0 | 8287.0 | 1.0 |
beta_height | 0.110 | 0.031 | 0.051 | 0.168 | 0.000 | 0.000 | 7159.0 | 7589.0 | 1.0 |
beta_male | 0.473 | 0.065 | 0.350 | 0.595 | 0.001 | 0.001 | 6665.0 | 7616.0 | 1.0 |
sigma | 0.950 | 0.016 | 0.921 | 0.980 | 0.000 | 0.000 | 7901.0 | 7768.0 | 1.0 |
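As a quick cross-check of these results, we can fit the same model with OLS and compare; with weakly informative priors like these, the posterior means should land close to the least-squares estimates. A sketch reusing earn_s, height_s and male from above:
import statsmodels.formula.api as smf
olsDF = pd.DataFrame({"earn_s": earn_s, "height_s": height_s, "male": male})
ols_mod = smf.ols("earn_s ~ height_s + male", data=olsDF).fit()
print(ols_mod.params) #compare with the posterior means in the table above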
In practice, for many modelling tasks, Bayesian estimation will give similar results to maximum likelihood or OLS. But there are multiple applications where Bayesian analysis works particularly well and where it would be difficult or impossible to do the same analysis with maximum likelihood or OLS estimation. We go quickly through a few of these applications:
The most direct, and sometimes controversial, advantage of Bayesian analysis is the ability to integrate outside information through informative priors. As we have seen, it is a misconception that you need to have prior information. But if you have information outside of the dataset you are analysing that you want to integrate into the analysis, you can do so through an informative prior. ROS discusses some examples of including informative priors in an analysis in 9.4 and 9.5.
One of the less well understood advantages of Bayesian analysis is the use of weakly informative priors to regularize and create robust inference. This essentially means having priors that do not impose any strong assumptions on the data, but which will automatically discount outlier values and keep inference stable. ROS discusses this in 9.5.
Because of the ease with which we can propagate uncertainty in an estimation or regression, Bayesian statistics is often used in decision analysis. That is, we can take the inference from, for example, a regression and then use it to make some decision. Above, I gave a simple example with made-up data. A more fully worked example is provided by Thomas Wiecki.
This is one of the main modern applications of Bayesian methods. In fact, you have probably seen a hierarchical model in practice: the election forecasting models used by the New York Times, The Economist and 538.com (among others) are Bayesian hierarchical models. Here you have multiple sources of information over different geographic and other groupings. You have many surveys of voter intentions - in the case of the US, for different states or even counties. You also have national surveys. How do you combine all that information to make a single forecast for who will win a presidential election? You need a model that takes state-level data for each state and aggregates upward. But you also want to use information from national surveys to inform the state-level inference. You can do this with Bayesian multilevel models. ROS Ch. 22 gives an introduction to multilevel models, but we do not cover them further in this course.
Modern time series analyses are often formulated as "state-space" models - basically meaning that you analyse a time series as a process of transitions between various unobserved states (like no pandemic to pandemic, or peace in Europe to war in Europe). The flexibility of Bayesian analysis is useful for estimating the parameters of such models. We discuss time series models at the end of this course, but we do not take an explicitly Bayesian approach.
Bayesian methods and Bayesian thinking are common within machine learning and artificial intelligence. You may, for example, encounter naive Bayes classification or Bayesian networks. We leave such methods for courses specializing in machine learning and AI.
(You do not need to use PyMC in order to complete the following problems.)
Suppose you have two coins in a box. One is a normal coin with heads on one side and tails on the other, and one is a trick coin with heads on both sides. You choose a coin at random and see that one of the sides is heads. What is the probability that you chose the trick coin?
(From Think Bayes by Allen Downey)
There are many variations of the Monty Hall problem. For example, suppose Monty always chooses Door 2 if he can, and only chooses Door 3 if he has to (because the car is behind Door 2). If you choose Door 1 and Monty opens Door 2, what is the probability the car is behind Door 3? If you choose Door 1 and Monty opens Door 3, what is the probability the car is behind Door 2?
(From Think Bayes by Allen Downey)
Simulation for decision analysis: An experiment is performed to measure the efficacy of a television advertising program. The result is an estimate that each minute spent on a national advertising program will increase sales by € 500 000, and this estimate has a standard error of €200 000. Assume the uncertainty in the treatment effect can be approximated by a normal distribution. Suppose ads cost €300 000 per minute. What is the expected net gain for purchasing 20 minutes of ads? What is the probability that the net gain is negative?
Based on a regression you have completed earlier or a new regression, construct a decision analysis problem and find a solution.
Downey, Allen. Think Bayes 2. http://allendowney.github.io/ThinkBayes2/
Salvatier J., Wiecki T.V., Fonnesbeck C. (2016) Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2:e55 DOI: 10.7717/peerj-cs.55.