Decision Analysis Lecture 11 Tony Cox My e-mail: [email protected] Course web site: http://cox-associates.com/DA/ Agenda Recommended readings Problem set 9 solutions

Problem set 10 Where are we going? Wrap-up on multi-arm bandit (MAB) Thompson sampling Wrap-up on adaptive decisions Optimal stopping and optimal replacement Bayesian networks, CART, partial dependence plots Influence diagrams 2 Recommended Readings Skim Bell (1988), pages 1416-1418, on single-attribute utility theory http://www.people.hbs.edu/dbell/one%20switch.pdf Skim Abbas (2010), pages 62-67 and 7477, on multiattribute utility theory (MAUT) http://create.usc.edu/sites/default/files/publications/tutorialmau.pdf

3 Homework #9, Problem 1 Updating a uniform prior Starting from a uniform prior, U[0, 1], for success probability, you observe 22 successes in 30 trials. What is your Bayesian posterior probability that the success probability is greater than 0.5? Bayesian updating: P(p > 0.5 | x = 22, n = 30) = 1 - pbeta(0.5, 23, 9) = 0.994663 Uses pbeta with updated parameters x + 1 and n + 2 based on x successes in n trials, x = 22, n = 30 Posterior is beta(x + 1, n - x + 1). 4 Homework #9, Problem 2: Spare parts In a manufacturing plant, it costs $10/day to stock 1 spare

part, $20/day to stock 2 spare parts, etc. ($10 per spare part per day). There are 50 machines in the plant. Each machine breaks with probability 0.004 per machine per day. (More than one machine can fail on the same day.) If a spare part is available (in stock) when a machine breaks, it can be repaired immediately, and no production is lost. If no spare part is available when a machine breaks, it is idle until a new part can be delivered (1 day lag). $65 of production is lost. How many spare parts should the plant manager keep in stock to minimize expected loss? 5 Spare parts solution Number of failures in a day has binomial distribution with n = 50 machines and p = 0.004 per machine per day. Mean is np= 50*0.004 = 0.2 failures per day,

so we expect about one breakdown every 5 days The cost of keeping 1 spare is $10/day The cost per machine failure is $65 if no spare is available, else 0 6 Spare parts (cont.) With 0 spares, the average loss per day due to failures is ($65 per unrepaired failure)*(0.2 failures/day) = $13/day With 1 spare, average loss/day $10/day + $65*(1 - pbinom(1, 50, 0.004)) = $11.12 Exact cost < $ 65*dbinom(2,50,0.004) + 2*65*dbinom(3,50,0.004) + 3*65*dbinom(4,50,0.004) + sum(c(4:50))*65*dbinom(5,50,0.004) = $11.34 With 2 spares, average loss > $20/day

So, optimal decision is: Stock 1 spare 7 Homework #9 discussion problem for April 11 (uncollected/ungraded) Choice set: Take or Do Not Take Chance set (states): Sunshine or Rain P(Sunshine) = p = 0.6 Utilities of act-state pairs: u(Take, Sunshine) = 80 u(Take, Rain) = 80 u(Do Not Take, Sunshine) = 100 u(Do Not Take, Rain) = 0 8

Homework #9 discussion problem (uncollected/ungraded) 1. If p = 0.6, find EU (Take) and EU(Dont Take) using Netica Goal is to see how Netica deals with decisions and expected utilities May also try it via simulation 2. Update these EUs if a forecast (with error probability 0.2) predicts rain 9 The Umbrella Problem

Choice set: Take or Do Not Take Chance set (states): Sunshine or Rain P(Sunshine) = p = 0.6 Utilities of act-state pairs: u(Take, Sunshine) = 80 u(Take, Rain) = 80 u(Do Not Take, Sunshine) = 100 u(Do Not Take, Rain) = 0 10 Netica solution without forecast known Weather Sunshine 60.0 Rain 40.0 Take_Umbrella

Take 80.0000 Do Not Take 60.0000 forecast Rain predicted 44.0 Sunshine predicted 56.0 Consequence 11 Netica solution with forecast of rain Weather Sunshine 27.3

Rain 72.7 Take_Umbrella Take 80.0000 Do Not Take 27.2727 forecast Rain predicted 100 Sunshine predicted 0 Consequence

Netica automatically updates EU of different decisions, given observations. Influence diagrams are just as easy to create and solve as Bayesian networks 12 Solving by simulation in R: Part A (no forecast) # Initialize variables weather = utility = NULL; p = 0.6; n = 100000; # simulate weather state weather = rbinom(n, 1, p); # calculate and report simulated probability of sun psun = mean(weather); psun # calculate and report utilities of each action EUtake = psun*80+(1- psun)*80; EUleave = psun*100+(1- psun)*0; EUtake; EUleave 13

Solving by simulation in R: Part A (no forecast) > # Initialize variables > weather = utility = NULL; p = 0.6; n = 100000; > # simulate weather state > weather = rbinom(n, 1, p); > # calculate and report simulated probability of sun > psun = mean(weather); psun [1] 0.59519 > # calculate and report utilities of each action > EUtake = psun*80+(1- psun)*80; > EUleave = psun*100+(1- psun)*0; > EUtake; EUleave [1] 80 [1] 59.519 14 Solving by simulation in R Part B (with forecast)

# Initialize variables weather = forecastsun = utility = NULL; p = 0.6; n = 100000; # simulate weather states and forecasts weather = rbinom(n, 1, p); forecast_sun_if_rain = rbinom(n, 1, 0.2); forecast_sun_if_sun = rbinom(n,1, 0.8); forecastsun = weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain; forecastRain=1-forecastsun # calculate and report desired conditional probability psunifForecastRain = sum(weather*forecastRain)/sum(forecastRain); psunifForecastRain # calculate and report utilities of each action EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80; EUleave = psunifForecastRain*100+(1- psunifForecastRain)*0; EUtake; EUleave 15 Solving by simulation in R Part B (with forecast) > # Initialize variables

> weather = forecastsun = utility = NULL; p = 0.6; n = 100000; > # simulate weather states and forecasts > weather = rbinom(n, 1, p); forecast_sun_if_rain = rbinom(n, 1, 0.2); forecast_sun_if_sun = rbinom(n,1, 0.8); forecastsun = weather*forecast_sun_if_sun + (1 - weather)* forecast_sun_if_rain; forecastRain=1-forecastsun > # calculate and report desired conditional probability > psunifForecastRain = sum(weather*forecastRain)/sum(forecastRain); psunifForecastRain [1] 0.2764975 > # calculate and report utilities of each action > EUtake = psunifForecastRain*80+(1- psunifForecastRain)*80; EUleave = psunifForecastRain*100+(1- psunifForecastRain)*0; EUtake; EUleave [1] 80 [1] 27.64975 16 Homework #10 (optional)

(Due by 4:00 PM, April 18 if you want it graded) A deep sea oil drilling platform has a normally distributed lifetime (until failure) with a mean of 30 years and a standard deviation of 4 years. While it is operating, the platform produces oil worth $10M per year. Voluntarily stopping operations and closing down the platform costs $0. Having the platform fail while still in use leads to involuntary closure and a cost of $50M. At what age should the platform be voluntarily shut down (if it has not yet failed)? Hint: Continue until marginal benefit < expected marginal cost of

continuing Hint: Use a hazard function calculator for normally distributed lifetimes, e.g., http://reliabilityanalyticstoolkit.appspot.com/normal_distribution 17 Where are we going? Student observation 1: Here is my solution for homework 9. I have to say the course is really getting deeper and deeper. And I'm trying to understand all the content. Student observation 2: If I could totally understand and use a few of these techniques, it might be more useful than seeing more. (Exploration vs. exploitation trade-off.) Goals: Be able to use key techniques (e.g., computing EMV with known probabilities and values) that are relatively simple and useful. Be aware of more advanced methods and big ideas that may prove useful; know what to look for or ask for as a manager 18

Top-level ideas Many valuable decision problems can be solved using the philosophy of simulation-optimization: Try different decisions, evaluate their probable consequences, choose the one with best (EMV or EU-maximizing) probability distribution of consequences Both the probability modeling (simulation) and the search for a best decision or decision rule can become very technical For business applications, understanding what problems can be solved and how to formulate them is the most important part. (The rest is just software or using appropriate expertise) 19 MAB Thompson sampling (cont.)

20 Thompson sampling and adaptive Bayesian control: Bernoulli trials Basic idea: Choose each of the k actions according to the probability that it is best Estimate the probability via Bayes rule It is the mean of the posterior distribution Use beta conjugate prior updating for Bernoulli bandit (0-1 reward, fail/succeed) Sample from posterior for each arm, 1 k; choose the one with highest sample value. Update & repeat. S = success F = failure http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012 21

Thompson sampling: General stochastic (random) rewards Second idea: Generalize to arbitrary reward distribution (normalized to the interval [0, 1]) by considering a trial a success with probability equal to its reward http://jmlr.org/proceedings/papers/v23/agrawal12/agrawal12.pdf Agrawal and Goyal, 2012 22 Thompson sampling with complex online actions Main idea: Embed simulationoptimization in Thompson sampling loop = state space, S

Sample the states Applications: Job scheduling (assigning jobs to machines); web advertising with reward depending sets of ads shown Y = observation h = reward, X = random variable depending on Updating posteriors can be done efficiently using a sampling-based approach (particle filtering) Gopalan et al., 2014 http://jmlr.org/proceedings/papers/v32/gopalan14.pdf

23 Comparing methods In simulation experiments, Thompson sampling works well with batch updating, even with slowly or occasionally changing rewards and other realistic complexities. Beats UCB1 in many but not all comparisons More practical than UCB1 for batch updating because it keeps experimenting (trying

actions with some randomness) between updates. http://engineering.richrelevance.com/recommendations-thompson-sampling/ 24 MAB variations Contextual bandits See signal before acting Constrained contextual bandits: Actions constrained Adversarial bandits Adaptive adversaries Bubeck and Slivens, 2012, https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/COLT12_BS.pdf Restless bandits: Probabilities change

Gittins index maximizes expected discounted reward, not easy to compute Correlated bandits 25 Wrap-up on MAB problems Adaptive Bayesian learning works well in simple environments, including many of practical interest The resulting rules are *much* simpler to implement than previous methods (e.g., Gittins index policies) Sampling-based approaches (Thomposn, particle filtering, etc.) promote computationally practical online learning 26 Wrap-up on adaptive learning No need for a causal model

Learn act-consequence probabilities and optimal decision rules directly Assumes a stationary (or slowly changing) decision environment, known choice set, immediate feedback (reward) following action Works very well when these assumptions are met: low-regret learning is possible 27 Optimal stopping 28 Optimal stopping decision problems Suppose that a decision-maker (d.m.) faces a random sequence of opportunities How long to wait for best one?

When to stop and commit to a final choice? Examples: Selling a house, hiring a new employee, accepting a job offer, replacing a component, shuttering an aging facility, taking a parking spot, etc. Other optimal stopping problems: Least-cost policies for replacing aging components 29 Hazard functions: Conditional rate of failure given survival so far Let T = length of life for a component (or person, or time until first occurrence of an event, etc.) T is a random variable with cdf F(t) = Pr(T < t) and survival function S(t) = 1 F(t) = Pr(T > t) The pdf for T is then f(t) = F(t) = dF(t)/dt The hazard function for T is defined as: h(t) = limdt0Pr(t < T < t + dt | T > t)/dt

h(t) = f(t)/S(t) = f(t)/[1 F(t)] Interpretation: instantaneous failure rate h(t)dt Pr(occurs in next dt | survival until t) In discrete time, dt = 1, no limit is taken 30 Using hazard functions to guide decisions The shape of the hazard function can often guide decisions, e.g If h(t) is increasing, then optimal time to stop is when h(t) reaches a certain threshold If h(t) is decreasing, then best decision is either dont start or else continue until failure occurs Normal distribution hazard function calculator is at

http://reliabilityanalyticstoolkit.apps pot.com/normal_distribution SPRT and other calculators: http://reliabilityanalyticstoolkit.apps pot.com/ www.wolfram.com/mathematica/new-in-9/enhanced-probability-and-statistics/define-a-distri -given-its-hazard-function.html https://www.ncss.com/software/ncss/survival-analysis-in-ncss/ 31 Example: optimal age replacement The lifetime T of a component is a random variable with known distribution Suppose it costs $10 to replace the plant before it fails and $50 to replace it if it fails. When should the component be voluntarily replaced (if

not failed yet)? Answer can be calculated by minimizing expected average cost per cycle (or equating marginal benefit to marginal cost for continuing), but calculations are detailed and soon get tedious Alternative: Google optimal replacement age calculator 32 Optimal age replacement calculator http://www.reliawiki.org/index.php/Optimum_Replacement_Time_Example 33 Optimal selling of an asset If offers arrive sequentially from a known distribution and costs of waiting are known, then an optimal decision boundary (blue) can be constructed to maximize EMV

Sell when red line first hits blue decision boundary W(t) = price series S(t) = maximum price so far http://file.scirp.org/Html/9-1040163_25151.htm 34 Optimal stopping: Variations Offers arrive sequentially from an unknown distribution Bayesian updating provides solutions Time pressure: Must sell by deadline, or fixed number of offers With or without being able to go back to previous offers

Sell when blue line first hits green decision boundary http://file.scirp.org/Html/9-1040163_25151.htm 35 Wrap-up on optimal stopping and statistical decision theory Many valuable decision problems can be solved using the philosophy of simulation-optimization: Try different decisions, evaluate their probable consequences Choose the one with best (EMV or EU-maximizing) probability distribution of consequences Finding a best decision or decision rule can become very technical Use appropriate software or on-line calculators

For business applications, understanding how to formulate decision problems and solve them with software can create high value in practice 36 Introduction to descriptive analytics using informationbased algorithms (BN learning, CART trees, randomForest, partial dependence plots) 37 Descriptive analytics: Whats going on? What is the current situation? Attribution: How much harm/loss/opportunity cost is being caused by X?

Causes are often unobserved or uncertain What has changed recently? (Why?) Example: More extreme event reports caused by real change or by media? Change-point analysis (CPA) algorithms What should we worry about? How is this years season shaping up? 38 Air pollution example in CAT* Load data in Excel, click Excel to R to send it to R Los Angeles air basin 1461 days, 2007-2010 (Lopiano et al., 2015, thanks to Stan Young for data) PM2.5 data from CARB

Elderly mortality (AllCause75) from CA Department of Health Daily min and max temps & max relative humidity from ORNL and EPA Risk question: Does PM2.5 exposure increase elderly mortality risk? If so, by how much? (P(death | PM2.5) = ?) Causal Analytics Toolkit, http://cox-associates.com/downloads/ 39 Classification Trees A powerful, popular method for data mining, machine learning, and CPT estimation In R, partytree, ctree, rpart, and other algorithms provide CART (Classification and Regression Tree) algorithms Can download applet from: www.cs.ubc.ca/labs/lci/CIspace/dTree/

Basic idea: Always ask the most informative question next for reducing uncertainty (conditional entropy) about the dependent variable. Partitions a set of cases into groups or clusters (the leaf nodes) with similar conditional probabilities for dependent variable. 40 Air pollution-mortality example: Classification tree descriptive analytics tmin, tmax, month, year, MAXRH are identified as potential predictors of AllCause75 (elderly mortality) PM2.5 does not appear in this tree AllCause75 = elderly mortality count per day is conditionally

independent of PM2.5 in this analysis, given the other variables in the tree Making year and month into categorical variables changes the tree but not this conclusion. 41 How a CART tree works: Concept Basic idea: Always ask the most informative question next, given answers so far. Questions are represented by splits in tree Leaf nodes show conditional means (or conditional distributions) of dependent variable Internal nodes show significance level for split: how significant are differences between conditional distributions Can handle continuous, categorical, ordered categorical, and binary variables and missing values

Reduces prediction error for dependent variable Stop this recursive partitioning when further questions (splits in tree) do not significantly improve prediction. Classification & Regression Tree (CART) algorithm Which variables are informative in predicting dependent variable can depend on what other predictors are included. Taking out month makes PM2.5 informative. Some refinements: Grow a large tree and prune back to minimize cross-validation error fit multiple trees to random subsets of data and let them vote for best splits (bagging) over-train on mis-predicted cases (boosting) average predictions from many trees (RandomForest ensemble prediction) Join prediction patches together smoothly (MARS) 42

How to read a CART tree Tips of the tree (leaf nodes give the conditional expected value (for continuous variables) or conditional distributions (for discrete variables) of the dependent variable, given the value ranges of the variables along the path from the top of the tree (the root node) to the leaf node. The dependent variable is called y by default in the tree Branches (splits) show the value ranges being conditioned on The tips of the tree also show how many cases in the data set are described by the combination of values leading to that leaf node. These are the n values at the leaf nodes Example:

n = 92 days had tmin < 43 degrees and PM2.5 < 14 g/m3. An average of y = 141.696 elderly people per day died on days with this description. 43 Creating a CART tree in CAT 1. 2. Select dependent variable first, by clicking on a column heading in the Data sheet Select predictor columns using Ctrl + click

3. 4. 5. Can select in any order Click on desired analysis (Tree) Output appears on new tab (If unsure what analyses to do after columns selected, click on Analyze.) 44 Classification/Decision Tree Algorithms in Practice Different decision tree (= classification tree) algorithms use different splitting, stopping, and pruning criteria. Examples of tree algorithms: CHAID: chi-square automatic interaction detection

CART: Classification and regression trees, allows continuous as well as discrete variables MARS: Multiple adaptive regression splines, smooths the relations at tree-tips to fit together smoothly KnowledgeSeeker allows multiple splits, variable types ID3, C5.0, etc. 45 Recursively identify parents (or potential parents) for each node. Legend Q1_1_=_5? breakdown 0.000000 1.000000 total 40.2% 59.8% 1424

Q2jjj all questions answered ok ??? 1.000000 to 3.000000 42.0% 58.0% 976 92.0% 8.0% 75 4.000000 5.000000

70.7% 29.3% 99 8.8% 91.2% 274 Q3_b SA offered to demonstrate 1.000000 0.000000 30.6% 69.4% 709 72.3%

27.7% 267 Q8_b easy to use and deal with ??? 34.5% 65.5% 229 1.000000 70.4% 29.6% 27 2.000000 to 4.000000 37.8% 62.2%

254 Q8_b easy to use and deal with 5.000000 9.000000 8.2% 91.8% 182 47.1% 52.9% 17 ??? 64.5% 35.5%

62 1.000000 to 4.000000 85.6% 14.4% 153 5.000000 9.000000 42.3% 57.7% 52 46 Causal graph structure # 1 Q3h reviewed check list

Net1_3 Helped me choose products Q7_1 End-to-end experience NET1_5 Wait time ok Q3f offered to set up on-line billing Q3b offered to demonstrate Q2uu acknowledged when entered store Q2jjj questions answered Classification-tree Tests of CI In a classification tree, the dependent

variable (root node) is conditionally independent of all variables not in the tree, given the variables that are in the tree (at least as far as the tree-growing heuristic can discover). Starting with a childless node (output node), we can recursively seek direct parents of all nodes. 48