10-405 Big ML: Matrix Factorization

What is MF and what can you do with it?

Recovering latent factors in a matrix: an n-row, m-column matrix V (entries v_ij) is approximated by the product of an n*K matrix W and a K*m matrix H. [Figure: V ~ W H, drawn with K = 2; W has columns (x_1..x_n) and (y_1..y_n), H has rows (a_1..a_m) and (b_1..b_m).]

What is this for?

MF for collaborative filtering

What is collaborative filtering? Recovering latent factors in a matrix whose n rows are users and m columns are movies.
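The V ~ W H factorization picture can be sketched in a few lines of numpy (illustrative sizes and random factors; none of this is from the slides):

```python
import numpy as np

n, m, K = 6, 4, 2               # n rows, m columns, K latent factors
rng = np.random.default_rng(0)
W = rng.normal(size=(n, K))     # n*K matrix of row factors (the x's and y's)
H = rng.normal(size=(K, m))     # K*m matrix of column factors (the a's and b's)
V = W @ H                       # the reconstructed n*m matrix
assert V.shape == (n, m)
# a product of an n*K and a K*m matrix has rank at most K
assert np.linalg.matrix_rank(V) <= K
```

The point of the picture: V has n*m entries, but W and H together have only (n + m)*K parameters, so a good factorization is a big compression when K is small.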

[Figure: V[i,j] = user i's rating of movie j; the n-users * m-movies matrix V ~ W H, with W an n*K matrix of user factors and H a K*m matrix of movie factors.]

MF for image modeling

MF for images: 1000 images * 10,000 pixels gives a matrix with 10,000,000 entries, factored with 2 prototypes (principal components PC1 and PC2).
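To make the collaborative-filtering picture concrete, here is a small numpy sketch (random stand-in factors, not real ratings data): a predicted rating V[i,j] is just the dot product of user i's factor row and movie j's factor column.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_movies, K = 5, 7, 2
W = rng.normal(size=(n_users, K))    # row i = latent factors for user i
H = rng.normal(size=(K, n_movies))   # column j = latent factors for movie j

def predict_rating(i, j):
    # predicted V[i, j] = dot(user factors, movie factors)
    return W[i] @ H[:, j]

V_hat = W @ H                        # all predicted ratings at once
assert np.isclose(predict_rating(2, 3), V_hat[2, 3])
```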

[Figure: each image (a row x_i of length 10,000) is approximated as a weighted combination of the two prototype rows PC1 and PC2 of H; V[i,j] = pixel j in image i.]

MF for modeling text

Recovering latent factors in a matrix: a doc-term matrix with n documents (rows) and m terms (columns), where V[i,j] = TF-IDF score of term j in doc i. The example documents (from the LSA tutorial) are book titles:
- The Neatest Little Guide to Stock Market Investing
- Investing For Dummies, 4th Edition
- The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
- The Little Book of Value Investing
- Value Investing: From Graham to Buffett and Beyond
- Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
- Investing in Real Estate, 5th Edition
- Stock Investing For Dummies
- Rich Dad's Advisors: The ABCs of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/
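One standard way to recover latent "topic" factors from a doc-term matrix is a truncated SVD — this is the heart of LSA. A minimal sketch on a tiny made-up matrix (not the book data above; real LSA would use TF-IDF scores):

```python
import numpy as np

# Tiny made-up doc-term matrix: 4 docs, 5 terms.
V = np.array([[2., 1., 0., 0., 0.],
              [1., 2., 1., 0., 0.],
              [0., 0., 1., 2., 1.],
              [0., 0., 0., 1., 2.]])
U, s, Vt = np.linalg.svd(V, full_matrices=False)
K = 2
W = U[:, :K] * s[:K]   # n docs x K: doc-topic weights
H = Vt[:K, :]          # K x m terms: topic-term weights
V_hat = W @ H          # best rank-K approximation of V in least squares
assert V_hat.shape == V.shape
assert np.linalg.norm(V - V_hat) < np.linalg.norm(V)
```

Rows of W place similar documents near each other in the K-dimensional topic space even when they share few exact terms.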

Recovering latent factors in a matrix: a doc is a weighted sum of topics. [Figure: factoring the n-documents * m-terms matrix yields topic factors; one topic puts high weight on terms like "dummy, stock, saving, advice, ...", another on "estate, land, invest, rich"; V[i,j] = TF-IDF score of term j in doc i.] For example, "Investing in Real Estate" and "Rich Dad's Advisors: The ABCs of Real Estate Investing" load on the real-estate topic, while "The Little Book of Common Sense Investing" and "The Neatest Little Guide to Stock Market Investing" load on the stock-market topic.

MF vs other learning tasks

MF is like linear regression. More precisely, MF is like multiple-output, multi-variable linear regression: ŷ_1 = w_1 · x, ŷ_2 = w_2 · x, ...

Multi-output linear regression as MF: [Figure: the example matrix X (n examples, rows x_i1, x_i2, ...) times the weight matrix W (m weight vectors, one column per output: a's, b's, ...) gives Y ~ X W; column 1 of Y is output 1, column 2 is output 2, and so on.]
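The multi-output picture amounts to solving Y ~ X W for all m outputs at once. A sketch with synthetic, noiseless data (sizes and names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 50, 3, 2                   # n examples, d features, m outputs
X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, m))     # one weight vector per output (columns)
Y = X @ W_true                       # noiseless targets for the sketch
# least squares solves all m output columns in one call
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(W_hat, W_true)
```

The key difference from MF: here X is fixed and only W is learned, whereas in MF both factors are unknown.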

MF is like clustering.

k-means clustering: each point is in one cluster; each cluster is a weighted average of its points (the centroids).

k-means as MF: [Figure: the original data set X (n examples) ~ Z M, where Z is an n*r 0/1 indicator matrix over the r clusters (one 1 per row) and M holds the cluster means.]

MF is soft clustering: each example is a weighted sum of clusters. [Figure: V ~ W H with W n*K and H K*m, as before; the rows of W are now soft cluster memberships rather than 0/1 indicators.]

How do you do MF?

KDD 2011 talk pilfered from ..
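The k-means-as-MF picture can be checked directly: with a 0/1 indicator matrix Z fixed, the least-squares M in X ~ Z M is exactly the matrix of cluster means. A small sketch (toy data and hand-picked assignments, for illustration only):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])  # two obvious clusters
assign = np.array([0, 0, 1, 1])        # cluster of each example
r = 2
Z = np.eye(r)[assign]                  # n x r 0/1 indicator matrix (one 1 per row)
# least-squares M for X ~ Z M; since Z^T Z is diagonal with cluster sizes,
# each row of M is the mean of that cluster's points
M = np.linalg.solve(Z.T @ Z, Z.T @ X)
assert np.allclose(M[0], [0., 0.5])    # centroid of the first cluster
X_hat = Z @ M                          # reconstruction: each point -> its centroid
assert X_hat.shape == X.shape
```

Relaxing Z from 0/1 rows to arbitrary nonnegative weights is exactly the step from hard to soft clustering.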

Recovering latent factors in a matrix: [Figure: the n-users * m-movies ratings matrix V ~ W H with rank r; V[i,j] = user i's rating of movie j.]

Matrix factorization as SGD: repeatedly pick an observed entry and take a gradient step with some step size. Why does this work? Here's the key claim: [the claim is an equation on the slide, lost in extraction].

Checking the claim: think of SGD for logistic regression. The LR loss compares y and ŷ = dot(w, x); MF is similar, but now we update both w (the user weights) and x (the movie weights).

What loss functions are possible?

Other approaches: limited-memory quasi-Newton; ALS = alternating least squares.

KDD 2011 talk pilfered from ..

Comparison: iterative SGD (no mixing); limited-memory quasi-Newton (param mixing); alternating least squares (IPM).
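A minimal sketch of MF by SGD on squared loss, assuming fully observed entries and illustrative hyperparameters (this is not the talk's code). Note that each update touches only row i of W and column j of H, which is what makes the distributed scheme below possible:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, K, lr = 8, 6, 2, 0.05
W_true = rng.normal(size=(n, K))
H_true = rng.normal(size=(K, m))
V = W_true @ H_true                    # pretend these are the observed ratings
obs = [(i, j) for i in range(n) for j in range(m)]

W = 0.1 * rng.normal(size=(n, K))      # small random init
H = 0.1 * rng.normal(size=(K, m))
for epoch in range(300):
    for k in rng.permutation(len(obs)):
        i, j = obs[k]
        err = V[i, j] - W[i] @ H[:, j] # residual on this single entry
        w_i = W[i].copy()
        W[i]    += lr * err * H[:, j]  # the step touches only row i of W ...
        H[:, j] += lr * err * w_i      # ... and column j of H
assert np.abs(V - W @ H).max() < 0.5   # reconstruction error shrinks toward 0
```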

Matrix factorization as SGD - why does this work? Here's the key claim again: [equation lost in extraction].

[Figure: distributed SGD by stratification. V is split into 3*3 blocks V_bc, W into row blocks W1-W3, and H into column blocks H1-H3. Stratum 1 = {V11, V22, V33}, stratum 2 = {V12, V23, V31}, stratum 3 = {V13, V21, V32}; within a stratum the blocks share no rows of W and no columns of H, so Node 1, Node 2, and Node 3 each process one block in parallel. Epoch 1 cycles through the strata.]

Comparison (repeated): iterative SGD (no mixing); limited-memory quasi-Newton (param mixing); alternating least squares (IPM).

Hadoop scalability: Hadoop process setup time starts to dominate.
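The stratification trick can be sanity-checked in a few lines: under a block schedule where stratum s pairs row-block b with column-block (b + s) mod 3 (my reading of the figure), the blocks inside any one stratum touch disjoint parts of W and H, so their SGD updates never conflict.

```python
# 3x3 blocking of V: stratum s contains blocks (b, (b + s) % 3) for b = 0, 1, 2.
for s in range(3):
    blocks = [(b, (b + s) % 3) for b in range(3)]
    rows = [b for b, _ in blocks]      # row-blocks of W touched in this stratum
    cols = [c for _, c in blocks]      # column-blocks of H touched in this stratum
    # all distinct: the three nodes can run SGD simultaneously without locking
    assert len(set(rows)) == 3 and len(set(cols)) == 3
```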

MF is like logistic regression.

Linear regression as MF: [Figure: the training data X (n examples, rows x_i1, x_i2, ...) times W (m weight vectors, columns a's and b's) gives Y ~ X W; ŷ_1 = w_1 · x, ŷ_2 = w_2 · x; column j of Y is output j.]

Logistic? regression as MF: [Figure: the same X W ~ Y picture, now with the outputs passed through the logistic function.]

Vectorizing logistic regression: many ML methods can be rewritten using nothing but vector-matrix operations ("vectorizing"). Why do this? Simpler (once you understand it well); faster (given the right infrastructure, e.g., numpy, GPUs, ...); can simplify optimization (more later).

Vectorized minibatch logistic regression. Computation we'd like to vectorize: for each x in the minibatch, compute the prediction; then for each feature j, update w using feature j's gradient component.

Computation we'd like to parallelize: for each x in the minibatch Xbatch, compute p = 1 / (1 + exp(-x · w)); stacking the minibatch rows, this is p = 1 / (1 + exp(-Xbatch w)), applied elementwise.

In numpy, if M is a matrix, M + 1 does the right thing (elementwise), and so do M.exp(), M.dot(), ... (the .exp() method is on the matrix class used here; plain ndarrays use np.exp(M)):

def logistic(X): return ((-X).exp() + 1).reciprocal()  # 1/(1+exp(-X)); the slide omitted the negation
p = logistic(Xb.dot(w))  # B rows, 1 column

Binary to softmax logistic regression: recall the binary case, p = 1 / (1 + exp(-x · w)).

In the K-class case, softmax replaces the sigmoid: prob[i, k] = exp((XW)[i, k]) / Σ_k' exp((XW)[i, k']).

Matrix-multiply, then exponentiate component-wise; prob will have B rows and K columns, and each row will sum to 1. Sum along each row to get the denominator; keepdims=True means that line works correctly even though a and a_sum have different shapes (broadcasting).
http://minpy.readthedocs.io/en/latest/get-started/logistic_regression

The gradient step: x.T dot dy, where dy is the error on each example x in the batch and each class y. (Python bug in the tutorial code: it should be x.T, the transpose.)
http://minpy.readthedocs.io/en/latest/get-started/logistic_regression

Recap - vectorizing logistic regression: many ML methods can be rewritten using nothing but vector-matrix operations; this is simpler (once you understand it well), faster (given the right infrastructure, e.g., numpy, GPUs, ...), and can simplify optimization.
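The minibatch sigmoid, the softmax-with-keepdims normalization, and the gradient step discussed above can be sketched in plain numpy (shapes are arbitrary; this is not the minpy tutorial's code):

```python
import numpy as np

rng = np.random.default_rng(4)
B, d, K = 5, 3, 4                        # minibatch size, features, classes

def logistic(z):
    # sigmoid 1 / (1 + exp(-z)), elementwise over the whole minibatch
    return 1.0 / (1.0 + np.exp(-z))

Xb = rng.normal(size=(B, d))             # one minibatch, one row per example
w = rng.normal(size=(d, 1))
p = logistic(Xb @ w)                     # B rows, 1 column
assert p.shape == (B, 1) and np.all((p > 0) & (p < 1))

# softmax: matrix-multiply, exponentiate component-wise, normalize each row
W = rng.normal(size=(d, K))
a = Xb @ W
e = np.exp(a)
prob = e / e.sum(axis=1, keepdims=True)  # keepdims makes the broadcast line up
assert np.allclose(prob.sum(axis=1), 1.0)  # each row sums to 1

# gradient step: X^T (prob - Y), the error per example and per class
Y = np.eye(K)[rng.integers(0, K, size=B)]  # one-hot labels
grad = Xb.T @ (prob - Y)                   # d x K, same shape as W
assert grad.shape == (d, K)
```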