Predictive modeling with social networks Jennifer Neville & Foster Provost Tutorial at ACM SIGKDD 2008 Neville & Provost 2001-2008 Tutorial AAAI 2008 Social network data is everywhere Call networks Email networks Movie networks Coauthor networks Affiliation networks Friendship networks Organizational networks http://images.businessweek.com/ss/06/09/ceo_socnet/source/1.htm Neville & Provost 2001-2008 eMarketer projects that worldwide online social network ad spending will grow from $1.2 billion in 2007 to $2.2 2008,2001-2008 82%. billion Neville in
& Provost Neville & Provost 2001-2008 Modeling network data Descriptive modeling Social network analysis Group/community detection Predictive modeling Link prediction Attribute prediction our focus today Neville & Provost 2001-2008 Goal of this tutorial Our goal is not to give a comprehensive overview of relational learning algorithms (but we provide a long list of references and resources) Our goal is to present the main ideas that differentiate predictive inference and learning in (social) network data, example techniques that embody these ideas, results, from real applications if possible, and references and resources where you can learn more In three hours (less the break) we cannot hope to be comprehensive in our coverage of theory, techniques, or applications. We will present the most important concepts, illustrate with example techniques and
applications, and provide a long list of additional resources. Neville & Provost 2001-2008 The problem: Attribute Prediction in Networked Data To start, well focus on the following inference problem: For any node i, categorical variable yi, and value c, estimate p(yi = c|K) K is everything known about the network ? Neville & Provost 2001-2008 Macskassy & Provost (JMLR 2007) provide a broad treatment for univariate networks Lets start with a real-world example Neville & Provost 2001-2008 Example: Network targeting (Hill et al. 06)
Define Network Targeting (NT) cross between viral marketing and traditional targeted marketing from simple to sophisticated construct variable(s) to represent whether the immediate network neighborhood contains existing customers add social-network variables to targeting models, etc. (well revisit) then: target individuals who are predicted (using the social network) to be the best prospects simplest: target network neighbors of existing customers this could expand virally through the network without any word-of-mouth advocacy, or could take advantage of it. Example application: Product: new communications service Firm with long experience with targeted marketing Sophisticated segmentation models based on data, experience, and intuition e.g., demographic, geographic, loyalty data e.g., intuition regarding the types of customers known or thought to have affinity for this type of service Neville & Provost 2001-2008 Sales rates are substantially higher for network neighbors (Hill et al. 06) Relative Sales Rates for Marketing Segments
4.82 (1.35%) 2.96 (0.83%) 1 0.4 (0.28%) Non-NN 1-21 Neville & Provost 2001-2008 (0.11%) NN 1-21 NN 22 NN not targeted Firms increasingly are collecting data on explicit social networks of consumers Neville & Provost 2001-2008
Other applications Fraud detection Targeted marketing Bibliometrics Firm/industry classification Web-page classification Epidemiology Movie industry predictions Personalization Patent analysis Law enforcement Counterterrorism Neville & Provost 2001-2008 Neville & Provost 2001-2008 Neville & Provost 2001-2008 Outline of the tutorial: part I The basics contemporary examples of social network inference in action whats different about network data?
basic analysis framework (simple) predictive inference with univariate networks disjoint inference network linkage can provide substantial power for inference, if techniques can take advantage of relational autocorrelation inductive inference (learning) in network data disjoint learning models learn correlation among attributes of labeled neighbors in the network Note on terminology: In this tutorial, we use the term inference to refer to the making of predictions for variables unknown values, typically using a model of some sort. We use learning to denote the building of the model from data (inductive inference). Generally we use the terminology common in statistical machine learning. Note on acronyms: see reference guide at end of tutorial Neville & Provost 2001-2008 Outline of the tutorial: part II Moving beyond the basics collective inference network structure alone can provide substantial power for inference, if techniques can propagate relational autocorrelation inferred covariates can influence each other collective learning learning using both the labeled and unlabeled parts of the network, requires collective inference
social/data network vs. network of statistical dependencies throughout: example learning techniques example inference techniques example applications Additional topics (time permitting) methodology, evaluation, potential pathologies, understanding sources of error, other issues Neville & Provost 2001-2008 So, whats different about networked data? Neville & Provost 2001-2008 Data graph Neville & Provost 2001-2008 Unique characteristics of networked data Single data graph Partially labeled Widely varying link structure Often heterogeneous object and
link types From learning perspective: graph contains both training data and application/testing data Attribute dependencies Homophily; autocorrelation among variables of linked entities Correlation among attributes of linked entities Correlations between attribute values and link structure Neville & Provost 2001-2008 Suggest key techniques: guilt-by-association relational learning collective inference Relational autocorrelation Correlation between the values of the same variable on related objects PR = {(v i ,v j ) : eik1 ,ek1 k2 ,...,ekl j E R } Related instance pairs: Dependence between pairs of values of X: (x i , x j ) s.t. (v i ,v j ) PR +
+ + + + + + + + +
+ + + + + + + + +
High autocorrelation Low autocorrelation Neville & Provost 2001-2008 How can we incorporate autocorrelation into predictive inference? Neville & Provost 2001-2008 Disjoint inference (no learning) Use links to labeled nodes (i.e., guilt by association) Neville & Provost 2001-2008 Is guilt-by-association justified theoretically? Thanks to (McPherson, et al., 2001) Birds of a feather, flock together attributed to Robert Burton (1577-1640)
(People) love those who are like themselves -- Aristotle, Rhetoric and Nichomachean Ethics Similarity begets friendship -- Plato, Phaedrus Hanging out with a bad crowd will get you into trouble -- Fosters Mom Neville & Provost 2001-2008 Is guilt-by-association justified theoretically? Homophily fundamental concept underlying social theories (e.g., Blau 1977) one of the first features noticed by analysts of social network structure antecedents to SNA research from 1920s (Freeman 1996) fundamental basis for links of many types in social networks
(McPherson, et al., Annu. Rev. Soc. 2001) Patterns of homophily: remarkably robust across widely varying types of relations tend to get stronger as more relationships exist Now being considered in mathematical analysis of networks (assortativity, e.g., Newman (2003)) Does it apply to non-social networks? Neville & Provost 2001-2008 Relational autocorrelation is ubiquitous Marketing Product/service adoption among communicating customers (Domingos & Richardson 01, Hill et al 06) Fraud detection Fraud status of cellular customers who call common numbers (Fawcett & Provost 97, Cortes et al 01) Fraud status of brokers who work at the same branch (Neville & Jensen 05) Movies
Box-office receipts of movies made by the same studio (Jensen & Neville 02) Web Topics of hyperlinked web pages (Chakrabarti et al 98, Taskar et al 02) Neville & Provost 2001-2008 Biology Functions of proteins located in together in cells (Neville & Jensen 02) Tuberculosis infection among people in close contact (Getoor et al 01) Business Industry categorization of corporations that share common boards members (Neville & Jensen 00) Industry categorization of corporations that co-occur in news stories (Bernstein et al 03) Citation analysis
Topics of coreferent scientific papers (Taskar et al 01, Neville & Jensen 03) 35 K News stories Neville & Provost 2001-2008 Disjoint inference Relative Sales Rates for Marketing Segments 4.82 (1.35%) 2.96 (0.83%) 1 0.4 (0.28%) Non-NN 1-21 (0.11%) NN 1-21
NN 22 NN not targeted (Hill et al. 06) Neville & Provost 2001-2008 Example models of network data No learning Disjoint inference Collective inference Basic NT, wvRN Gaussian random fields, wvRN Disjoint learning ACORA, RBC, RPT, SLR MLN, RBN, RDN,
RMN Collective learning RBN w/EM, PL-EM, RGP Neville & Provost 2001-2008 -- What if we add in learning? Neville & Provost 2001-2008 Traditional learning and classification Logistic regression Neural networks Nave Bayes Classification trees SVMs Non-relational classif. Methods:
yi yj xi xj home location, main calling location, min of use, NYC,NYC,4350,3,5,yes,no,1,0,0,1,0,2,3,0,1,1,0,0,0,.. NYC,BOS,1320,2,no,no,1,0,0,0,0,1,5,1,7,6,7,0,0,1, BOS,BOS,6543,5,no,no,0,1,1,1,0,0,0,0,0,0,4,3,0,4,.. ... Neville & Provost 2001-2008 Network learning and classification Structural logistic regression Relational nave Bayes Relational probability trees Relational SVMs Non-relational classif.
Methods: Network classification yi xi Relations yj xj home location, main calling location, min of use, NYC,NYC,4350,3,5,yes,no,1,0,0,1,0,2,3,0,1,1,0,0,0,.. NYC,BOS,1320,2,no,no,1,0,0,0,0,1,5,1,7,6,7,0,0,1, BOS,BOS,6543,5,no,no,0,1,1,1,0,0,0,0,0,0,4,3,0,4,.. ... Neville & Provost 2001-2008 Relational learning Learning where data cannot be represented as a single relation/table of independently distributed entities, without losing important information Data may be represented as: a multi-table relational database, or a heterogeneous, attributed graph, or a first-order logic knowledge base
There is a huge literature on relational learning and it would be impossible to do justice to it in the short amount of time we have For additional information, see: Pointers/bibliography on tutorial page International Conference on Inductive Logic Programming Cussens & Kerstings ICML04 tutorial: Probabilistic Logic Learning Getoors ICML06/ECML07 tutorials: Statistical Relational Learning Domingoss KDD07/ICML07 tutorials: Statistical Modeling of Relational Data Literature review in Macskassy & Provost JMLR07 Neville & Provost 2001-2008 Disjoint learning: part I Create (aggregate) features of labeled neighbors (Perlich & Provost KDD03) treat aggregation Neville & Provost 2001-2008 and relational learning feature construction Disjoint learning of relational models
+ Create aggregate features of relational information Consider local relational neighborhood around instances + + +
+ Actor Color Award Studio Rating Location Max Length Produc Exp
+ Drama 11 B&W Y CA R 96m 5yr + 42 Action Y Color NY R 25yr 163m
- Comedy 19 Color N PG-13 CA 107m 13yr Learn (adapted) flat model Y +
+ + N + + Actor Genre Count + Receipt Receipt s >$2mil >$2mil
Neville & Provost 2001-2008 Y N Y N Social network features can be created for flat models y = f (...xG ...) where xG is a (vector of) network-based feature(s) Example applications: Fraud detection construct variables representing connection to known fraudulent accounts (Fawcett & Provost 97) or the similarity of immediate network to known fraudulent accounts (Cortes et al. 01; Hill et al. 06b) Marketing (Hill et al. 06a) Creation of SN features can be (more or less) systematic: (Popescul & Ungar 03; Perlich & Provost 03,06; Karamon et al. 07,08; Gallagher & Eliassi-Rad 08)
Compare with developing kernels for relational data (e.g., Gartner03) Also: Ideas from hypertext classification extend to SN modeling: hypertext classification has text + graph structure construct variables representing (aggregations of) the classes of linked pages/ documents (Chakrabarti et al. 98; Lu & Getoor 03) formulate as regularization/kernel combination (e.g., Zhang et al. KDD06) General survey of web page classification: (Qi & Davison, 2008) Neville & Provost 2001-2008 Example Structural logistic regression Features Based on boolean first-order logic features used in inductive logic programming Top-down search of refinement graph Includes additional database aggregators that result in scalar values (e.g. count, max) Model Logistic regression Two-phase feature selection process with AIC/BIC
Neville & Provost 2001-2008 (Popescul et al. 03) Example Relational probability trees Features Uses set of aggregators to construct features (e.g., Size, Average, Count, Proportion) Exhaustive search within a user-defined space (e.g., <3 links away) Model Decision trees with probability estimates at leaves Pre-pruning based on chisquare feature scores Randomization tests for accurate feature selection (more on this later) Neville & Provost 2001-2008 (Neville et al. 03) Recall the network marketing
example Neville & Provost 2001-2008 Learning patterns of labeled nodes Features can be constructed that represent guilt of a nodes neighbors: y = f (...xG ...) where xG is a (vector of) network-based feature(s) Example application: Marketing (Hill et al. 06a) Neville & Provost 2001-2008 Network features that model known customers Attribute Description Degree Number of unique customers communicated with before the mailer # Transactions
Number of transactions to/from customers before the mailer Seconds of communication Number of seconds communicated customers before mailer Connected to influencer ? Is an influencer in your local neighborhood? Connected component s ize Size of the connected component target belongs to. Similarity (structural equivalence) Max overlap in local neighborhood with existing customer Neville & Provost 2001-2008
with Lift in sales with network-based features 1 0.8 0.6 0.4 All traditional variables 0.2 Cumulative % of Sales "All + net" traditional + network 0 0 0.2 0.4 0.6 0.8 1
Cumulative % of Consumers Targeted (Ranked by Predicted Sales) Neville & Provost 2001-2008 Similar results for predicting customer attrition/churn Thanks to KXEN see also&(Dasgupta et al. EDBT08) & (Birke 08) on social networks & telecom churn Neville Provost 2001-2008 Similar results for predicting customer attrition Thanks to KXEN Neville & Provost 2001-2008 Neville & Provost 2001-2008 Disjoint learning: part II Use node identifiers in features connections to specific individuals can be telling
Neville & Provost 2001-2008 Side note: not just for networked data IDs are important for any data in a multi-table RDB Neville & Provost 2001-2008 Towards a theory of aggregation (Perlich & Provost MLJ06): A (recursive) Bayesian perspective Traditional (nave) Bayesian Classification: P(c|X)=P(X|c)*P(c)/P(X) P(X|c)= P(x |c) i i P(xi|c) & P(c) Bayes Rule Assuming conditional independence Estimated from the training data Linked Data: xi might be an object identifier (e.g. SSN) => P(xi|c) cannot be estimated
Let be a set of k objects linked to xi => P(xi|c) ~ P(linked-to-i|c) P(i|c) ~ P(i|c) ~ P(O|c) O ( O Assume O is drawn independently P(o |c)) j Neville & Provost 2001-2008 j Assuming conditional independence How to incorporate identifiers of related objects (in a nutshell) 1. Estimate from known data:
class-conditional distributions of related identifiers (say D+ & D-) can be done, for example, assuming class-conditional independence in analogy to Nave Bayes save these as meta-data for use with particular cases 2. Any particular case C has its own distribution of related identifiers (say Dc) 3. Create features (Dc,D+ ), (Dc, D- ), ((Dc, D+ ) (Dc, D-)) where is a distance metric between distributions 4. Add these features to target-node description(s) for learning/estimation Main idea: Is the distribution of nodes to which this case is linked similar to that of a
A B CID Class CID id C1 0 C1 B C2 1 C2 A DClass 1
0.75 0.25 C3 1 C2 A DClass 0 0.2 0.8 C4 0 C2 B C3 A
C4 B C4 B C4 B C4 A 2: Case linkage distributions: Dc A B C1 0
1 C2 0.66 0.33 C3 1 0 C4 0.25 0.75 ? 4: Extended feature vector: CID L21 CID L2 ? 0
3: L2 distances for C1: C1 C1 1.125 0.08 L2(C1, DClass 1) = 1.125 L2(C1, DClass 0) = 0.08 C2 C2 0.014 0.435 0.421 1 1 C3 C3 0.125 1.28
1.155 1 1 C4 0.5 C4 0.005 Neville & Provost 2001-2008 ? L2 Class ...1- L20 Class -1.045 0 0 -0.495 0 0 A snippet from an actual network including bad guys nodes are people links are communications red nodes are fraudsters Dialed-digit detector (Fawcett & P., 1997) Communities of Interest (Cortes et al. 2001) these two bad guys are
well connected Neville & Provost 2001-2008 Classify buyers of most-common title from a Korean E-Book retailer Estimate whether or not customer will purchase E-Books the most-popular e-book: Accuracy=0.98 (AUC=0.96) 0.06 0.05 0.04 Class 1 0.03 Class 0 0.02 Conditional Prior 0.01 0 1 2 3
4 5 6 7 8 9 10 Class-conditional distributions across identifiers of 10 other popular books Neville & Provost 2001-2008 Watch for more results later Models of network data No learning Disjoint inference Collective inference
Basic NT, wvRN Gaussian random fields, wvRN Disjoint learning NT, ACORA, RBC, RPT, SLR MLN, RBN, RDN, RMN Collective learning RBN w/EM, PL-EM, RGP Neville & Provost 2001-2008 -- Another unique characteristic of networked data: one can perform collective inference Neville & Provost 2001-2008
Collective inference Use links among unlabeled nodes Neville & Provost 2001-2008 Collective inference models A particularly simple guiltby-association model is that a values probability is the average of its probabilities at the neighboring nodes 1 p( yi =c | N i ) = Z w i, j p( y j =c | N j ) v j N i ? Gaussian random field (Besag 1975; Zhu et al. 2003) Relational neighbor classifier - wvRN (Macskassy & P. 2003) Neville & Provost 2001-2008
Model partially-labeled network with a random field Treat network as a random field a probability measure over a set of random variables {X 1, , Xn} that gives non-zero probability to any configuration of values for all the variables. Convenient for modeling network data: A Markov random field satisfies p ( X i = xi X j = x j , i j ) = p ( X i = xi N i ) where Ni is the set of neighbors of Xi under some definition of neighbor. in other words, the probability of a variable taking on a value depends only on its neighbors probability of a configuration x of values for variables X the normalized product of the potentials of the states of the k maximal cliques in the network: 1 P( X = x) = k ( x(k ) ) Z k Neville & Provost 2001-2008 (Dobrushin, 1968; Besag, 1974; Geman and Geman, 1984) Markov random fields
Random fields have a long history for modeling regular grid data in statistical physics, spatial statistics, image analysis see Besag (1974) Besag (1975) applied such methods to what we would call networked data (non-lattice data) Some notable contemporary example applications: web-page classification (Chakrabarti et al. 1998) viral marketing (Domingos & Richardson 2001, R&D 2002) eBay auction fraud (Pandit et al. 2007) Neville & Provost 2001-2008 Collective inference cartoon relaxation labeling repeatedly estimate class distributions on all unknowns, based on current estimates b ? b ? b ? ? b
b ? b b b ? Neville & Provost 2001-2008 ? ? b ? ? b 1 p( yi =c | N i ) = Z
w i, j v j N i p( y j =c | N j ) Collective inference cartoon relaxation labeling repeatedly estimate class distributions on all unknowns, based on current estimates b ? b ? b ? ? b b ?
b b b ? Neville & Provost 2001-2008 ? ? b ? ? b 1 p( yi =c | N i ) = Z w i, j
v j N i p( y j =c | N j ) Collective inference cartoon relaxation labeling repeatedly estimate class distributions on all unknowns, based on current estimates b ? b ? b ? ? b b ? b b
b ? Neville & Provost 2001-2008 ? ? b ? ? b 1 p( yi =c | N i ) = Z w i, j v j N i p( y j =c | N j )
Collective inference cartoon relaxation labeling repeatedly estimate class distributions on all unknowns, based on current estimates b ? b ? b ? ? b b ? b b b ?
Neville & Provost 2001-2008 ? ? b ? ? b 1 p( yi =c | N i ) = Z w i, j v j N i p( y j =c | N j ) Collective inference cartoon relaxation labeling repeatedly estimate class distributions on all unknowns, based on current estimates
b b b b b b b b b b Neville & Provost 2001-2008 1 p( yi =c | N i ) = Z w i, j
v j N i p( y j =c | N j ) Various techniques for collective inference (see also Jensen et al. KDD04) MCMC, e.g., Gibbs sampling (Geman & Geman 1984) Iterative classification (Besag 1986; ) Relaxation labeling (Rosenfeld et al. 1976; ) Loopy belief propagation (Pearl 1988) Graph-cut methods (Greig et al. 1989; ) Either: estimate the maximum a posteriori joint probability distribution of all free parameters or estimate the marginal distributions of some or all free parameters simultaneously (or some related likelihood-based scoring) or
just perform a heuristic procedure to reach a consistent state. Neville & Provost 2001-2008 Models of network data No learning Disjoint inference Collective inference Basic NT, wvRN Gaussian random fields, wvRN Disjoint learning NT, ACORA, RBC, RPT, SLR MLN, RBN, RDN, RMN Collective learning
RBN w/EM, PL-EM, RGP Neville & Provost 2001-2008 -- Using the relational neighbor classifier and collective inference, we can ask: How much information is in the network structure alone? Neville & Provost 2001-2008 Network classification case study 12 data sets from 4 domains (previously used in ML research) IMDB (Internet Movie Database) (e.g., Jensen & Neville, 2002) Cora (e.g., Taskar et al., 2001) [McCallum et al., 2000] WebKB [Craven et al., 1998] CS Depts of Texas, Wisconsin, Washington, Cornell
multiclass & binary (student page) cocitation links Industry Classification [Bernstein et al., 2003] yahoo data, prnewswire data Homogeneous nodes & links one type, different classes/subtypes Univariate classification only information: structure of network and (some) class labels guilt-by-association (wvRN) with collective inference plus several models that learn relational patterns Neville & Provost 2001-2008 Macskassy, S. and F. P. "Classification in Networked Data: A toolkit and a univariate case study." Journal of Machine Learning Research 2007. Local models to use for collective inference (see Macskassy & Provost JMLR07) network-only Bayesian classifier nBC inspired by (Charabarti et al. 1998) multinomial nave Bayes on the neighboring class labels
network-only link-based classifier inspired by (Lu & Getoor 2003) logistic regression based on a nodes distribution of neighboring class labels, DN(vi) (multinomial over classes) relational-neighbor classifier (weighted voting) (Macskassy & Provost 2003, 2007) 1 p ( yi =c | N i ) = Z w i, j p ( y j =c | N j ) v j N i relational-neighbor classifier (class distribution) Inspired by (Perlich & Provost 2003) p ( yi =c | N i ) =sim( DN (vi ), Dist (c)) Neville & Provost 2001-2008 How much information is in the network structure? Data set
Accuracy Relative error reduction over default prediction wisconsin-student 0.94 86% texas-student 0.93 86% Cora 0.87 81% wisconsin-multi 0.82 67%
cornell-student 0.85 65% imdb 0.83 65% wash-student 0.85 58% wash-multi 0.71 52% texas-multi 0.74
50% industry-yahoo 0.64 49% cornell-multi 0.68 45% industry-pr 0.54 36% Neville & Provost 2001-2008 Labeling 90% of nodes Classifying remaining 10% Averaging over 10 runs Machine learning research papers (from CoRA dataset)
prob meth. (yellow) theory (green) genetic algs (red) rule learning (blue) neural nets (pink) RL (white) case-based (orange) Neville & Provost 2001-2008 RBN vs wvRN Neville & Provost 2001-2008 (Macskassy & Provost 07) Using identifiers Neville & Provost(compare: 2001-2008Hill (Perlich & Provost 06) & P. The Myth of the Double-Blind Review, 2003) A counter-terrorism application Poor concentration for primary-data only (iteration 0) High concentration after one secondary-access phase (iteration 1) rightmost
people are completely unknown, therefore ranking is uniform (Macskassy & P., Intl. Conf. on Intel. Analysis 2005) most suspicious high concentration of bad guys at top of suspicion ranking gets better with increased secondary-data access Neville & Provost 2001-2008 5046 is moderately noisy: of known bad guys were mislabeled Characteristics of network data Single data graph Partially labeled Widely varying link structure Often heterogeneous object and link types Attribute dependencies Homophily, autocorrelation among class labels
Correlation among attributes of related entities Correlations between attribute values and link structure Neville & Provost 2001-2008 Networks graphs? Networked data can be much more complex than just sets of (labeled) vertices and edges. Vertices and edges can be heterogeneous Vertices and edges can have various attribute information associated with them Various methods for learning models that take advantage of both autocorrelation and attribute dependencies Probabilistic relational models (RBNs, RMNs, AMNs, RDNs, ) Probabilistic logic models (BLPs, MLNs, ) Neville & Provost 2001-2008 Models of network data No learning Disjoint inference
Collective inference wvRN Gaussian random fields, MRFs, wvRN Disjoint learning ACORA, RBC, RPT, SLR MLN, RBN, RDN, RMN Collective learning RBN w/EM, PL-EM, RGP Neville & Provost 2001-2008 -- Disjoint learning: part III Assume training data are fully labeled, model dependencies among linked entities
Neville & Provost 2001-2008 Relational learning Lets consider briefly three approaches Model with inductive logic programming (ILP) Model with probabilistic relational model (graphical model+RDB) Model with probabilistic logic model (ILP+probabilities) Neville & Provost 2001-2008 First-order logic modeling The field of Inductive Logic Programming has extensively studied modeling data in first-order logic Although it has been changing, traditionally ILP did not focus on representing uncertainty in the usual use of first-order logic, each ground atom either is true or is not true (cf., a Herbrand interpretation) one of the reasons for the modern rubric statistical relational learning First-order logic for statistical modeling of network data? a strength is its ability to represent and facilitate the search for complex and deep patterns in the network a weakness is its relative lack of support for aggregations across nodes (beyond existence) more on this in a minute
Neville & Provost 2001-2008 Network data in first-order logic broker(Amit), broker(Bill), broker(Candice), works_for(Amit, Bigbank), works_for(Bill, E_broker), works_for(Candice, Bigbank), married(Candice, Bill) smokes(Amit), smokes(Candice), works_for(X,F) & works_for(Y,F) -> coworkers(X,Y) smokes(X) & smokes(Y) & coworkers(X,Y) -> friends(X,Y) Bill coworkers Amit Candice friends Neville & Provost 2001-2008 married Whats the problem with using FOL for our task? Probabilistic graphical models
Probabilistic graphical models (PGMs) are convenient methods for representing probability distributions across a set of variables. Bayesian networks (BNs), Markov networks (MNs), Dependency networks (DNs) See Pearl (1988), Heckerman et al. (2000) Typically BNs, MNs, DNs are used to represent a set of random variables describing independent instances. For example, the probabilistic dependencies among the descriptive features of a consumerthe same for different consumers Neville & Provost 2001-2008 Example A Bayesian network modeling consumer reaction to new service Technical sophistication Positive reaction before trying service lead user characteristics income
Quality sensitivity Amount of use Neville & Provost 2001-2008 Positive reaction after trying service Probabilistic relational models The term relational recently has been used to distinguish the use of probabilistic graphical models to represent variables across a set of dependent, multivariate instances. These methods model the full joint distribution over the attribute values in a network using a probabilistic graphical model (e.g., BN, MN) For example, the dependencies between the descriptive features of friends in a social network We saw a relational Markov network earlier when we discussed Markov random fields for univariate network data although the usage is not consistent, Markov random field often is used for a MN over multiple instances of the same variable In these probabilistic relational models, there are dependencies within instances and dependencies among instances Key ideas: Learn from a single network by tying parameters across instances of same type
Use aggregations to deal with heterogeneous network structure Neville & Provost 2001-2008 Modeling the joint network distribution Relational Bayesian networks Extend Bayes nets to network settings (Friedman et al. 99, Getoor et al. 01) Efficient closed form parameter estimation, but acyclicity constraint limits representation of autocorrelation dependencies and makes application of guilt-by-association techniques difficult Relational Markov networks Extension of Markov networks (Taskar et al 02) No acyclicity constraint but feature selection is computationally intensive because parameter estimation requires approximate inference Associative Markov networks are a restricted version designed for guilt-byassociation settings, for which there are efficient inference algorithms (Taskar et al. 04) Relational dependency networks Extension of dependency networks (Neville & Jensen 04) No acyclicity constraint, efficient feature selection, but model is an approximation of the full joint and accuracy depends on size of training set Neville & Provost 2001-2008 Example: Can we estimate the likelihood that a stock broker is/will be engaged in activity that violates securities regulations?
Neville & Provost 2001-2008 Neville & Provost 2001-2008 Detecting bad brokers for (Neville et al. KDD05) NASD (now FINRA) is the largest private-sector securities regulator + NASDs mission includes preventing and discovering misconduct among brokers (e.g., fraud) Current approach: Handcrafted rules that target brokers with a history of misconduct (HRB) Task: Use relational learning techniques to automatically identify brokers likely to engage in misconduct based on network patterns +
+ + +
+ + + + + Disclosure Branch Neville & Provost 2001-2008
+ Broker Bad* Broker *Bad = having violated securities regulations Neville & Provost 2001-2008 RPT identified additional brokers to target (Neville et al. KDD05) One broker I was highly confident in ranking as 5 Not only did I have the pleasure of meeting him at a shady warehouse location, I also negotiated his bar from the industry... This person actually used investors' funds to pay for personal expenses including his trip to attend a NASD compliance conference! If the model predicted this person, it would be right on target. Informal examiner feedback Neville & Provost 2001-2008
Relational dependency networks (Neville & Jensen JMLR07) + + + + CoWorker
Count(IsFraud)>1 CoWorker Count(IsFraud)>1 CoWorker Count(IsFraud)>1 CoWorker Disclosure + + + + +
Learn statistical dependencies among variables + Count(IsFraud)>3 Count(Yr<1995)>3 CoWorker Disclosure Count(IsFraud)>3 Count(Yr<1995)>3 CoWorker Disclosure Count(IsFraud)>3 Count(Yr<1995)>3 Disclosure Disclosure CoWorker Count(Yr<2000)>0 Avg(Yr)>1997 Count(IsFraud)>0 Disclosure Disclosure CoWorker Count(Yr<2000)>0
Count(IsFraud)>0 Avg(Yr)>1997 Disclosure Disclosure CoWorker Count(Yr<2000)>0 Avg(Yr)>1997 Count(IsFraud)>0 Disclosure Max(Yr)>1996 Disclosure Max(Yr)>1996 Disclosure Max(Yr)>1996 Construct local dependenc y network Branch 1 Region 1 Area 1 Broker1
Is Frau d1 Broker2 Has Business1 Is Frau d2 Has Business2 On Watch1 Broker Broker3 Is Frau d3 On
Watch2 On Watch3 Disclosure1 Disclosure2 Disclosure 3 Year1 Year2 Year3 Type Type Type 1 2 3
Disclosure Branch Has Business3 Neville & Provost 2001-2008 Unroll over particular data network for (collective) inference Has Business Year Region On Watch Type Area Is
Fraud Data on brokers, branches, disclosures Broker Disclosure Branch Has Business Year Region On Watch Type Area Is Fraud Neville & Provost 2001-2008
Learned RDN for broker variables (Neville & Jensen JMLR07) Broker Disclosure Branch Has Business Year Region On Watch Type Area Is Fraud note: needs to be unrolled across network Neville & Provost 2001-2008
Important concept! The network of statistical dependencies does not necessarily correspond to the data network Example on next three slides Neville & Provost 2001-2008 Recall: broker dependency network Broker Disclosure Branch Has Business Year Region On Watch Type Area
Is Fraud note: this dependency network needs to be unrolled across the data network Neville & Provost 2001-2008 Broker data network Disclosure Statistical dependencies between brokers jump across branches; similarly for disclosures Broker + Bad* Broker Branch + +
+ + +
+ + Neville & Provost 2001-2008 *Bad = having violated securities regulations + + + Model unrolled on (tiny) data network
(three brokers, one branch) Branch1 Region1 Area1 Broker1 Broker2 Has Business1 Is Fraud1 Has Business2 Is Fraud2 On Watch1 Neville & Provost 2001-2008 Broker3
Has Business3 Is Fraud3 On Watch2 On Watch3 Disclosure1 Disclosure2 Disclosure3 Year1 Year2 Year3 Type1 Type2 Type3
Putting it all together: Relational dependency networks + + + + CoWorker
Count(IsFraud)>1 CoWorker Count(IsFraud)>1 CoWorker Count(IsFraud)>1 CoWorker Disclosure + + + + +
Learn statistical dependencies among variables + Count(IsFraud)>3 Count(Yr<1995)>3 CoWorker Disclosure Count(IsFraud)>3 Count(Yr<1995)>3 CoWorker Disclosure Count(IsFraud)>3 Count(Yr<1995)>3 Disclosure Disclosure CoWorker Count(Yr<2000)>0 Avg(Yr)>1997 Count(IsFraud)>0 Disclosure Disclosure CoWorker Count(Yr<2000)>0
Count(IsFraud)>0 Avg(Yr)>1997 Disclosure Disclosure CoWorker Count(Yr<2000)>0 Avg(Yr)>1997 Count(IsFraud)>0 Disclosure Max(Yr)>1996 Disclosure Max(Yr)>1996 Disclosure Max(Yr)>1996 Construct local dependenc y network Branch 1 Region 1 Area 1 Broker1
Is Frau d1 Broker2 Has Business1 Is Frau d2 Has Business2 On Watch1 Broker Broker3 Is Frau d3 On
Watch2 On Watch3 Disclosure1 Disclosure2 Disclosure 3 Year1 Year2 Year3 Type Type Type 1 2 3
Disclosure Branch Has Business3 Neville & Provost 2001-2008 Unroll over particular data network for (collective) inference Has Business Year Region On Watch Type Area Is
Fraud Combining first-order logic and probabilistic graphical models Recently there have been efforts to combine FOL and probabilistic graphical models e.g., Bayesian logic programs (Kersting and de Raedt 01), Markov logic networks (Richardson & Domingos MLJ06) and see discussion & citations in (Richardson & Domingos 06) For example: Markov logic networks A template for constructing Markov networks Therefore, a model of the joint distribution over a set of variables A first-order knowledge base with a weight for each formula Advantages: Markov network gives sound probabilistic foundation First-order logic allows compact representation of large networks and a wide variety of domain background knowledge Neville & Provost 2001-2008 Markov logic networks (Richardson & Domingos MLJ06) A Markov Logic Network (MLN) is a set of pairs (F, w): F is a formula in FOL
w is a real number Together with a finite set of constants, it defines a Markov network with: One node for each grounding of each predicate in the MLN One feature for each grounding of each formula F in the MLN, with its corresponding weight w Neville & Provost 2001-2008 1.5 x Smokes(x) Cancer(x) 1.1 x, y Friends(x, y) (Smokes(x) Smokes(y)) See Domingos KDD07 tutorial Statistical Modeling of Relational Data for more details
MLN details Two constants: Anna (A) and Bob (B) Friends(A,A) 1.1 Friends(A,B) 1.1 Smokes(A) 1.5 Smokes(B) 1.1 Cancer(A) 1.1 Friends(B,B) 1.5 Cancer(B) Friends(B,A)
1.5 x Smokes(x) Cancer(x) 1.1 x, y Friends(x, y) ( Smokes(x) Smokes(y)) Neville & Provost 2001-2008 1 P ( x) = exp wi ni ( x) Z i wi: weight of formula i ni(x): # true groundings of formula i in x Why learning collective models improves classification (Jensen et al. KDD04) Why learn a joint model of class labels? Could use correlation between class labels and observed attributes on related instances instead But modeling correlation
among unobserved class labels is a low-variance way of reducing model bias Collective inference achieves a large decrease in bias at the cost of a minimal increase in variance Neville & Provost 2001-2008 Comparing collective inference models (Xiang & Neville SNA-KDD08) Learning helps when autocorrelation is low and there are other attributes dependencies Learning helps when linkage is low and labeling is plentiful Neville & Provost 2001-2008 Global vs. local autocorrelation MLN/RDN/RMN: exploit global autocorrelation learning implicitly assumes training and test set are disjoint assumes autocorrelation is stationary throughout graph
ACORA with identifiers (Perlich & Provost MLJ06) exploits local autocorrelation relies on overlap between training and test sets need sufficient data locally to estimate What about a combination of the two? (open question) Neville & Provost 2001-2008 Autocorrelation is non-stationary Cora: topics in coauthor graph Neville & Provost 2001-2008 IMDb: receipts in codirector graph Shrinkage models Neville & Provost 2001-2008 (Angin & Neville SNA-KDD 08)
Recall our network-based marketing example? collective inference can help for the nodes that are not neighbors of existing customers identify areas of the social network that are dense with customers Neville & Provost 2001-2008 For targeting consumers, collective inference gives additional improvement, especially for non-network neighbors (Hill et al. 07) Predictive Performance (Area under ROC curve/ Mann-Whitney Wilcoxon stat) Model (network only) NN non-NN All first-order network variables 0.61 0.71 All first-order + oracle (wvRN)
0.63 0.74 All first-order + collective inference* (wvRN) 0.63 0.75 Predictive Performance (Area under ROC curve/ Mann-Whitney Wilcoxon stat) Model (with traditional variables) NN non-NN All traditional variables 0.68 0.72 All trad + local network variables 0.69
0.73 All trad + local network + collective inference* 0.72 0.77 * with network sampling and pruning Neville & Provost 2001-2008 Models of network data No learning Disjoint inference Collective inference wvRN Gaussian random fields, wvRN Disjoint learning ACORA, RBC,
RPT, SLR MLN, RBN, RDN, RMN Collective learning RBN w/EM, PL-EM, RGP Neville & Provost 2001-2008 -- Collective learning Consider links among unlabeled entities during learning Neville & Provost 2001-2008 Semi-supervised learning To date network modeling techniques have focused on 1. exploiting links among unlabeled entities (i.e., collective inference) 2. exploiting links between unlabeled and labeled for inference (e.g., identifiers) Can we take into account links between unlabeled and labeled during learning?
Large body of related work on semi-supervised and transductive learning but this has dealt primarily with i.i.d. data Exceptions: PRMs w/EM (Taskar et al. 01) Relational Gaussian Processes (Chu et al. 06) Open: there has been no systematic comparison of models using different representations for learning and inference Neville & Provost 2001-2008 Recall: RBN vs wvRN RBN+EM Neville & Provost 2001-2008 Pseudolikelihood-EM (Xiang & Neville KDD-SNA 08) General approach to learning arbitrary autocorrelation dependencies in within-network domains Combines RDN pseudolikelihood approach with mean-field approximate inference to learn a joint model of labeled and unlabeled instances Algorithm
1. Learn an initial disjoint local classifier (with pseudolikelihood estimation) using only labeled instances 2. For each EM iteration: E-step: apply current local classifier to unlabeled data with collective inference, use current expected values for neighboring labels; obtain new probability estimates for unlabeled instances; M-step: re-train local classifier with updated label probabilities on unlabeled instances. Neville & Provost 2001-2008 Comparison with other network models Collective learning improves performance when: (1) labeling is moderate, or (2) when labels are clustered in the network Neville & Provost 2001-2008 Or when Learning helps when autocorrelation is low and there are other attributes dependencies Learning helps when linkage is low and labeling is plentiful
Neville & Provost 2001-2008 Models of network data No learning Disjoint inference Collective inference wvRN Gaussian random fields, wvRN Disjoint learning ACORA, RBC, RPT, SLR MLN, RBN, RDN, RMN Collective learning RBN w/EM,
PL-EM, RGP Neville & Provost 2001-2008 -- Collective learning, disjoint inference Use unlabeled data for learning, but not for inference Open: No current methods do this However, disjoint inference is much more efficient May want to use unlabeled data to learn disjoint models (e.g., infer more labels to improve use of identifiers) Neville & Provost 2001-2008 Recap Disjoint inference Collective inference No learning Baseline model Very efficient Accurate when autocorrelation is high and labels are randomly distributed in data
Disjoint learning Efficiency depends on model Can exploit identifiers and other attr/link dependencies in data Efficiency depends on model Can exploit both attribute and autocorrelation dependencies to move beyond simple CI models Collective -learning Neville & Provost 2001-2008 Least efficient More accurate when labeling is moderate, or when labels are clustered in the network Potential pathologies Statistical tests assume i.i.d data Networks have a combination of widely varying linkage and autocorrelation
which can complicate application of conventional statistical tests Nave hypothesis testing can bias feature selection (Jensen & Neville ICML02, Jensen et al. ICML03) Nave sampling methods can bias evaluation (Jensen & Neville ILP03) Neville & Provost 2001-2008 Bias in feature selection (Jensen & Neville ICML02) Relational classifiers can be biased toward features on some classes of objects (e.g., movie studios) How? Autocorrelation and linkage reduce effective sample size Lower effective sample size increases variance of estimated feature scores Higher variance increases likelihood that features will be picked by chance alone Can also affect ordering among features deemed significant because impact varies among features (based on linkage) Neville & Provost 2001-2008
Adjusting for bias: Randomization tests Randomization tests result in significantly smaller models (Neville et al KDD03) Attribute values are randomized prior to feature score calculation Empirical sampling distribution approximates the distribution expected under the null hypothesis, given the linkage and autocorrelation Neville & Provost 2001-2008 Metholodogy Within-network classification naturally implies dependent training and test sets How to evaluate models? Macskassy & Provost (JMLR07) randomly choose labeled sets of varying proportions (e.g., 10%. 20%) and then test on remaining unlabeled nodes Xiang & Neville (KDD-SNA08) choose nodes to label in various ways (e.g., random, degree, subgraph) See (Gallagher & Eliassi-Rad 2007) for further discussion
How to accurately assess performance variance? (Open question) Repeat multiple times to simulate independent trials, but Repeated training and test sets are dependent, which means that variance estimates could be biased (Dietterich 98) Graph structure is constant, which means performance estimates may not apply to different networks Neville & Provost 2001-2008 Understanding model performance (Neville & Jensen MLJ08) Collective inference is a new source of model error Potential sources of error: Approximate inference techniques Availability of test set information Location of test set information Need a framework to analyze model systems Bias/variance analysis for collective inference models Can differentiate errors due to learning and inference processes Neville & Provost 2001-2008 Conventional bias/variance analysis variance
Y* bias _ Y E D [Lsq (t, y)] = E D [(t E D [t]) 2 ] + (E D [t] E D [y]) 2 + E D [(E D [y] y) 2 ] noise Neville & Provost 2001-2008 bias variance Conventional bias/variance analysis M1 M2 Model predictions Test Set Training Set
M3 Samples Neville & Provost 2001-2008 Models Bias/variance decomposition for relational data learning inference bias bias E LI [Lsq (t, y)] = E L [(t E L [t]) 2 ] noise Expectati on over learning and inference Y* _ YL _ YLI
+(E L [t] E L [y]) 2 + E L [(E L [y] y) 2 ] learning bias learning variance +(E L [y] E LI [y]) 2 + E LI [(E L [y] y) 2 ] E L [(E LI [y] y) 2 ] inference variance inference bias +2(E L [y] E L [t])(E LI [y] E L [y]) bias interaction term Neville & Provost 2001-2008 Relational bias/variance analysis: part I + + + + + +
+ + + + + Training Set
+ + + + + M1 + M2 + + +
M3 Model predictions (learning distribution) Individual inference* on test set Samples Learn models from samples Neville & Provost 2001-2008 * Inference uses optimal probabili for neighboring nodes class lab Relational bias/variance analysis: part II
+ + + + M1 + + + + +
+ + + + Training Set
+ + + + M2 + + Model predictions
+ M3 Samples Learn models from samples Collective inference on test set Neville & Provost 2001-2008 (total distribution)
Analysis shows that models exhibit different errors RMNs have high inference bias Loss Neville & Provost 2001-2008 Bias RDNs have high inference variance Variance Some other issues: part I Propagating label information farther in the network Leverage other features (Gallagher & Eliassi-Rad SNA-KDD08) Create ghost edges (Gallagher et al. KDD08) Create similarity edges from other features (Macskassy AAAI07) Leverage graph similarity of nodes (Fouss et al. TKDE07) Latent group models (Neville & Jensen ICDM05)
Do we know anything about the dynamics of label propagation? e.g., do true labels propagate faster than false ones? see (Galstyan & Cohen 05a,05b,06,07) What if labeling nodes is costly? Choose nodes that will improve collective inference (Rattigan et al. 07, Bilgic & Getoor KDD 08) What if acquiring link data is costly? Acquire link data actively (Macskassy & Provost IA 05) Neville & Provost 2001-2008 Some other issues: part II What links you use makes a big difference Automatic link selection (Macskassy & Provost JMLR 07) Augment data graph w/2-hop paths (Gallagher et al. KDD 08) How does propagating information with collective inference relate to using identifiers? open question Can we identify the (causal) reason for the observed network correlation? Reasons might be: Homophily: similar nodes link together Social influence: linked nodes change attributes to similar values External factor: causes both link existence and attribute similarity Manski 93; Hill et al. 06; Bramoulle 07; Burk et al. 07; Ostreicher-Singer &
Sundararajan 08; Anagnostopoulos et al. KDD08 Neville & Provost 2001-2008 Some other issues: part III Computation and storage requirements can be prohibitive for data on real social networks -- how can we deal with massive (real) social networks? ignore most of the network (traditional method) use simple models/techniques! (e.g., Hill et al. 2007) reduce size of network via sampling/pruning of links and/or nodes, hopefully without reducing accuracy (much) (e.g., Cortes et al. 01; Singh et al. 05; Hill et al. 06b; Zheng et al. 07) What are the effects of partial network data collection? one may not have access to or complete control over collection of nodes and/or links different sampling/pruning methods may induce different effects (e.g., Stumpf et al. '05, Lee et al. '06, Handcock and Gile '02, Borgatti et al. '05) can we improve accuracy by sampling/pruning? irrelevant links/nodes can interfere with modeling (Hill et al. 2007) Neville & Provost 2001-2008 Some other issues: part IV How to model networks changing over time? Summarize dynamic graph w/kernel smoothing (Cortes et al. 01, Sharan & Neville SNA-KDD07) Sequential relational Markov models (Geustrin at al. IJCAI03, Guo et al. ICML07, Burk et al. 07)
How to jointly model attributes and link structure? RBNs with link uncertainty (Getoor et al. JMLR03) Model underlying group structure with both links and attributes (Kubica et al. AAAI02, McCallum et al. IJCAI05, Neville & Jensen ICDM05) Neville & Provost 2001-2008 Conclusions: part I 1. Social network data often exhibit autocorrelation, which can provide considerable leverage for inference 2. Labeled entities link to unlabeled entities Disjoint inference allows direct guilt-by-association Disjoint learning can use correlations among attributes of related entities to improve accuracy 3. Unlabeled entities link among themselves Inferences about entities can affect each other (e.g., indirect gba) Collective inference can improve accuracy Results show that there is a lot of power for prediction just in the network structure
Collective learning can improve accuracy for datasets with a moderate number of labels or when labels are clustered in the graph Neville & Provost 2001-2008 Conclusions: part II 5. The social network can be used to create variables that can be used in traditional (flat) modeling 6. More sophisticated learning techniques exploit networks correlation in alternative ways Node identifiers capture 2-hop autocorrelation patterns and linkage similarity Models of the joint network distribution identify global attribute dependencies These models can learn autocorrelation dependencies Shrinkage models can combine both local and global estimates 7. Learning and evaluating network models is difficult due to complex network structure and attribute interdependencies 8. There are many important methodological issues and open questions Neville & Provost 2001-2008
By this point, hopefully, you are familiar with: 1. a wide-range of potential applications for network mining 2. different approaches to network learning and inference 3. 4. 5. 6. from simple to complex Various issues involved with each approach when each approach is likely to perform well potential difficulties for learning accurate network models various methodological issues associated with analyzing network models Neville & Provost 2001-2008 Related network-analysis topics Identifying groups in social networks Predicting links Entity resolution Finding (sub)graph patterns Generative graph models Social network analysis (SNA)
see KDD08 tutorial: Graph Mining and Graph Kernels see KDD08 workshop: Social Network Mining & Analysis Preserving the privacy of social networks and SNA Please see tutorial webpage for slides and additional pointers: http://www.cs.purdue.edu/~neville/courses/kdd08-tutorial.html Neville & Provost 2001-2008 Thanks to Pelin Angin Avi Bernstein Scott Clearwater Lisa Friedland Brian Gallagher Henry Goldberg Michael Hay Shawndra Hill David Jensen John Komoroske Neville & Provost 2001-2008 Kelly Palmer Matthew Rattigan Ozgur Simsek Sofus Macskassy Andrew McCallum
Claudia Perlich Ben Taskar Chris Volinsky Rongjing Xiang Ron Zheng http://pages.stern.nyu.edu/~fprovost/ http://www.cs.purdue.edu/~neville foster provost jennifer neville Neville & Provost 2001-2008 Fun: Mining Facebook data Birthday School year, Status (yawn?) Finance Conservative Economics Moderate Premed Moderate Politics Moderate, Liberal or Very_Liberal Theatre Very_Liberal Random_play Apathetic Marketing Finance Premed Psychology Politics Economics Finance Interested_in_Women Communications Interested_in_Men Drama Like_Harry_Potter Dating A_Relationship, Interested_in_Men Dating A_Relationship, Interested_in_Women Interested_in_Men&Women Very_Liberal
Neville & Provost 2001-2008 (associations) Acronym guide ACORA: Automatic construction of relational attributes (Perlich & Provost KDD'03) AMN: Associative Markov network (Taskar ICML'04) BN: Bayesian network BLP: Bayesian logic program (Kersting & de Raedt '01) DN: Dependency network (Heckerman et al. JMLR'00) EM: Expectation maximization GRF: Gaussian random field (Zhu et al. ICML'03) ILP: Inductive logic programming MLN: Markov logic network (Richardson & Domingos MLJ'06) MN: Markov network NT: Network targeting (Hill et al.'06) PGM: Probabilistic graphical models Neville & Provost 2001-2008 PL: Pseudolikelihood RBC: Relational Bayes classifier (Neville et al. ICDM'03)
RBN: Relational Bayesian network (aka probabilistic relational models) (Friedman et al. IJCAI'99) RDB: Relational database RDN: Relational dependency network (Neville & Jensen ICDM'04) RGP: Relational Gaussian process (Chu et al. NIPS'06) RMN: Relational Markov network (Taskar et al. UAI'02) RPT: Relational probability trees (Neville et al. KDD'03) SLR: Structural logistic regression (Popescul et al. ICDM'03) wvRN: Weighted-vector relational neighbor (Macskassy & Provost JMLR'07)