Instructor : Saeed Shiry Course Overview Statistical learning theory Statistical Learning Theory The Nature of Statistical Learning Theory By: Vladimir N. Vapnik Advances in Learning Theory

Advances in Learning Theory: Methods, Models and Applications Statistical learning theory Statistical learning theory was introduced in the late 1960's. Until the 1990's it was a purely theoretical analysis of the problem of function estimation from a given collection of data. In the middle of the 1990's new types of learning algorithms (called support vector machines) based on the developed theory were proposed. This made statistical learning theory not only a tool for the theoretical analysis but also a tool for creating practical

algorithms for estimating multidimensional functions. A basic question What must one know a priori about an unknown functional dependency in order to estimate it on the basis of observations? New function approximation methods New function estimation methods have been created where a high dimensionality of the unknown function does not always require a large number of observations in order to obtain a good estimate.

The new methods control generalization using capacity factors that do not necessarily depend on dimensionality of the space. Learning and Statistics The problem of learning is so general that almost any question that has been discussed in statistical science has its analog in learning theory. Furthermore, some very important general results were first found in the framework of learning theory and then reformulated in the terms of statistics. In particular, learning theory for the first time stressed the problem of small sample statistics. It was shown that by taking into account the size of the sample one can obtain better solutions to many problems of function estimation than by using the methods based on

classical statistical techniques. Theoretical needs Concepts describing the necessary and sufficient conditions for consistency of inference. Bounds describing the generalization ability of learning machines based on these concepts. Inductive inference for small sample sizes, based on these bounds. Methods for implementing this new type of inference. Goals of this book A famous sentence in machine learning;

Complex theories do not work, simple algorithms do One of the goals of this book is to show that, at least in the problems of statistical inference, this is not true. Instead: Nothing is more practical than a good theory Four Periods in the Research of the Learning Problem (i)

Constructing the first learning machines, (ii) constructing the fundamentals of the theory, (iii) constructing neural networks, (iv) constructing the alternatives to neural networks. The philosophy of applied analysis of the learning process The philosophy of applied analysis of the learning process can be described as follows:

To get a good generalization it is sufficient to choose the coefficients of the neuron that provide the minimal number of training errors. The principle of minimizing the number of training errors is a self-evident inductive principle, and from the practical point of view does not need justification. The main goal of applied analysis is to find methods for constructing tile coefficients simultaneously for all neurons such that the separating surface provides the minimal number of errors on the training data. The philosophy of theoretical analysis of learning processes The philosophy of theoretical analysis of learning processes is different:

The principle of minimizing the number of training errors is not self-evident and needs to be justified. It is possible that there exists another inductive principle that provides a better level of generalization ability. The main goal of theoretical analysis of learning processes is to find the inductive principle with the highest level of generalization ability and to construct algorithms that realize this inductive principle. This book shows that indeed the principle of minimizing the number of training errors is not self-evident and that there exists another more intelligent inductive principle that provides a better level of generalization ability.

Theory of the Empirical Risk Minimization Principle As early as 1968, a philosophy of statistical learning theory had been developed. The essential concepts of the emerging theory, VC entropy and VC dimension, functions (i.e., for the pattern recognition problem). Using these concepts, the law of large numbers in functional space (necessary and sufficient conditions for uniform convergence of the frequencies to their probabilities) was found, its relation to learning processes was described, and the main

nonasymptotic bounds for the rate of convergence were obtained The obtained bounds made the introduction of a novel inductive principle possible (structural risk minimization inductive principle, 1974), completing the development of pattern recognition learning theory. Theory of Solving Ill-Posed Problems In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations (finding f F that satisfies the equality), is ill-posed; even if there exists a unique solution to this equation, a small deviation on the right-hand side of this equation (F instead of F,

where ||F- F ||< is arbitrarily small) can cause large deviations in the solutions (it can happen that ||f -f||< is large). In this case if the right-hand side F of the equation is not exact (e.g., it equals F , where F differs from F by some level of noise), the functions f that minimize the function do not guarantee a good approximation to the desired solution even if tends to zero. real-life problems were found to be ill-posed Hadamard

thought that ill-posed problems are a pure mathematical phenomenon and that all real-life problems are "well-posed. However, in the second half of the century a number of very important real-life problems were found to be ill-posed. it is important that one of main problems of statistics, estimating the density function from the data, is ill-posed. Regularization theory

Regularization theory was one of the first signs of the existence of intelligent inference: In the middle of the 1960s it was discovered that if instead of the functional R(f) one minimizes another so-called regularized functional where (f) is some functional (that belongs to a special type of functionals) and () is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as tends to zero Nonparametric Methods of Density Estimation

the problem of density estimation from a rather wide set of densities is ill-posed. Estimating densities from some narrow set of densities (say from a set of densities determined by a finite number of parameters, i.e., from a so-called parametric set of densities) was the subject of the classical paradigm, where a "self-evident" type of inference (the maximum likelihood method) was used. To estimate a density from the wide (nonparametric) set requires a new type of inference that contains regularization techniques. Nonparametric methods of density estimation gave rise to statistical algorithms that overcame the shortcomings of the classical paradigm. Now one could estimate functions from a wide set of functions. One has to note, however, that these methods are intended for

estimating a function using large sample sizes. The Idea of Algorithmic Complexity 1. 2. Two fundamental questions that at first glance look different inspired this idea: What is the nature of inductive inference (Solomonoff)? What is the nature of randomness (Kolmogorov), (Chaitin)? The answers to these questions proposed by Solomonoff, Kolmogorov, and Chaitin started the

information theory approach to the problem of inference. randomness concept The idea of the randomness concept can be roughly described as follows: A rather large string of data forms a random string if there are no algorithms whose complexity is much less than l, the length of the string, that can generate this string. The complexity of an algorithm is described by the length of the smallest program that embodies that algorithm.

It was proved that if the description of the string cannot be compressed using computers, then the string possesses all properties of a random sequence. This implies the idea that if one can significantly compress the description of the given string, then the algorithm used describes intrinsic properties of the data. In the 1970s, on the basis of these ideas, Rissanen suggested the minimum description length (MDL) inductive inference for learning problems All these new ideas are still being developed. However, they have shifted the main understanding as to what can be done in the problem of dependency estimation on the basis of a limited amount of empirical data. Chapter 1 Setting of the Learning Problem We consider the learning problem as a problem of finding a desired dependence using a limited number of observations.

FUNCTION ESTIMATION MODEL The model of learning from examples can be described using three components 1. 2. 3. A generator (G) of random vectors x Rn , drawn independently from a fixed but unknown probability distribution function F(x). A supervisor (S) who returns an output value y to every input vector x, according to a conditional distribution function P(y|x),

also fixed but unknown. A learning machine (LM) capable of implementing a set of functions f(x,), A, where A is a set of parameters. The problem of learning is that of choosing from the given set of functions f(x,), A, the one that best approximates the supervisor's response. Setting of the Learning Problem During the learning process, the learning machine observes the pairs (x,y) (the training set). After training, the machine must on any given x return a value y^. The goal is to return a value y^ that is close to the supervisor's

response y. Problem of risk minimization In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy L(y, f(x, a)) between the response y of the supervisor to a given input x and the response f(x, a) provided by the learning machine. Consider the expected value of the loss, given by the risk functional The goal is to find the function f(x, , a) which minimizes the risk functional R(a) over the class of functions f(x,), A in the situation where the joint probability distribution P(x,y) is unknown and the only available information is contained in the training set.

Three Main Learning Problems 1. Pattern Recognition Let the supervisor's output y take only two values y = {0,1} and let f(x,), A, be a set of indicator functions (functions which take only two values: zero and one). Consider the following loss function: For this loss function, the functional (1.2) determines the probability of different answers given by the supervisor and by

the indicator function f(x, ). We call the case of different answers a classification error. The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure F(x, y) is unknown, but the data are given. Three Main Learning Problems 2. Regression Estimation Let the supervisor's answer y be a real value, and let f(x, ), A, be a set of real functions that contains the regression function It is known that the regression function is the one that minimizes

the functional (1.2) with the following loss function: Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x,y) is unknown but the data are given. Three Main Learning Problems 3. Density Estimation (Fisher-Wald Setting) Finally, consider the problem of density estimation from the set of densities p(x, ) A. For this problem we consider the following loss function:

It is known that the desired density minimizes the risk functional (1.2) with the above loss function . Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data are given. THE EMPIRICAL RISK MINIMIZATION (ERM) INDUCTIVE PRINCIPLE In order to minimize the risk functional for an unknown probability

measure P(z) the following induction principle is usually employed. The expected risk functional R() is replaced by the empirical risk functional Constructed The on the basis of the training set. principle is to approximate the function Q(z, ) which minimizes the risk by the function Q(z, l) which miniminimizes the empirical risk (1.8). This principle is called the Empirical Risk Minimization induction principle (ERM principle). Empirical risk minimization principle and the classical methods The

classical methods for solving a specific learning problem, such as the least squares method in the problem of regression estimation or the maximum likelihood method in the problem of density estimation are realizations of the ERM principle for the specific loss functions considered above. For different learning settings the ERM can be written as: Regression Problem Density Estimation Since the ERM principle is a general formulation of these classical estimation problems, any theory concerning the ERM principle applies to the

classical methods as well. THE FOUR PARTS OF LEARNING THEORY Learning theory has to address the following four questions: 1. What are (necessary and sufficient) conditions for consistency of a learning process based on the ERM principle? 2. How fast is the rate of convergence of the learning process? 3. How can one control the rate of convergence (the generalization ability) of the learning process? 4. How can one construct algorithms that can control the generalization ability?

What are the conditions for consistency of the ERM principle? To answer this question one has to specify the necessary and sufficient conditions for convergence in probabilityof the following sequences of the random values: The values of risks R{al) converging to the minimal possible value of the risk R(a>0) (where R(al), l = 1,2,... are the expected risks for functions Q(z, a l) each minimizing the empirical risk Remp(al)) The values of obtained empirical risks Remp (al), l = 1,2,... converging to the minimal possible value of the risk R(a0)

The first equation shows that solutions found using ERM converge to the best possible one. Equation shows that empirical risk values converge to the value of the smallest risk. The Answer The answers to these questions form the four parts of learning theory: 1. Theory of consistency of learning processes. 2. Nonasymptotic theory of the rate of convergence of learning processes. 3. Theory of controlling the generalization ability of learning processes, 4. Theory of constructing learning algorithms. Each of these four parts will be discussed in the following chapters.

Projects Manifolds ICA Sparse coding and Labeling Multiclass SVM Gaussian Process Course Evaluation Book Chapter Project Homework Exam Documentation

References Advances in Learning Theory: Methods, Models and Applications, Edited by J.A.K. Suykens ,G. Horvath ,S. Basu C., Micchelli ,J. Vandewalle, 2003 The Nature of Statistical Learning Theory, Vladimir 51. Vapnik, 2000 Learning with Kernels Related Papers