LIST OF READINGS FOR THE ML COURSE 2011 - EDOC, EPFL

Registration for presentation and questions

Please register:

  1. For one paper presentation on the paper presentation doodle.
  2. For three questions on the question doodle.

March 23: Kernel PCA - CCA

  1. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A. J., and Müller, K.-R., Learning discriminative and invariant nonlinear features, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), pp. 623-628, 2003

    kPCA-Mika2003.pdf

    Abstract: We incorporate prior knowledge to construct nonlinear algorithms for invariant feature extraction and discrimination. Employing a unified framework in terms of a nonlinearized variant of the Rayleigh coefficient, we propose nonlinear generalizations of Fisher's discriminant and oriented PCA using support vector kernel functions. Extensive simulations show the utility of our approach.

  2. M.E. Tipping. Sparse Kernel Principal Component Analysis. NIPS (2001)

    skpca_Tipping01.pdf

    Abstract: `Kernel' principal component analysis (PCA) is an elegant nonlinear generalization of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not `sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness.

  3. Cheng Yang, Liwei Wang, and Jufu Feng, On Feature Extraction via Kernels, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(2), pp. 553-557, 2008.

    kCCA-Yang_et_al08.pdf

    Abstract: Using the kernel trick idea and the kernels-as-features idea, we can construct two kinds of nonlinear feature spaces, where linear feature extraction algorithms can be employed to extract nonlinear features. In this correspondence, we study the relationship between the two kernel ideas applied to certain feature extraction algorithms such as linear discriminant analysis, principal component analysis, and canonical correlation analysis. We provide a rigorous theoretical analysis and show that they are equivalent up to different scalings on each feature. These results provide a better understanding of the kernel method.

March 30: Kernel ICA

  1. M. Asuncion Vicente, Patrik O. Hoyer, and Aapo Hyvarinen, Equivalence of Some Common Linear Feature Extraction Techniques for Appearance-Based Object Recognition Tasks, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 29(5), 2007

    PCA-ICA-Asuncion_et_al07.pdf

    Abstract: Recently, a number of empirical studies have compared the performance of PCA and ICA as feature extraction methods in appearance-based object recognition systems, with mixed and seemingly contradictory results. In this paper, we briefly describe the connection between the two methods and argue that whitened PCA may yield identical results to ICA in some cases. Furthermore, we describe the specific situations in which ICA might significantly improve on PCA.

  2. Riccardo Boscolo, Hong Pan and Vwani P. Roychowdhury. Independent Component Analysis Based on Nonparametric Density Estimation, IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 15, NO. 1, JANUARY 2004

    ICA-non-param.pdf

    Abstract: In this paper, we introduce a novel independent component analysis (ICA) algorithm, which is truly blind to the particular underlying distribution of the mixed signals. Using a nonparametric kernel density estimation technique, the algorithm performs simultaneously the estimation of the unknown probability density functions of the source signals and the estimation of the unmixing matrix. Following the proposed approach, the blind signal separation framework can be posed as a nonlinear optimization problem, where a closed form expression of the cost function is available, and only the elements of the unmixing matrix appear as unknowns. We conducted a series of Monte Carlo simulations, involving linear mixtures of various source signals with different statistical characteristics and sample sizes. The new algorithm not only consistently outperformed all state-of-the-art ICA methods, but also demonstrated the following properties: 1) Only a flexible model, capable of learning the source statistics, can consistently achieve an accurate separation of all the mixed signals. 2) Adopting a suitably designed optimization framework, it is possible to derive a flexible ICA algorithm that matches the stability and convergence properties of conventional algorithms. 3) A nonparametric approach does not necessarily require large sample sizes in order to outperform

  3. Te-Won Lee, M. Girolami and T. Sejnowski, Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources, Neural Computation, 11(2), pp. 417-441, 1999.

    ICA-Sejnowski99.pdf

    Abstract: An extension of the infomax algorithm of Bell and Sejnowski (1995) is presented that is able blindly to separate mixed signals with sub- and supergaussian source distributions. This was achieved by using a simple type of learning rule first derived by Girolami (1997) by choosing negentropy as a projection pursuit index. Parameterized probability distributions that have sub- and supergaussian regimes were used to derive a general learning rule that preserves the simple architecture proposed by Bell and Sejnowski (1995), is optimized using the natural gradient by Amari (1998), and uses the stability analysis of Cardoso and Laheld (1996) to switch between sub- and supergaussian regimes. We demonstrate that the extended infomax algorithm is able to separate 20 sources with a variety of source distributions easily. Applied to high-dimensional data from electroencephalographic recordings, it is effective at separating artifacts such as eye blinks and line noise from weaker electrical signals that arise from sources in the brain.

April 6: SVM

  1. A Comparison of Methods for Multiclass Support Vector Machines, Chih-Wei Hsu and Chih-Jen Lin, IEEE Trans. on Neural Networks, 13(2), 2002.

    SVM-Hsu2002.pdf

    Abstract: Support vector machines (SVMs) were originally designed for binary classification. How to effectively extend it for multiclass classification is still an ongoing research issue. Several methods have been proposed where typically we construct a multiclass classifier by combining several binary classifiers. Some authors also proposed methods that consider all classes at once. As it is computationally more expensive to solve multiclass problems, comparisons of these methods using large-scale problems have not been seriously conducted. Especially for methods solving multiclass SVM in one step, a much larger optimization problem is required so up to now experiments are limited to small data sets. In this paper we give decomposition implementations for two such "all-together" methods. We then compare their performance with three methods based on binary classifications: "one-against-all," "one-against-one," and directed acyclic graph SVM (DAGSVM). Our experiments indicate that the "one-against-one" and DAG methods are more suitable for practical use than the other methods. Results also show that for large problems methods by considering all data at once in general need fewer support vectors.

  2. Choosing Multiple Parameters for Support Vector Machines, O. Chapelle, V. Vapnik, O. Bousquet and S. Mukherjee, Machine Learning, 46, 131-159, 2002

    SVM-Chapelle2002.pdf

    Abstract: The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVMs) is considered. This is done by minimizing some estimates of the generalization error of SVMs using a gradient descent algorithm over the set of parameters. Usual methods for choosing parameters, based on exhaustive search become intractable as soon as the number of parameters exceeds two. Some experimental results assess the feasibility of our approach for a large number of parameters (more than 100) and demonstrate an improvement of generalization performance.

  3. Teo, C. H., Globerson, A., Roweis, S., and Smola, A. J., Convex Learning with Invariances, Advances in Neural Information Processing Systems 20, J.C. Platt and D. Koller and Y. Singer and S. Roweis (Eds.), MIT Press, Cambridge, MA 2008.

    ConvexLearningSmola07.pdf

    Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly.

April 13: Gaussian Process and Gaussian Mixture Models

  1. A Unifying View of Sparse Approximate Gaussian Process Regression, Joaquin Quinonero-Candela and Carl Edward Rasmussen, Journal of Machine Learning Research 6 (2005) 1939-1959

    SparseGP-Quinonero-Candela-Rasmussen05.pdf

    Abstract: We provide a new unifying view, including all existing proper probabilistic sparse approximations for Gaussian process regression. Our approach relies on expressing the effective prior which the methods are using. This allows new insights to be gained, and highlights the relationship between existing methods. It also allows for a clear theoretically justified ranking of the closeness of the known approximations to the corresponding full GPs. Finally we point directly to designs of new better sparse approximations, combining the best of the existing strategies, within attractive computational constraints.

  2. Wang, J. M., Fleet, D. J., Hertzmann, A. Gaussian Process Dynamical Models for Human Motion. In IEEE Transactions on Pattern Analysis and Machine Intelligence. February, 2008. pp. 283-298.

    GMDM_Wang08.pdf

    Abstract: We introduce Gaussian process dynamical models (GPDMs) for nonlinear time series analysis, with applications to learning models of human pose and motion from high-dimensional motion capture data. A GPDM is a latent variable model. It comprises a low dimensional latent space with associated dynamics, as well as a map from the latent space to an observation space. We marginalize out the model parameters in closed form by using Gaussian process priors for both the dynamical and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach and compare four learning algorithms on human motion capture data, in which each pose is 50-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces.

  3. Khansari, M. and Billard, A. Learning Stable Non-Linear Dynamical Systems with Gaussian Mixture Models, IEEE Trans. on Robotics, 2011.

    Khansari_Billard_TRO_2011.pdf

    Abstract: This paper presents a method for learning arbitrary discrete motions from a set of demonstrations. We model a motion as a nonlinear autonomous (i.e. time-invariant) Dynamical System (DS), and define sufficient conditions to ensure global asymptotic stability at the target. We propose a learning method, called Stable Estimator of Dynamical Systems (SEDS), to learn the parameters of the DS to ensure that all motions follow closely the demonstrations while ultimately reaching and stopping at the target. Time-invariance and global asymptotic stability at the target ensures that the system can respond immediately and appropriately to perturbations encountered during the motion. The method is evaluated through a set of robotic experiments and on a library of human handwriting motions.

April 20: Decision trees

  1. Yali Amit, Donald Geman, and Kenneth Wilder. Joint Induction of Shape Features and Tree Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(11):1300-1305, (1997)

    Amit_Geman_Wilder_-_Joint_Induction_Shape_Trees.pdf

    Abstract: We introduce a very large family of binary features for two-dimensional shapes. The salient ones for separating particular shapes are determined by inductive learning during the construction of classification trees. There is a feature for every possible geometric arrangement of local topographic codes. The arrangements express coarse constraints on relative angles and distances among the code locations and are nearly invariant to substantial affine and nonlinear deformations. They are also partially ordered, which makes it possible to narrow the search for informative ones at each node of the tree. Different trees correspond to different aspects of shape. They are statistically weakly dependent due to randomization and are aggregated in a simple way. Adapting the algorithm to a shape family is then fully automatic once training samples are provided. As an illustration, we classify handwritten digits from the NIST database; the error rate is .7 percent.

  2. Leo Breiman. Bagging Predictors. Machine Learning, 24(2):123-140, (1996)

    Breiman_-_Bagging_Predictors.pdf

    Abstract: Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class. The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets. Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

  3. Frank Moosmann, Eric Nowak, and Frederic Jurie. Randomized Clustering Forests for Image Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1632-1646, (2008)

    Moosmann_Nowak_Jurie_-_Randomized_Forest_Image_Classification.pdf

    Abstract: Some of the most effective recent methods for content-based image classification work by quantizing image descriptors, and accumulating histograms of the resulting visual word codes. Large numbers of descriptors and large codebooks are required for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests - ensembles of randomly created clustering trees - and show that they provide more accurate results, much faster training and testing, and good resistance to background clutter. Second, an efficient image classification method is proposed. It combines ERC-Forests and saliency maps very closely with the extraction of image information. For a given image, a classifier builds a saliency map online and uses it to classify the image. We show in several state-of-the-art image classification tasks that this method can speed up the classification process enormously. Finally, we show that the proposed ERC-Forests can also be used very successfully for learning distance between images. The distance computation algorithm consists of learning the characteristic differences between local descriptors sampled from pairs of same or different objects. These differences are vector quantized by ERC-Forests and the similarity measure is computed from this quantization. The similarity measure has been evaluated on four very different datasets and always outperforms the state-of-the-art competitive approaches.
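
As a warm-up for the Breiman reading in this session, the procedure described in the bagging abstract (bootstrap replicates of the learning set, one predictor per replicate, plurality vote for classification) can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the one-dimensional threshold "stump" learner, the function names and the default of 25 replicates are all assumptions made here for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement: one bootstrap replicate."""
    return [rng.choice(data) for _ in data]


def fit_stump(data):
    """Hypothetical base learner: the best single-threshold rule on 1-D inputs.

    data is a list of (x, label) pairs with labels 0 or 1.
    """
    best = None
    for threshold, _ in data:
        # Try both label orientations around each candidate threshold.
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagged_classifier(data, n_models=25, seed=0):
    """Fit one stump per bootstrap replicate; predict by plurality vote."""
    rng = random.Random(seed)
    models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy learning set (an assumption for the example): label 0 below 10, 1 above.
toy_data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
predict = bagged_classifier(toy_data)
```

For a numerical outcome the abstract prescribes averaging the versions instead of voting; in this sketch that would mean replacing the Counter vote with the mean of the individual predictions.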

May 4: Boosting

  1. François Fleuret and Donald Geman. Stationary Features and Cat Detection. Journal of Machine Learning Research, 9:2549-2578, (2008).

    Fleuret_Geman_-_Stationary_Features.pdf

    Abstract: Most discriminative techniques for detecting instances from object categories in still images consist of looping over a partition of a pose space with dedicated binary classifiers. The efficiency of this strategy for a complex pose, that is, for fine-grained descriptions, can be assessed by measuring the effect of sample size and pose resolution on accuracy and computation. Two conclusions emerge: (1) fragmenting the training data, which is inevitable in dealing with high in-class variation, severely reduces accuracy; (2) the computational cost at high resolution is prohibitive due to visiting a massive pose partition. To overcome data-fragmentation we propose a novel framework centered on pose-indexed features which assign a response to a pair consisting of an image and a pose, and are designed to be stationary: the probability distribution of the response is always the same if an object is actually present. Such features allow for efficient, one-shot learning of pose-specific classifiers. To avoid expensive scene processing, we arrange these classifiers in a hierarchy based on nested partitions of the pose; as in previous work on coarse-to-fine search, this allows for efficient processing. The hierarchy is then "folded" for training: all the classifiers at each level are derived from one base predictor learned from all the data. The hierarchy is "unfolded" for testing: parsing a scene amounts to examining increasingly finer object descriptions only when there is sufficient evidence for coarser ones. In this way, the detection results are equivalent to an exhaustive search at high resolution. We illustrate these ideas by detecting and localizing cats in highly cluttered greyscale scenes.

  2. Gunnar Rätsch, Takashi Onoda, and Klaus R. Müller. Regularizing AdaBoost. Proceedings of the international conference on Neural Information Processing Systems, 564-570, (1998).

    Ratsch_Onoda_Muller_-_Regularizing_Adaboost.ps

    Abstract: Boosting methods maximize a hard classification margin and are known as powerful techniques that do not exhibit overfitting for low noise cases. Also for noisy data boosting will try to enforce a hard margin and thereby give too much weight to outliers, which then leads to the dilemma of non-smooth fits and overfitting. Therefore we propose three algorithms to allow for soft margin classification by introducing regularization with slack variables into the boosting concept: (1) AdaBoost reg and regularized versions of (2) linear and (3) quadratic programming AdaBoost. Experiments show the usefulness of the proposed algorithms in comparison to another soft margin classifier: the support vector machine.

  3. Robert E. Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336, (1999).

    Schapire_Singer_-_Improved_Boosting_Confidence.ps

    Abstract: We describe several improvements to Freund and Schapire's AdaBoost boosting algorithm, particularly in a setting in which hypotheses may assign confidences to each of their predictions. We give a simplified analysis of AdaBoost in this setting, and we show how this analysis can be used to find improved parameter settings as well as a refined criterion for training weak hypotheses. We give a specific method for assigning confidences to the predictions of decision trees, a method closely related to one used by Quinlan. This method also suggests a technique for growing decision trees which turns out to be identical to one proposed by Kearns and Mansour. We focus next on how to apply the new boosting algorithms to multiclass classification problems, particularly to the multi-label case in which each example may belong to more than one class. We give two boosting methods for this problem, plus a third method based on output coding. One of these leads to a new method for handling the single-label case which is simpler but as effective as techniques suggested by Freund and Schapire. Finally, we give some experimental results comparing a few of the algorithms discussed in this paper.

May 11: Connectionist methods I

  1. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323, 533-536, 1986.

    1986-rumelhart.pdf

    Abstract: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

  2. Yann Le Cun, Léon Bottou, Genevieve B. Orr and Klaus-Robert Müller: Efficient Backprop, Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524, Springer Verlag, 1998.

    1998-lecun.pdf

    Abstract: The convergence of back-propagation learning is analyzed so as to explain common phenomenon observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

  3. Léon Bottou: Online Algorithms and Stochastic Approximations, Online Learning and Neural Networks, Edited by David Saad, Cambridge University Press, Cambridge, UK, 1998.

    1998-bottou.pdf

    Abstract: The convergence of online learning algorithms is analyzed using the tools of the stochastic approximation theory, and proved under very weak conditions. A general framework for online learning algorithms is first presented. This framework encompasses the most common online learning algorithms in use today, as illustrated by several examples. The stochastic approximation theory then provides general results describing the convergence of all these learning algorithms at once.

May 18: Connectionist methods II

  1. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328-339, 1989.

    1989-waibel.pdf

    Abstract: In this paper we present a Time-Delay Neural Network (TDNN) approach to phoneme recognition which is characterized by two important properties. 1) Using a 3 layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces. The TDNN learns these decision surfaces automatically using error backpropagation [1]. 2) The time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independent of position in time and hence not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes "B," "D," and "G" in varying phonetic contexts was chosen. For comparison, several discrete Hidden Markov Models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5 percent correct while the rate obtained by the best of our HMM's was only 93.7 percent. Closer inspection reveals that the network "invented" well-known acoustic-phonetic features (e.g., F2-rise, F2-fall, vowel-onset) as useful abstractions. It also developed alternate internal representations to link different acoustic realizations to the same concept.

  2. Y. LeCun: Generalization and Network Design Strategies, (CRG-TR-89-4), Department of Computer Science, University of Toronto, 1989

    lecun-89t.pdf

    Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.

  3. Léon Bottou, Yann Le Cun and Yoshua Bengio: Global Training of Document Processing Systems using Graph Transformer Networks, Proc. of Computer Vision and Pattern Recognition, 489-493, IEEE, Puerto Rico, 1997.

    1997-bottou.pdf

    Section VI only (for the notation) of Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998.

    1998-lecun.pdf

    Abstract: We propose a new machine learning paradigm called Graph Transformer Networks that extends the applicability of gradient-based learning algorithms to systems composed of modules that take graphs as inputs and produce graphs as output. Training is performed by computing gradients of a global objective function with respect to all the parameters in the system using a kind of back-propagation procedure. A complete check reading system based on these concepts is described. The system uses convolutional neural network character recognizers, combined with global training techniques to provide record accuracy on business and personal checks. It is presently deployed commercially and reads millions of checks per month.