8 Pages
English

Predictive low rank decomposition for kernel methods

-

Gain access to the library to view online
Learn more

Description

Niveau: Supérieur, Doctorat, Bac+8
Predictive low-rank decomposition for kernel methods Francis R. Bach Centre de Morphologie Mathematique, Ecole des Mines de Paris 35 rue Saint-Honore, 77300 Fontainebleau, France Michael I. Jordan Computer Science Division and Department of Statistics University of California, Berkeley, CA 94720, USA Abstract Low-rank matrix decompositions are essen- tial tools in the application of kernel meth- ods to large-scale learning problems. These decompositions have generally been treated as black boxes—the decomposition of the kernel matrix that they deliver is indepen- dent of the specific learning task at hand— and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side informa- tion (e.g., classification labels, regression re- sponses) in the computation of low-rank de- compositions for kernel matrices. Our al- gorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition—it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algo- rithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition. 1. Introduction Kernel methods provide a unifying framework for the design and analysis of machine learning algo- rithms (Scholkopf and Smola, 2001, Shawe-Taylor and Cristianini, 2004).

  • kernel matrix

  • can thus

  • gadvk?1 ?

  • look-ahead decompositions

  • cholesky decomposition

  • learning task

  • jordan jordan


Subjects

Informations

Published by
Reads 25
Language English
Predictive lowrank
decomposition for kernel methods
Francis R. Bach CentredeMorphologieMath´ematique,EcoledesMinesdeParis 35rueSaintHonor´e,77300Fontainebleau,France Michael I. Jordan Computer Science Division and Department of Statistics University of California, Berkeley, CA 94720, USA
Abstract Lowrank matrix decompositions are essen tial tools in the application of kernel meth ods to largescale learning problems. These decompositions have generally been treated as black boxes—the decomposition of the kernel matrix that they deliver is indepen dent of the specific learning task at hand— and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side informa tion (e.g., classification labels, regression re sponses) in the computation of lowrank de compositions for kernel matrices. Our al gorithm has the same favorable scaling as stateoftheart methods such as incomplete Cholesky decomposition—it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algo rithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.
1. Introduction Kernel methods provide a unifying framework for the design and analysis of machine learning algo rithms(Sch¨olkopfandSmola,2001,ShaweTaylorand Cristianini, 2004). A key step in any kernel method is the reduction of the data to akernel matrix, also known as aGram matrix. Given the kernel matrix, generic matrixbased algorithms are available for solv ing learning problems such as classification, prediction, anomaly detection, clustering and dimensionality re
nd Appearing inInternational ConferProceedings of the 22 ence on Machine LearningCopy, Bonn, Germany, 2005. right 2005 by the author(s)/owner(s).
francis.bach@mines.org
jordan@cs.berkeley.edu
duction. There are two principal advantages to this division of labor: (1) any reduction that yields a pos itive semidefinite kernel matrix is allowed, a fact that opens the door to specialized transformations that ex ploit domainspecific knowledge; and (2) expressed in terms of the kernel matrix, learning problems often take the form of convex optimization problems, and powerful algorithmic techniques from the convex op timization literature can be brought to bear in their solution.
An apparent drawback of kernel methods is the naive computational complexity associated with manipulat ing kernel matrices. Given a set ofndata points, the kernel matrixKis of sizen×nsuggests a compu. This 2 tational complexity of at leastO(n); in fact most ker nel methods have at their core operations such as ma trix inversion or eigenvalue decomposition which scale 3 asO(n). Moreover, some kernel algorithms make use of sophisticated tools such as semidefinite program ming and have even higherorder polynomial complex ities (Lanckriet et al., 2004). These generic worstcase complexities can often be skirted, and this fact is one of the major reasons for the practical success of kernel methods. The under lying issue is that the kernel matrices often have a spectrum that decays rapidly and are thus of small nu merical rank (Williams and Seeger, 2000). Standard algorithms from numerical linear algebra can thus be exploited to compute an approximation of the form >n×m L=GG, whereGR, and where the rankm is generally significantly smaller thannit. Moreover, is often possible to reformulate kernelbased learning algorithms to make use ofGinstead ofK. The re sulting computational complexity generally scales as 3 2 O(m+m nlinear scaling in). This nmakes kernel based methods viable for largescale problems. To achieve this desirable result, it is of course neces sary that the underlying numerical linear algebra rou