Reading Group Feature Selection - SS 07


Feature selection is the task of choosing an subset S of input variables/features that yields high performance when used by some learning algorithm. This explanation is somehow vague.

  • Representation of the superset: Let V be the set that contains all variables considered. Is V given explicitly, such that we can do search on it by locking at any member in turn, or is it only an abstract set from which we can generate members but where exhaustive search is unfeasible?
  • Division of Feature Selection and Learning: Doesn't the selection of useful variables and learning weights for a model using these variables belong together?
We do not intend to clarify the notion feature selection. Instead, we want to review work carrying that label in order to learn what selection criteria and techniques had been proposed and how they relate to learning scheme in line. As a starting point we take the Special Issue on Variable and Feature Selection

of which a survey is given in

  • [GuyonE05]: An Introduction to Variable and Feature Selection, Isabelle Guyon, André Elisseeff, JMLR Special Issue on Variable and Feature Selection (Kernel Machines Section), 2003.

Further, we want to contrast these techniques to those that are used in symbolic rule learning algorithms, as rules can also be taken as binary variables.


Date Reading Presenter
16.5.2007 Intro
23.5.2007 Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy Sang-Hyeun
30.5.2007 Searching for Interacting Features Nik
6.6.2007 Distributional Word Clusters vs. Words for Text Categorization Eneldo
13.6.2007 Ranking a Random Feature for Variable and Feature Selection Frederik
20.6.2007 Semi-supervised Feature Selection via Spectral Analysis Selection Lorenz

Systematics following [GuyonE05]

For an first overview we make the following distinction of feature selection approaches, where we follow the survey paper [GuyonE2005].
  • filters - selection schemes that do not depend on nor use any predictor
  • wrappers - approximated optimization of subsets w.r.t. the cost function of the predictor
  • embedded - integration of feature selection criteria into the cost function of the learner

Reading Proposals

  • further proposals welcome


  • [DhillonMK03]: A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification, Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar, 2003. About: Feature creation by distributional clustering based information theoretic relations, applied to (hierarchical) text classification. See [GuyonE05] section 5.1, see also Bekkerman et al. 2003.
  • [StoppigliaDDO03]: Ranking a Random Feature for Variable and Feature Selection, Hervé Stoppiglia, Gérard Dreyfus, Rémi Dubois, Yacine Oussar, 2003. About: Combines feature ranking via Gram-Schmid on example space with setting random probe in order get the right cut. See [GuyonE05], section 4.2.
  • [ZhengL07]: Searching for Interacting Features, Z. Zhao and H. Liu, 2007. About: find subsets of features that have joint relevance (..).
  • [ZhengL07b]: Semi-supervised Feature Selection via Spectral Analysis, Z. Zhao and H. Liu, 2007. About: Graphtheory based ranking of features. If the image of a point in the example space w.r.t. to the root of the Laplacian of the similarity matrix has great L_2 norm, the corresponding feature is considered to fit less to the given similarities and is therefore less relevant.


  • [PengLD05]: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, Peng, H.C., Long, F., and Ding, C., 2005. About: Section 3 gives an analytical discussion of the selection criteria mentioned in the title. Later they show how they can efficiently be used for wrapping techniques.


  • [WestonEST03]: Use of the Zero Norm with Linear Models and Kernel Methods, Jason Weston, André Elisseeff, Bernhard Schölkopf, Mike Tipping, 2003. About: Prototype embedded selection where in SVM the L_2 regularization is (approximately) replaced by L_0, e.g. by counting the non-zeros.
  • [PerkinsLT03]: Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space, Simon Perkins, Kevin Lacker, James Theiler, 2003. About: Embedded selection based on sensitivity of some elaborated cost function to changes of so far unused features weights.
  • [McCallum03]: Efficiently Inducing Features of Conditional Random Fields. Andrew McCallum, 2003. About: Embedded forward selection based on the conditional likelihood of the CRF-model as trained so far.



small ke-icon

Knowledge Engineering Group

Fachbereich Informatik
TU Darmstadt

S2|02 D203
Hochschulstrasse 10

D-64289 Darmstadt

Telefon-Symbol+49 6151 16-21811
Fax-Symbol +49 6151 16-21812

A A A | Drucken | Impressum | Sitemap | Suche | Mobile Version
zum Seitenanfangzum Seitenanfang