GitHub - rug-compling/featuresqueeze: Maximum entropy and correlation-based feature selection

rug-compling / featuresqueeze Public

Notifications You must be signed in to change notification settings
Fork 0
Star 1

Maximum entropy and correlation-based feature selection

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 138 Commits
cmake		cmake
example		example
libfsqueeze		libfsqueeze
overlapviewer		overlapviewer
util		util
CMakeLists.txt		CMakeLists.txt
COPYING		COPYING
README		README

Repository files navigation

Feature selection for maximum entropy modeling
==============================================
Daniël de Kok <me@danieldk.eu>

Introduction
------------

featuresqueeze performs maximum entropy-based feature selection for
ranking tasks [1]. Selection is performed by attempting to add estimate
feature weights one at a time, where the weight/feature that provides
the highest gain is added [2]. To make this computationally feasible,
we assume that the weight of a feature that was added to a model stays
constant.

featuresqueeze also implements a fast feature selection method, that
assumes that gains of features rarely improve when a feature is added to
the model [3].

featuresqueeze also provides full optimization, which can be applied
after selecting a certain number of features to improve the accuracy
of the model.

This program has been developed for compacting models for the Alpino
fluency ranker for Dutch.

Compilation
-----------

Requirements:

- C++ standard library with TR1 extensions.
- The Eigen C++ template library for linear algebra.
- cmake 2.6 or later.

g++ 4.2.x satisfies these requirements. With these components in place,
compile by executing the following commands in the top-level directory:

    cmake .
    make

Usage
-----

Feature selection can be performed using the 'fsqueeze' command:

fsqueeze [OPTION] dataset

  -a val   Alpha convergence threshold (default: 1e-6)
  -c       Correlation selection
  -f       Fast maxent selection (do not recalculate all gains)
  -g val   Gain threshold (default: 1e-20)
  -l n     Apply L-BFGS optimization every n cycles (default: disabled)
  -n val   Maximum number of features
  -o       Find overlap (incompatible with -f)
  -r val   Correlation exclusion threshold (default: 0.9)

Where 'dataset' is a data set in TADM format minus the optional header
line.

To do
-----

- Test with more datasets, the program currently fails on some datasets that
  are very much different from mine.

License
-------

This application is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version. This license is included in the file COPYING in
the top-level directory.

References
----------

1. Feature selection for fluency ranking, Daniël de Kok, Proceedings of the
   6th International Language Generation Conference (INLG 2010)
2. A maximum entropy approach to natural language processing, Adam L. Berger,
   Vincent J. Della Pietra, Stephen A. Della Pietra, Computational Linguistics,
   Volume 22, Issue 1, March 1996
3. A fast algorithm for feature selection in conditional maximum entropy
   modeling, Yaqian Zhou, Fuliang Weng, Lide Wu, Hauke Schmidt, Theoretical
   Issues In Natural Language Processing, Proceedings of the 2003 conference
   on Empirical methods in natural language processing, Volume 10