Maximum entropy and correlation-based feature selection
rug-compling/featuresqueeze
Feature selection for maximum entropy modeling
==============================================

Daniël de Kok <me@danieldk.eu>

Introduction
------------

featuresqueeze performs maximum entropy-based feature selection for
ranking tasks [1]. Selection is performed by estimating the weight of
each candidate feature and adding, one at a time, the feature and weight
that provide the highest gain [2]. To make this computationally feasible,
we assume that the weight of a feature stays constant once it has been
added to the model.

featuresqueeze also implements a fast feature selection method that
assumes that the gain of a feature rarely increases when another feature
is added to the model [3].

featuresqueeze also provides full optimization, which can be applied
after selecting a certain number of features to improve the accuracy of
the model.

This program has been developed for compacting models for the Alpino
fluency ranker for Dutch.

Compilation
-----------

Requirements:

- C++ standard library with TR1 extensions.
- The Eigen C++ template library for linear algebra.
- cmake 2.6 or later.

g++ 4.2.x satisfies these requirements. With these components in place,
compile by executing the following commands in the top-level directory:

    cmake .
    make

Usage
-----

Feature selection can be performed using the 'fsqueeze' command:

    fsqueeze [OPTION] dataset

      -a val   Alpha convergence threshold (default: 1e-6)
      -c       Correlation selection
      -f       Fast maxent selection (do not recalculate all gains)
      -g val   Gain threshold (default: 1e-20)
      -l n     Apply L-BFGS optimization every n cycles (default: disabled)
      -n val   Maximum number of features
      -o       Find overlap (incompatible with -f)
      -r val   Correlation exclusion threshold (default: 0.9)

Here 'dataset' is a data set in TADM format, without the optional header
line.

To do
-----

- Test with more datasets. The program currently fails on some datasets
  that are very different from mine.
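
Example: lazy selection sketch
------------------------------

The fast selection method mentioned in the Introduction relies on the
assumption that gains rarely increase as the model grows. The sketch
below illustrates how that assumption can be exploited with a priority
queue: a stale gain is treated as an upper bound, so only the feature at
the top of the queue needs its gain recomputed. This is not the code
used by fsqueeze. The names lazySelect and GainFn, and the gain callback
itself, are hypothetical, and the sketch uses C++11 rather than TR1.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical callback: gain of adding `feature` to a model that already
// contains the features in `selected`. A real implementation would estimate
// a weight for the feature and measure the change in log-likelihood.
typedef std::function<double(std::size_t, const std::vector<std::size_t> &)> GainFn;

// Lazy greedy selection. Assumes that the gain of a feature does not
// increase when other features are added, so a stale gain is an upper bound.
std::vector<std::size_t> lazySelect(std::size_t nFeatures,
                                    std::size_t maxFeatures,
                                    double gainThreshold,
                                    const GainFn &gain)
{
  assert(gain && "a gain callback is required");

  // `round` records how many features were selected when `gain` was computed.
  struct Entry { double gain; std::size_t feature; std::size_t round; };
  auto lower = [](const Entry &a, const Entry &b) { return a.gain < b.gain; };
  std::priority_queue<Entry, std::vector<Entry>, decltype(lower)> queue(lower);

  std::vector<std::size_t> selected;

  // Initial gains are computed against the empty model.
  for (std::size_t f = 0; f < nFeatures; ++f)
    queue.push(Entry{gain(f, selected), f, 0});

  while (!queue.empty() && selected.size() < maxFeatures) {
    Entry top = queue.top();
    queue.pop();

    if (top.round != selected.size()) {
      // Stale gain: recompute it against the current model and re-queue.
      top.gain = gain(top.feature, selected);
      top.round = selected.size();
      queue.push(top);
      continue;
    }

    // The gain is current, and under the assumption above no stale entry
    // can exceed it, so this is the best remaining feature.
    if (top.gain < gainThreshold)
      break;

    selected.push_back(top.feature);
  }

  return selected;
}
```

In this sketch, maxFeatures and gainThreshold play roles analogous to the
-n and -g options. If the assumption is violated for some feature, the
loop may pick a feature that is not the true best in that cycle, which is
the trade-off a fast method of this kind accepts.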

License
-------

This application is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation; either version 2.1 of the License, or
(at your option) any later version.

This license is included in the file COPYING in the top-level directory.

References
----------

1. Feature selection for fluency ranking, Daniël de Kok, Proceedings of
   the 6th International Natural Language Generation Conference
   (INLG 2010)
2. A maximum entropy approach to natural language processing, Adam L.
   Berger, Vincent J. Della Pietra, Stephen A. Della Pietra,
   Computational Linguistics, Volume 22, Issue 1, March 1996
3. A fast algorithm for feature selection in conditional maximum entropy
   modeling, Yaqian Zhou, Fuliang Weng, Lide Wu, Hauke Schmidt,
   Theoretical Issues In Natural Language Processing, Proceedings of the
   2003 conference on Empirical methods in natural language processing,
   Volume 10