Feature selection is the process of choosing a significant subset of features from a given feature set for pattern recognition. It can be treated as a preprocessing step before constructing a machine learning model and can improve prediction results. Selecting the most significant features reduces training time, reduces the complexity of the machine learning model, helps avoid overfitting, and helps researchers understand the source data. Most features are numeric or string-valued, and most of their distributions are either continuous or categorical. There is, however, a type of feature called a binary feature, whose value is either 1 or 0. Unfortunately, little research addresses the situation where a large portion of the features are binary. Inspired by existing feature selection methods, we present a new framework called FMC_SELECTOR that specifically targets the selection of significant binary features from highly imbalanced datasets. By combining the Fisher linear discriminant analysis technique with the concept of cross-entropy, FMC_SELECTOR selects the most significant features from a given binary feature set. We assess the performance and prediction results of FMC_SELECTOR by comparing it with two popular feature selection methods, Univariate Importance (UI) and Recursive Feature Elimination (RFE); the proposed framework outperforms both benchmarks. A new formula called Mapping Based Cross-Entropy Evaluation (MCE) is derived from cross-entropy and integrates a mapping function to address the specific concerns of binary features. The introduced evaluation method, Positive Case Prediction Score (PPS), can extract additional information from imbalanced datasets where existing methods are inadequate or not applicable.
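The abstract names the two ingredients of FMC_SELECTOR (a Fisher discriminant criterion and a cross-entropy term) but does not give the MCE formula or the mapping function. The sketch below is only a minimal illustration of how those two ingredients could score and rank binary features against imbalanced labels; the product combination, the conditioning on the feature firing, and all function names are assumptions for illustration, not the authors' FMC_SELECTOR.

```python
import numpy as np

def fisher_score(x, y):
    """Fisher criterion for one 0/1 feature x against 0/1 labels y:
    squared class-mean difference over the sum of within-class variances."""
    x0, x1 = x[y == 0], x[y == 1]
    denom = x0.var() + x1.var() + 1e-12  # guard against zero variance
    return (x0.mean() - x1.mean()) ** 2 / denom

def cross_entropy_score(x, y):
    """Cross-entropy between the overall positive rate P(y=1) and the
    positive rate conditioned on the feature firing, P(y=1 | x=1).
    (Hypothetical stand-in for the paper's MCE mapping, which is not
    specified in the abstract.)"""
    p = np.clip(y.mean(), 1e-12, 1 - 1e-12)
    q = y[x == 1].mean() if (x == 1).any() else p
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def rank_binary_features(X, y, top_k=10):
    """Rank columns of a 0/1 matrix X by a combined score.
    Multiplying the two terms is an arbitrary illustrative choice."""
    scores = np.array([fisher_score(X[:, j], y) * cross_entropy_score(X[:, j], y)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:top_k]

# Toy usage on synthetic, highly imbalanced data:
rng = np.random.default_rng(0)
X = (rng.random((500, 20)) < 0.1).astype(int)   # sparse binary features
y = (rng.random(500) < 0.05).astype(int)        # ~5% positive cases
print(rank_binary_features(X, y, top_k=5))
```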