THE ENHANCED VERSION OF TF-IDF FEATURE VECTOR FOR MALWARE DETECTION
In this paper, we propose two enhanced versions of the traditional TF-IDF method. The first considers term frequency across all documents and amplifies the influence of features that occur with high frequency. The second considers relative term frequency within the same document: it uses a scaled TF that divides the logarithmic term frequency by the maximum term frequency, together with a corresponding logarithmic IDF. We evaluate both methods with three machine learning algorithms: a multi-layer perceptron, a decision tree, and a k-nearest neighbor classifier. The proposed methods perform better than existing methods. In addition, by considering both accuracy and running time, we recommend the best combination of proposed method and classifier.
feature extraction, machine learning, malware detection, natural language processing, TF-IDF.
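As a rough illustration of the second method described in the abstract, the following Python sketch computes a scaled TF-IDF weight per term. It is one possible reading, not the exact formulation: it assumes the scaled TF is log(1 + tf) divided by the largest raw term count in the same document, and that the IDF is log(N / df). The add-one inside the logarithm, the function name `scaled_tf_idf`, and the toy opcode tokens are all illustrative assumptions.

```python
import math
from collections import Counter


def scaled_tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed reading of the abstract (hypothetical, not the authors' exact formula):
      scaled TF = log(1 + tf) / max_tf   (max_tf = largest raw count in the document)
      IDF       = log(N / df)            (N = number of documents)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        max_tf = max(counts.values(), default=1)
        vec = {}
        for term, tf in counts.items():
            scaled_tf = math.log(1 + tf) / max_tf  # assumption: add-one avoids log(1) = 0
            idf = math.log(n_docs / df[term])
            vec[term] = scaled_tf * idf
        vectors.append(vec)
    return vectors


# Toy corpus: each "document" is a list of opcode-like tokens (illustrative only).
docs = [
    ["mov", "mov", "push"],
    ["mov", "call"],
    ["mov", "xor", "xor"],
]
vectors = scaled_tf_idf(docs)
```

Under this reading, a term that appears in every document (here `mov`) receives zero weight, while a term that is frequent in only one document (here `xor`) receives the largest weight in that document.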