Pushpa Publishing House

Advances and Applications in Statistics

Volume 78, Pages 63-82 (July 2022)
http://dx.doi.org/10.17654/0972361722050

INVESTIGATING TERM WEIGHTING SCHEMES ON THE CLASSIFICATION PERFORMANCE FOR THE IMBALANCED TEXT DATA

Afra Al Manei, Iman Al Hasani and Ronald Wesonga

Abstract:

Term weighting (TW) has been found to improve results for the text classification problem. However, little evidence exists on the essential differences among TW schemes in terms of classification performance. In this study, we present the results of an investigation of the three most popular TW schemes, namely, count, term frequency-inverse document frequency (TFIDF) and term frequency-inverse category frequency (TFICF), under the multinomial Naive Bayes (MNB) and support vector machine (SVM) classification algorithms using imbalanced text data. Our results revealed that the count weighting scheme with the MNB gives a higher macro-average recall compared to the other schemes with SVM. On the other hand, the TFICF with the SVM generates a higher macro-average recall compared to the other two schemes. The findings suggest that TW schemes have different effects on the classification of imbalanced text data. Whereas the count weighting scheme performs better in classifying text data using the MNB, the same count scheme with SVM seems to handle the imbalanced data issue better than the count under the MNB classifier. Therefore, our findings reveal that the classification performance on imbalanced text data can greatly improve when the count weighting scheme is used with the MNB classifier and the TFICF with the SVM classifier. This study is significant as it recommends a benchmark for the use and application of TW schemes for classification algorithms with imbalanced text data.
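As a rough illustration of the quantities named in the abstract, the sketch below computes count, TFIDF and TFICF weights for a toy corpus, together with macro-average recall. It is not the authors' implementation. The inverse-frequency forms log(N / df) and log(C / cf), the function names, and the toy documents and labels are all assumptions made for this example; the abstract does not state the exact formulas used in the study.

```python
import math
from collections import Counter


def count_weights(doc_tokens):
    """Raw term counts for one tokenized document."""
    return dict(Counter(doc_tokens))


def idf(docs):
    """Inverse document frequency, assumed here as log(N / df(t))."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / d) for term, d in df.items()}


def icf(docs, labels):
    """Inverse category frequency, assumed here as log(C / cf(t)),
    where cf(t) is the number of categories whose documents contain t."""
    cats_with_term = {}
    for doc, label in zip(docs, labels):
        for term in set(doc):
            cats_with_term.setdefault(term, set()).add(label)
    n_cats = len(set(labels))
    return {t: math.log(n_cats / len(c)) for t, c in cats_with_term.items()}


def weight(doc_tokens, inverse):
    """Multiply raw counts by an inverse-frequency table (TFIDF or TFICF)."""
    return {t: c * inverse.get(t, 0.0) for t, c in Counter(doc_tokens).items()}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so each class counts equally
    regardless of how many documents it has."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy corpus: one "spam" document, two "ham" documents.
docs = [["cheap", "offer", "offer"], ["meeting", "agenda"], ["meeting", "offer"]]
labels = ["spam", "ham", "ham"]

counts_doc0 = count_weights(docs[0])
tfidf_doc0 = weight(docs[0], idf(docs))
tficf_doc0 = weight(docs[0], icf(docs, labels))

# A classifier that always predicts the majority class is right 4 times
# out of 5, yet its macro-average recall is only 0.5.
y_true = ["ham", "ham", "ham", "ham", "spam"]
y_pred = ["ham", "ham", "ham", "ham", "ham"]
majority_macro_recall = macro_recall(y_true, y_pred)
```

In this toy corpus, "offer" occurs in both categories, so its TFICF weight is zero while its TFIDF weight stays positive. The last lines show why macro-average recall is a natural metric for imbalanced data: it penalizes a classifier that ignores the minority class even when overall accuracy looks high.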

Keywords and phrases:

term weighting, multinomial Naive Bayes, support vector machine, text analysis, research thesis.

Received: April 7, 2022; Accepted: May 26, 2022; Published: June 27, 2022

How to cite this article: Afra Al Manei, Iman Al Hasani and Ronald Wesonga, Investigating term weighting schemes on the classification performance for the imbalanced text data, Advances and Applications in Statistics 78 (2022), 63-82. http://dx.doi.org/10.17654/0972361722050

This Open Access Article is Licensed under Creative Commons Attribution 4.0 International License
