Pushpa Publishing House

Advances and Applications in Statistics

Volume 72, Pages 41-54 (January 2022)
http://dx.doi.org/10.17654/0972361722003

A COMPARATIVE STUDY FOR STATISTICAL OUTLIER DETECTION USING COLON CANCER DATA

M. Vidya Bhargavi and V. Sireesha

Abstract:

Outliers are data points that do not follow the normal or hypothesized trend of the data. They are variously described as atypical, rare, anomalous, or abnormal observations. Detecting outliers is a primary step in obtaining results from any statistical or machine learning analysis. It is important to note that there is no fixed equation or methodology for finding outliers: a definition exists, but what one analyst regards as an outlier another may not. In this paper, we present several outlier detection techniques applied to colon cancer data and identify which of these techniques are more effective at detecting outliers in our dataset.

Keywords and phrases:

statistical outlier (anomaly) detection, colon cancer, tumor sizes, Tukey method, Chauvenet’s criteria, skewness, kurtosis.
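
For readers unfamiliar with two of the methods named in the keywords, the following is a minimal sketch of how the Tukey fence rule and Chauvenet's criterion are commonly stated. It is not the authors' implementation. The function names, the fence multiplier k = 1.5, the use of the sample standard deviation, the quartile convention (Python's default), and the tumor-size values are all assumptions made for illustration; the values are invented and are not taken from the colon cancer data.

```python
import statistics
from statistics import NormalDist


def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences).

    k = 1.5 is the conventional multiplier; it is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # default quartile convention
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def chauvenet_outliers(values):
    """Flag values whose expected count under a normal model is below 0.5.

    A value x is flagged when N * P(|Z| >= |z|) < 0.5, where
    z = (x - mean) / sd. Normality of the data is assumed.
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        return []  # all values identical, nothing to flag
    std_normal = NormalDist()
    flagged = []
    for v in values:
        z = abs(v - mean) / sd
        two_tailed_p = 2.0 * (1.0 - std_normal.cdf(z))
        if n * two_tailed_p < 0.5:
            flagged.append(v)
    return flagged


# Hypothetical tumor sizes in millimetres, invented for this example.
sizes_mm = [12, 15, 18, 20, 22, 25, 27, 30, 33, 120]
print("Tukey:", tukey_outliers(sizes_mm))
print("Chauvenet:", chauvenet_outliers(sizes_mm))
```

On this invented sample both rules flag only the value 120. On skewed data the two rules can disagree, because Chauvenet's criterion relies on a normal model while the Tukey fences depend only on quartiles.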

Received: July 20, 2021; Accepted: November 12, 2021; Published: December 27, 2021

How to cite this article: M. Vidya Bhargavi and V. Sireesha, A comparative study for statistical outlier detection using colon cancer data, Advances and Applications in Statistics 72 (2022), 41-54. DOI: 10.17654/0972361722003

This Open Access Article is Licensed under Creative Commons Attribution 4.0 International License
