ODED: Outlier Detection in Educational Data
Abstract
Clustering is one of the prominent tasks in discovering hidden patterns in data streams. It refers to the process of grouping newly arrived data into segmentation patterns that change continuously and dynamically. Current data stream clustering algorithms lack clear, general steps for analysing newly arriving data chunks; instead, the majority of existing solutions adapt clustering methods designed for static data to the data stream setting. The main concern is therefore to propose a solution that improves on the performance of existing approaches and reports correct clusters and outliers. Data arriving in streams often contain outliers, which may be as important as the clusters themselves. It is thus desirable for data stream clustering algorithms to detect outliers as well as clusters, while minimising the effect of noise and outliers on the clustering itself. This article presents a stream mining algorithm that clusters a data stream and monitors its evolution. Even though outliers are expected to be present in data streams, explicit outlier detection is rarely performed by stream clustering algorithms. The proposed method is capable of both explicit outlier detection and cluster evolution analysis. The relationship between outlier detection and the occurrence of physical events was studied by applying the algorithm to an education data stream. The experiments led to the conclusion that outlier detection accompanied by a change in the number of clusters indicates a significant education event. This kind of online monitoring and its results can be utilised in education systems in various ways. The study uses education data streams produced by Viber groups.
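The monitoring rule stated in the abstract (an outlier detection accompanied by a change in the number of clusters signals a significant event) can be illustrated with a minimal sketch. This is not the ODED algorithm itself, whose steps are not given here. It assumes a simple centroid-based clusterer that only ever adds clusters, and the names `process_chunk`, `monitor`, `radius`, and `min_new` are hypothetical choices made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def _mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def process_chunk(centroids, counts, chunk, radius, min_new):
    """Update the clusters in place with one chunk; return the chunk's outliers.

    Assumed rule (illustrative only): a point joins the nearest centroid if it
    lies within `radius`; leftover points that form a group of at least
    `min_new` members become a new cluster; all other leftovers are outliers.
    """
    pending = []
    for p in chunk:
        # Index of the nearest existing centroid, or None if there are none yet.
        best = min(range(len(centroids)),
                   key=lambda i: dist(p, centroids[i]), default=None)
        if best is not None and dist(p, centroids[best]) <= radius:
            # Incremental mean update of the absorbing cluster.
            counts[best] += 1
            c = centroids[best]
            centroids[best] = tuple(ci + (pi - ci) / counts[best]
                                    for ci, pi in zip(c, p))
        else:
            pending.append(p)

    outliers = []
    while pending:
        seed = pending[0]
        group = [q for q in pending if dist(seed, q) <= radius]
        pending = [q for q in pending if dist(seed, q) > radius]
        if len(group) >= min_new:
            centroids.append(_mean(group))  # a new cluster has emerged
            counts.append(len(group))
        else:
            outliers.extend(group)          # too few neighbours: outliers
    return outliers


def monitor(chunks, radius=1.0, min_new=3):
    """Return the indices of chunks flagged as events, plus the final centroids."""
    centroids, counts, events = [], [], []
    for t, chunk in enumerate(chunks):
        before = len(centroids)
        outliers = process_chunk(centroids, counts, chunk, radius, min_new)
        # The first chunk only initialises the model, so it cannot raise an event.
        if t > 0 and outliers and len(centroids) != before:
            events.append(t)
    return events, centroids
```

In this sketch a chunk that contains outliers but leaves the cluster count unchanged is not flagged; only the combination of both conditions is. A fuller treatment of cluster evolution would also need to handle clusters that merge, split, or disappear, which this sketch omits.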