A Method for Information Retrieval from Missing Data Using Data Mining Techniques and a Genetic Algorithm

Document Type: Original Article

Authors

1 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Bozorgmehr University of Qaenat, Qaenat, Iran.

2 Assistant Professor, ICT Research Institute (ITRC), Tehran, Iran.

Abstract

Purpose: The statistical literature uses several terms, often interchangeably, for the same concept: missing data, lost data, incomplete data, nonresponse data, and others. Missing data, or missing values, occur when no value is recorded for a variable in a given observation. In economic, sociological, and political science research, data are often lost because government or private entities provide incomplete reports, because study participants withdraw or decline to answer certain questions, or because researchers, technicians, and data collectors make errors. Missing data can distort the distribution of variables, which may cause models to overfit or underfit. They can also bias a dataset, skewing statistical analyses, leading to incorrect model analysis, and making it difficult to draw meaningful conclusions from the collected data. Traditionally, the most common way to handle missing data was simply to discard the incomplete observations, which lowered dataset quality and biased the resulting analyses and findings. Today, with scientific advances in various fields and the emergence of powerful statistical methods, missing values in incomplete datasets can be imputed or estimated before modeling. Given the importance of managing missing data, the present study proposes a method for improving the accuracy of information and knowledge retrieval from datasets with missing values.
Method: The proposed method combines data mining techniques (clustering and regression) with a heuristic search method, the genetic algorithm. Existing methods use the entire dataset to impute missing values, so records that are dissimilar to the one with missing data enter the estimate and make it less accurate. The proposed algorithm first clusters the data to identify similar records. Then, for each cluster, it calculates the proportion of missing data in each attribute (column). Based on this proportion, it applies either a regression model or a genetic algorithm to recover the missing values.
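The cluster-then-choose step described in the Method paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, the cut-off proportion, or which technique is applied on which side of it. The sketch assumes k-means with a deterministic farthest-point initialization, a temporary column-mean fill so that distances can be computed, and a hypothetical `threshold` of 0.3 at or below which regression is chosen and above which the genetic algorithm is chosen. The names `kmeans` and `choose_imputer` are illustrative only.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with farthest-point initialization (an assumption)."""
    centers = [X[0]]
    for _ in range(1, k):
        # Next center: the row farthest from all centers chosen so far.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each row to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def choose_imputer(data, k=2, threshold=0.3):
    """Cluster rows, then pick a strategy per (cluster, column) with missing data.

    `threshold` and the direction of the rule are hypothetical; the abstract
    only says the choice depends on the per-cluster missing proportion.
    """
    # Temporary column-mean fill, used only to compute distances.
    col_means = np.nanmean(data, axis=0)
    filled = np.where(np.isnan(data), col_means, data)
    labels = kmeans(filled, k)
    plan = {}
    for c in range(k):
        rows = data[labels == c]
        if len(rows) == 0:
            continue
        missing_ratio = np.isnan(rows).mean(axis=0)
        for col, ratio in enumerate(missing_ratio):
            if ratio > 0:
                plan[(c, col)] = "regression" if ratio <= threshold else "genetic"
    return labels, plan


# Toy data: two well-separated groups of records with a few missing entries.
data = np.array([
    [1.0, 2.0, 3.0],
    [1.1, np.nan, 3.1],
    [0.9, 2.1, 2.9],
    [1.0, 1.9, 3.0],
    [10.0, 20.0, 30.0],
    [10.1, 20.1, np.nan],
    [9.9, np.nan, 29.9],
    [10.0, 19.9, np.nan],
])
labels, plan = choose_imputer(data, k=2, threshold=0.3)
print(labels)
print(plan)
```

The returned `plan` maps each (cluster, column) pair that contains missing values to the technique that would be applied there; the imputation itself is left out of the sketch.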
Findings: When the proposed method was applied to a dataset with randomly missing data, its error rate was 27%. This rate was considerably lower than those of the other methods tested: mean, median, and mode substitution (56.5%), regression (34.6%), and the support vector machine (SVM) method (42.1%). These results indicate that the proposed method imputes missing data more accurately than the compared methods.
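An evaluation on randomly missing data of the kind reported in the Findings can be set up along the following lines. This is only a sketch: the abstract does not state how the error rate was computed, so the mean absolute relative error used here is an assumption, as are the helper names `mask_at_random`, `mean_impute`, and `relative_error`. Column-mean substitution stands in for the method under test.

```python
import numpy as np


def mask_at_random(X, fraction, seed=0):
    """Return a copy of X with `fraction` of its entries set to NaN, plus the mask."""
    rng = np.random.default_rng(seed)
    n_hide = int(round(fraction * X.size))
    flat_idx = rng.choice(X.size, size=n_hide, replace=False)
    mask = np.zeros(X.size, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(X.shape)
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Baseline: replace each NaN with the mean of its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def relative_error(X_true, X_imputed, mask):
    """Mean absolute relative error over the hidden entries only.

    Assumed metric; it requires the hidden true values to be non-zero.
    """
    truth = X_true[mask]
    return float(np.mean(np.abs(X_imputed[mask] - truth) / np.abs(truth)))


# Complete toy data, so the hidden values are known and can be scored.
X = np.arange(1, 61, dtype=float).reshape(20, 3)
X_missing, mask = mask_at_random(X, fraction=0.2)
X_imputed = mean_impute(X_missing)
print(relative_error(X, X_imputed, mask))
```

Any other imputer can be scored the same way by swapping it in for `mean_impute`, which keeps the comparison between methods on identical hidden entries.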
Conclusion: Existing methods use the entire dataset to replace missing values, which often leads to the inclusion of dissimilar records and consequently produces inaccurate results. The proposed algorithm addresses this issue by employing clustering to identify similar records and estimate missing data based on records within the same cluster. Additionally, the algorithm incorporates outlier removal, determination of the optimal number of clusters, and other refinements to ensure that abnormal data do not influence the estimation of missing values. Attributes (columns) with more than one-third missing data are removed to prevent unreliable data from affecting the estimation process. Furthermore, regression models within clusters consider related attributes when estimating missing values. The integration of a genetic algorithm, which combines mean, median, mode, and regression models, results in more reliable and accurate outcomes.
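Two of the refinements named in the Conclusion can be illustrated with a short sketch: dropping attributes with more than one-third missing data, and using a genetic algorithm to combine mean, median, mode, and regression estimates. The abstract does not describe the chromosome encoding, the fitness function, or the genetic operators, so everything below is an assumption: each individual is a vector of four non-negative weights summing to 1, fitness is the mean absolute error of the weighted blend on entries whose true values are known, and the operators are truncation selection, uniform crossover, and Gaussian mutation. The function names are hypothetical.

```python
import numpy as np


def drop_sparse_columns(data, max_missing=1 / 3):
    """Drop columns whose share of NaN entries exceeds max_missing."""
    keep = np.isnan(data).mean(axis=0) <= max_missing
    return data[:, keep], keep


def normalize(w):
    """Clip weights to be non-negative and rescale them to sum to 1."""
    w = np.clip(w, 0.0, None)
    total = w.sum()
    return w / total if total > 0 else np.full(len(w), 1.0 / len(w))


def fitness(w, candidates, truth):
    """Mean absolute error of the weighted blend (lower is better)."""
    return float(np.mean(np.abs(candidates @ w - truth)))


def evolve_weights(candidates, truth, pop_size=30, generations=60, seed=0):
    """Evolve blend weights over the candidate estimates (assumed GA design)."""
    rng = np.random.default_rng(seed)
    n = candidates.shape[1]
    # Seed the population with each pure method, then random blends.
    pop = list(np.eye(n)) + [normalize(rng.random(n)) for _ in range(pop_size - n)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, candidates, truth))
        parents = pop[: pop_size // 2]  # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            pick = rng.random(n) < 0.5  # uniform crossover
            child = np.where(pick, parents[i], parents[j])
            child = child + rng.normal(0.0, 0.05, n)  # Gaussian mutation
            children.append(normalize(child))
        pop = parents + children
    return min(pop, key=lambda w: fitness(w, candidates, truth))


# Columns: mean, median, mode, and regression estimates for four known entries.
candidates = np.array([
    [5.0, 4.0, 4.0, 6.2],
    [5.0, 4.0, 4.0, 2.9],
    [5.0, 4.0, 4.0, 7.1],
    [5.0, 4.0, 4.0, 3.8],
])
truth = np.array([6.0, 3.0, 7.0, 4.0])
best = evolve_weights(candidates, truth)
print(best, fitness(best, candidates, truth))
```

Because the initial population contains each pure method and the best individuals always survive, the evolved blend is never worse on the known entries than the best single method, which is one plausible reason for combining them this way.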
