Document Type : Original Article
Authors
1 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Bozorgmehr University of Qaenat, Qaenat, Iran.
2 Assistant Professor, ICT Research Institute (ITRC), Tehran, Iran.
10.22091/stim.2024.10668.2092
Abstract
Purpose: The statistical literature uses several largely synonymous terms for the concept of missing data, including "missing data" and "incomplete data". In statistics, missing data (or missing values) occur when no value is stored for a variable in an observation. Data are often missing in economic, sociological, and political science research because government or private entities may report sensitive information incompletely, study participants may withdraw or decline to answer some questions, or researchers, technicians, and data collectors may make mistakes. Missing data can distort the distribution of a variable and cause models to overfit or underfit. They can also bias the data set, which biases the statistical analysis and ultimately makes it difficult to draw useful conclusions from the collected data. In the past, the most common remedy was to delete records with missing values, which lowered data quality and, as a result, biased the analysis. Today, advances in statistical methods make it possible to replace or estimate missing values with appropriate values before modeling incomplete data. Given the importance of handling missing data, this research aims to provide a method that improves the accuracy of information and knowledge retrieval from data sets with missing values.
Method: The proposed method uses data mining techniques, including clustering and regression, together with heuristic algorithms, including a genetic algorithm. Existing methods use the whole data set to recover missing values, so records that are not similar to the record with the missing value are also taken into account, which leads to wrong results. The proposed algorithm uses clustering to identify similar records. Then, for each cluster, the amount of missing data in each attribute (column) of the data set is calculated. Depending on that amount, either a regression model or a genetic algorithm is used to recover the missing values.
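The cluster-wise idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that cluster labels are already available, that a single predictor column is used for the regression, and that regression is chosen when the missing share in a cluster is at or below a hypothetical `threshold`. The abstract does not state which way the decision goes or what the cut-off is. The cluster mean of the observed values stands in here for the genetic algorithm step, which is not reproduced.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # constant predictor: fall back to a flat line
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def impute_by_cluster(rows, labels, target, predictor, threshold=0.3):
    """Fill None values in column `target`, one cluster at a time.

    `threshold` and the regression-versus-mean rule are assumptions
    made for this sketch; the mean replaces the genetic algorithm.
    """
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)

    out = [list(r) for r in rows]
    for idx in groups.values():
        members = [rows[i] for i in idx]
        observed = [r for r in members if r[target] is not None]
        if not observed:
            continue  # nothing to learn from in this cluster
        pairs = [r for r in observed if r[predictor] is not None]
        missing_share = 1 - len(observed) / len(members)
        use_regression = missing_share <= threshold and len(pairs) >= 2
        if use_regression:
            a, b = fit_line([r[predictor] for r in pairs],
                            [r[target] for r in pairs])
        fallback = mean(r[target] for r in observed)
        for i in idx:
            if rows[i][target] is not None:
                continue
            x = rows[i][predictor]
            if use_regression and x is not None:
                out[i][target] = a + b * x
            else:
                out[i][target] = fallback
    return out
```

Because each estimate is computed only from records carrying the same cluster label, dissimilar records in other clusters cannot influence it, which is the point the method description makes.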
Findings: Applying the proposed method to a data set with randomly missing values gave an error rate of 27%, compared with 56.5% for imputation by mean, median, and mode, 34.6% for regression, and 42.1% for the support vector machine (SVM) method. The proposed algorithm therefore recovers missing data more accurately than these methods.
Conclusion: Existing methods use the entire data set to recover missing values, so records that are not similar to the record with the missing value are taken into account, which leads to wrong results. The proposed algorithm uses clustering to identify similar records and calculates missing values from the similar records in the same cluster. It also removes outliers and determines the optimal number of clusters, among other steps, so that abnormal data do not affect the calculation of missing values. For each cluster, attributes (columns) with more than one third of their values missing are removed, which keeps unreliable data from influencing the calculation. The regression model is fitted within the cluster, so related fields in other attributes (columns) are taken into account. Finally, the genetic algorithm combines the mean, median, mode, and regression model, which leads to more acceptable results.
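The one-third rule for attributes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function name `reliable_columns`, the representation of missing values as `None`, and the precomputed cluster labels are all hypothetical. A column with exactly one third of its values missing is kept, since the text removes only columns with more than one third missing.

```python
from collections import defaultdict


def reliable_columns(rows, labels, max_missing=1 / 3):
    """Return, per cluster, the column indices kept for imputation.

    A column is dropped within a cluster when more than `max_missing`
    of that cluster's values in the column are None.
    """
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)

    keep = {}
    for label, members in groups.items():
        n_cols = len(members[0])
        keep[label] = [
            c for c in range(n_cols)
            if sum(r[c] is None for r in members) / len(members) <= max_missing
        ]
    return keep
```

Note that the decision is made separately for each cluster, so a column that is too sparse in one cluster can still be used in another.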
Keywords: Information Retrieval, Missing Data, Data Mining, Genetic Algorithm, Clustering, Regression Model.