Article type: Research article
Authors
1 Assistant Professor, Department of Computer Engineering, Faculty of Engineering, Bozorgmehr University of Qaenat, Qaenat, Iran.
2 Assistant Professor, ICT Research Institute, Tehran, Iran.
Abstract [English]
Purpose: The statistical literature uses several terms, often interchangeably, for the concept of missing data, including missing data, lost data, incomplete data, and nonresponse data. In statistics, missing data or missing values occur when no value is recorded for a variable in a given observation. Data are often lost in economic, sociological, and political science research because government or private entities provide incomplete reports, because study participants withdraw or decline to answer certain questions, or because researchers, technicians, and data collectors make errors. Missing data can distort the distribution of variables, potentially causing model overfitting or underfitting. They can also introduce bias into a dataset, skewing statistical analyses, leading to incorrect model analysis, and making it difficult to draw meaningful conclusions from the collected data. Traditionally, the most common way to handle missing data was simply to remove the affected records, which lowered dataset quality and biased the resulting analyses and findings. Today, with the emergence of powerful statistical methods, missing values in incomplete datasets can be imputed or estimated before modeling. Given the importance of managing missing data, the present study proposes a method for improving the accuracy of information and knowledge retrieval from datasets with missing data.
Method: The proposed method employs data mining techniques, including clustering and regression, as well as heuristic algorithms such as genetic algorithms. In existing methods, the entire dataset is used to impute missing values. This approach often includes records that are dissimilar to the one with missing data, leading to inaccurate results. In the proposed algorithm, clustering is used to identify similar records. Then, for each cluster, the proportion of missing data for each attribute (column) is calculated. Based on this proportion, either a regression model or a genetic algorithm is applied to recover the missing data.
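The cluster-then-impute idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the clustering algorithm, the distance measure, or the proportion threshold that selects between regression and the genetic algorithm. The sketch therefore assumes a plain k-means with a partial distance over observed attributes, and it substitutes a simple within-cluster mean for the regression and genetic-algorithm step. All function names are hypothetical.

```python
import math
import random
from statistics import mean

def partial_distance(a, b):
    """Euclidean distance over attributes observed in both records, rescaled to full width."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no attribute in common, so the records cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) * len(a) / len(shared))

def column_means(rows, width):
    """Mean of the observed values in each column (None if a column has no observed value)."""
    means = []
    for j in range(width):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(mean(observed) if observed else None)
    return means

def cluster_records(rows, k, iterations=20, seed=0):
    """Assumed clustering step: plain k-means that tolerates missing values (None)."""
    rng = random.Random(seed)
    width = len(rows[0])
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: partial_distance(r, centroids[c]))
                  for r in rows]
        for c in range(k):
            members = [r for r, label in zip(rows, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = column_means(members, width)
    return labels

def impute_within_clusters(rows, labels):
    """Fill each missing value from records in the same cluster only.

    Stand-in estimator: the within-cluster mean, falling back to the overall
    mean when the cluster has no observed value for that attribute.
    """
    width = len(rows[0])
    overall = column_means(rows, width)
    filled = [list(r) for r in rows]
    for c in set(labels):
        members = [i for i, label in enumerate(labels) if label == c]
        local = column_means([rows[i] for i in members], width)
        for i in members:
            for j in range(width):
                if filled[i][j] is None:
                    filled[i][j] = local[j] if local[j] is not None else overall[j]
    return filled

# Two well-separated groups of records, each with one missing value.
rows = [[1.0, 2.0], [1.2, None], [0.8, 2.2],
        [10.0, 20.0], [None, 19.0], [10.4, 21.0]]
labels = cluster_records(rows, k=2)
filled = impute_within_clusters(rows, labels)
```

In this toy dataset, a global mean would fill the missing second attribute of the second record with a value pulled toward the distant group, whereas the within-cluster estimate uses only the two nearby records. That is the motivation the abstract gives for clustering before imputation.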
Findings: The implementation of the proposed method on a dataset with randomly missing data showed that the error rate of the algorithm was 27%. This rate was significantly lower than those of other methods: mean, median, and mode substitution methods (56.5%), the regression method (34.6%), and the support vector machine (SVM) method (42.1%). These results demonstrate higher accuracy in imputing missing data.
Conclusion: Existing methods use the entire dataset to replace missing values, which often leads to the inclusion of dissimilar records and consequently produces inaccurate results. The proposed algorithm addresses this issue by employing clustering to identify similar records and estimate missing data based on records within the same cluster. Additionally, the algorithm incorporates outlier removal, determination of the optimal number of clusters, and other refinements to ensure that abnormal data do not influence the estimation of missing values. Attributes (columns) with more than one-third missing data are removed to prevent unreliable data from affecting the estimation process. Furthermore, regression models within clusters consider related attributes when estimating missing values. The integration of a genetic algorithm, which combines mean, median, mode, and regression models, results in more reliable and accurate outcomes.
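The one-third rule for discarding attributes can be sketched as follows. This is an illustrative assumption rather than the authors' code: the abstract does not say whether the proportion is computed over the whole dataset or per cluster, so the hypothetical function below simply operates on whatever set of records it is given.

```python
from fractions import Fraction

def drop_sparse_columns(rows, max_missing=Fraction(1, 3)):
    """Remove attributes whose share of missing values (None) exceeds max_missing.

    Returns the reduced records and the indices of the columns that were kept.
    Exact fractions avoid floating-point surprises at the one-third boundary.
    """
    n = len(rows)
    width = len(rows[0])
    keep = []
    for j in range(width):
        missing = sum(1 for r in rows if r[j] is None)
        if Fraction(missing, n) <= max_missing:
            keep.append(j)
    return [[r[j] for j in keep] for r in rows], keep

records = [[1, None, 3],
           [2, None, None],
           [3, 5, 6]]
reduced, kept = drop_sparse_columns(records)
```

Here the middle attribute is missing in two of three records and is removed, while the last attribute is missing in exactly one-third of the records and is kept, since the rule removes only attributes with more than one-third missing.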