Enhanced Cross-Validation Methods Leveraging Clustering Techniques

dc.contributor.author: Yucelbas, Cuneyt
dc.contributor.author: Yucelbas, Sule
dc.date.accessioned: 2025-03-17T12:25:24Z
dc.date.available: 2025-03-17T12:25:24Z
dc.date.issued: 2023
dc.department: Tarsus Üniversitesi
dc.description.abstract: The efficacy of emerging and established learning algorithms warrants scrutiny. This examination is intrinsically linked to classification performance results. The primary determinant influencing these results is the distribution of the training and test data presented to the algorithms. Existing literature frequently employs standard and stratified (S-CV and St-CV) k-fold cross-validation methods to create training and test data for classification tasks. In the S-CV method, training and test groups are formed via random data distribution, potentially undermining the reliability of performance results calculated post-classification. This study introduces innovative cross-validation strategies based on k-means and k-medoids clustering to address this challenge. These strategies are designed to tackle issues emerging from random data distribution. The proposed methods autonomously determine the number of clusters and folds. Initially, the number of clusters is established via Silhouette analysis, followed by identifying the number of folds according to the data volume within these clusters. An additional aim of this study is to minimize the standard deviation (Std) values between the folds. Particularly when classifying large datasets, the minimized Std negates the need to present each fold to the system, thereby reducing time expenditure and system congestion/fatigue. Analyses were carried out on several large-scale datasets to demonstrate the superiority of these new CV methods over the S-CV and St-CV techniques. The findings revealed superior performance results for the novel strategies. For instance, while the minimum Std value between folds was 0.022, the maximum accuracy rate achieved was approximately 100%. Owing to the proposed methods, the discrepancy between the performance outputs of each fold and the overall average is statistically minimized. The randomness in creating the training/test groups, previously identified as a factor contributing negatively to this discrepancy, has been significantly reduced. Hence, this study is anticipated to fill a critical and substantial gap in the existing literature concerning the formation of training/test groups in various classification problems and the statistical accuracy of performance results.
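The abstract describes a clustering-driven alternative to random k-fold splitting: choose the number of clusters with Silhouette analysis, then derive the folds from the data volume inside those clusters. The snippet below is a minimal, illustrative sketch of that idea in Python, not the authors' published algorithm: it relies on scikit-learn's KMeans, silhouette_score, and StratifiedKFold, derives the fold count from the smallest cluster (an assumed rule), and stratifies the folds on cluster labels so each fold mirrors the cluster distribution of the full dataset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import StratifiedKFold


def cluster_based_cv_splits(X, candidate_ks=range(2, 11), random_state=0):
    """Yield (train_idx, test_idx) pairs for folds stratified by k-means cluster label.

    Illustrative sketch only: candidate_ks and the fold-count rule are assumptions,
    not the rule published in the paper.
    """
    # 1) Pick the number of clusters with the best Silhouette score.
    best_score, best_labels = -1.0, None
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels

    # 2) Derive the number of folds from the data volume in the clusters
    #    (here: bounded by the smallest cluster so every fold sees each cluster).
    smallest = np.bincount(best_labels).min()
    if smallest < 2:
        raise ValueError("each cluster needs at least 2 samples for stratified folding")
    n_folds = int(min(10, smallest))

    # 3) Stratify the folds on cluster membership instead of class labels,
    #    so the train/test groups are not formed by purely random distribution.
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    yield from skf.split(X, best_labels)

A caller could then train and evaluate a classifier fold by fold (for train_idx, test_idx in cluster_based_cv_splits(X): ...) and compare the spread (Std) of per-fold accuracies against standard S-CV or St-CV splits, which is the comparison the abstract reports.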
dc.identifier.doi: 10.18280/ts.400626
dc.identifier.endpage: 2660
dc.identifier.issn: 0765-0019
dc.identifier.issn: 1958-5608
dc.identifier.issue: 6
dc.identifier.scopusquality: N/A
dc.identifier.startpage: 2649
dc.identifier.uri: https://doi.org/10.18280/ts.400626
dc.identifier.uri: https://hdl.handle.net/20.500.13099/1665
dc.identifier.volume: 40
dc.identifier.wos: WOS:001137494800014
dc.identifier.wosquality: Q4
dc.indekslendigikaynak: Web of Science
dc.language.iso: en
dc.publisher: Int Information & Engineering Technology Assoc
dc.relation.ispartof: Traitement Du Signal
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Academic Staff
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_WOS_20250316
dc.subject: large-scale classification
dc.subject: cross-validation methodology
dc.subject: k-means
dc.subject: k-medoids
dc.subject: clustering techniques
dc.title: Enhanced Cross-Validation Methods Leveraging Clustering Techniques
dc.type: Article

Files