GoBug: A Novel Defect Dataset for the Go Programming Language

Yılmaz, Emre; Oktaş, Recai

doi:10.1109/access.2026.3682160

Yılmaz E. C., Oktaş R.

IEEE ACCESS, vol. 14, pp. 55248-55266, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 14
Publication Date: 2026
DOI: 10.1109/access.2026.3682160
Journal Name: IEEE ACCESS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 55248-55266
Affiliated with Ondokuz Mayıs University: Yes

Abstract

The Go ecosystem has become a major platform for cloud-native and infrastructure software, yet large-scale defect-prediction benchmarks remain heavily focused on Java, C, and C++. To fill this gap, we present GoBug, a frozen multi-granularity defect dataset for the Go programming language, providing labeled instances at the commit, file, and method levels across 16 widely used open-source projects. The released corpus contains 258,946 labeled instances in total, comprising 28,053 commit-level, 46,961 file-level, and 183,932 method-level samples, and covers bug-labeled pull requests created between May 2014 and June 2025. GoBug combines process metrics, static code metrics, and AST-derived Go-aware features to support controlled studies on granularity, temporal evaluation, and feature families under a documented collection and evaluation workflow. We establish a broad baseline by evaluating 11 machine learning algorithms with standard resampling-based class imbalance treatments and report aggregated project-level comparisons under strict chronological holdout testing to prevent data leakage. Our results show that commit-level (Just-In-Time) prediction yields the highest observed aggregate MCC among the evaluated granularity settings, underscoring the central role of process metrics relative to static code features in this setting. Furthermore, we find that Go-specific metrics provide limited incremental value in temporal evaluations, while boosting algorithms (e.g., CatBoost, XGBoost) remain competitive in representative default-versus-tuned comparisons.
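
The abstract names two evaluation ingredients, a chronological holdout and the Matthews correlation coefficient (MCC). They can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Instance` record, its fields, and the 80/20 split ratio are hypothetical and chosen only for illustration. The MCC formula itself is the standard one, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Instance:
    # Hypothetical record layout; the released dataset may use other fields.
    timestamp: int  # e.g. commit time as a Unix epoch
    label: int      # 1 = defective, 0 = clean


def chronological_split(instances, train_fraction=0.8):
    """Train on the oldest instances and test on the newest ones.

    Sorting by time before cutting ensures that no test instance
    predates a training instance, which is what guards against
    leaking future information into training.
    The 0.8 default is an assumption, not a value from the paper.
    """
    ordered = sorted(instances, key=lambda inst: inst.timestamp)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC is a common choice for imbalanced defect data because it uses all four cells of the confusion matrix, so a classifier that predicts only the majority class scores 0 rather than a misleadingly high accuracy.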