K-means Clustering

1. Intro & related work

(a) K-means Overview

GPU k-means clustering: given a set of n data points in a d-dimensional vector space, partition them into k groups such that each point is assigned to the cluster whose centroid minimizes the Euclidean distance. The algorithm described by MacQueen[1] and Lloyd[2] alternates two steps: an assignment step (nearest-centroid search for every point) and an update step (recomputing each centroid as the mean of its assigned points). In Rodinia, k-means appears as a data-mining / embarrassingly parallel workload. Its computational pattern - pairwise distances, per-point reduction to a minimum, and per-cluster accumulation - generalizes to many other data-analysis primitives including nearest-neighbor search and vector quantization.

(b) Original paper and its implementation

Algorithm: MacQueen[1] coined the name "k-means" and described the iterative assignment-update scheme we still use today (equivalent to the formulation by Lloyd[2] often cited interchangeably).

Original implementation: The Rodinia k-means source derives directly from Wei-keng Liao's Parallel K-Means Data Clustering package at Northwestern University[3]. Copyright headers and author attributions in cluster.c, kmeans_clustering.c, and kmeans.h all credit Wei-keng Liao / Jay Pisharath / Northwestern ECE; the Northwestern release provides the sequential, OpenMP, and MPI variants. Rodinia later wrapped this core with a CUDA port and added a GPU kernel for the assignment step; the Rodinia CUDA adaptation is described in Che et al.[4].

Rodinia CUDA implementation: The CUDA k-means in the suite uses a one-thread-per-point design for the assignment step with the data transposed to a column-major (SoA) layout for coalesced access. Cluster centers are placed in __constant__ memory (limited to 32 clusters × 34 features). The update step (centroid accumulation and division) is performed on the CPU in a sequential loop after each iteration. Input is a single text file with one object per line - the parser treats the first whitespace-separated token on each line as an object ID and discards it, then reads the remaining N tokens as feature values; a binary format (header of npoints, nfeatures, then raw floats) is also supported via the -b flag. Output is the best cluster count, final centroids (optional), and optional RMSE.

Suite context: Rodinia[5][6] (rodinia.cs.virginia.edu). Wiki lineage: K-means (archived).

2. Datasets update

(a) Legacy datasets (old)

The Rodinia distribution[5] ships with small sample inputs: a 100-object toy file and the kdd_cup network-intrusion sample (KDD Cup 1999[10]) with 34 features. These are quick for smoke tests but are far too small to saturate a modern GPU and they fix the dimensionality at 34, so they are not sufficient for benchmarking the assignment step across varying d (register pressure, shared-memory tile size) or the update step across varying k (atomic collision frequency).

(b) Synthetic datasets (datagen)

The Rodinia datagen utility generates uniform random data with a numeric ID prefix per line (which the kmeans parser discards). It takes nObjects, optional nFeatures (default 34), and an optional -f switch for floats in [0.0, 1.0] versus the default integers in [0, 255]. The value range is a pure magnitude difference - both are parsed by atof() and stored as float, so GPU kernel performance is identical between the two modes.

We provide three generator scripts that sweep problem size at fixed dimensionality, covering a broad parameter space for evaluating the optimized implementation:

(c) Real-world datasets

3. Workloads updated

(a) Original version drawbacks

The Rodinia CUDA k-means[4] (built on the Northwestern parallel k-means[3]) has several design choices that were reasonable for early-era GPUs and small benchmark inputs but limit it on modern architectures and realistic datasets:

See the Rodinia K-means Report (in preparation)[17] for a detailed breakdown.

(b) Related Work

Suite context comes from Che et al.[5][6][4] and the upstream Northwestern parallel k-means[3]. For the optimization ideas applied in kmeans_opt, the primary source is Kruliš and Kratochvíl[7], who provide a comprehensive analysis of CUDA k-means and dissect both the assignment step (shared-memory tiling and register caching strategies) and the update step (atomic-reduction variants including their recommended AtomicFine layout). They publish an open-source reference implementation (github.com/krulis-martin/cuda-kmeans) that outperforms prior state-of-the-art by 16×-1000× across a range of (n, k, d) configurations on GTX 980 and V100 SXM2; our kmeans_opt adopts its FusedFixed assignment strategy, the "Fixed" register caching pattern, and the atomic centroid-update design. Lutz et al.[8] independently motivate single-pass fused kernels to eliminate double loading of the dataset into GPU caches - the direct justification for our fused assignment + update kernel. Nelson and Palmieri[9] analyze synchronization strategies and atomic-instruction tradeoffs on GPUs, supporting our choice of atomicAdd-based centroid reduction over lock- or barrier-based alternatives.

(c) Algorithm updates

(d) Expected behavior

(e) Future algorithmic and CUDA features

Candidate directions not yet implemented in kmeans_opt:

4. Parameters used and Makefile

(a) Old version parameters

CLI: ./kmeans -i <input_file> [-m max_k] [-n min_k] [-t threshold] [-l nloops] [-b] [-r] [-o]. Defaults are k=5, threshold=0.001, nloops=1. Use -b for binary input, -r to compute RMSE, -o to print centroid coordinates. Internal constants are 16×16 = 256 threads per block and a __constant__ cluster buffer sized for k <= 32, d <= 34.

(b) Optimized run parameters

CLI (identical to legacy for drop-in compatibility): ./kmeans_opt -i <input_file> [options]. Makefile sets GPU (name tag) and ARCH (e.g. sm_80, sm_90). Additional knobs control the register-caching width and block sizes. Implementation uses shared-memory tiled assignment with register caching, a fused or separate update path, and a GPU-side divide kernel (see cuda/kmeans/kmeans_opt/ in the public repository).

(c) Makefile changes and notes

The optimized k-means Makefile adds explicit GPU selection, modern compile flags, and tuning knobs for the caching strategies described in Kruliš and Kratochvíl[7].

Example runs

# Legacy k-means
cd cuda/kmeans/kmeans_3.1
make
./kmeans -o -i ../../../data/kmeans/kdd_cup

# Optimized k-means (example: A100 with paper-optimal R_CLUSTERS=32)
cd ../kmeans_opt
make GPU=a100 ARCH=sm_80 rclusters=32
./kmeans_opt -i ../../../data/kmeans/kdd_cup -m 64 -n 64 -r

# Optimized k-means with separate kernels (for profiling)
make clean && make GPU=a100 ARCH=sm_80 separate=1

# Correctness check between the two versions
python3 verify.py ../kmeans_3.1/kmeans ./kmeans_opt ../../../data/kmeans/100 -m 5 -n 5 -r

5. Data analysis

Full methodology, hardware setup, and plots: Rodinia_v4_0_Kmeans_Report.pdf

View prelim results (A100)
Results figures and discussion: To Be Documented.

6. Conclusions & code

Conclusions: To Be Documented.
View reproducible commands
Reproducible commands: To Be Documented.

References

  1. J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” in Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281-297. (The paper that introduces the name "k-means"; cited by the Northwestern parallel k-means[3] as the algorithmic basis.)
  2. S. P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. on Information Theory, vol. 28, no. 2, pp. 129-137, 1982. DOI: 10.1109/TIT.1982.1056489.
  3. W.-k. Liao, “Parallel K-Means Data Clustering,” Dept. of Electrical Engineering and Computer Science, Northwestern University, 2005. Software and documentation: users.eecs.northwestern.edu/~wkliao/Kmeans. (Sequential / OpenMP / MPI implementation upstream of the Rodinia CUDA port.)
  4. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, “A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1370-1380, 2008. (Rodinia's CUDA adaptation of the Northwestern k-means.)
  5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2009, pp. 44-54.
  6. S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, “A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2010.
  7. M. Kruliš and M. Kratochvíl, “Detailed Analysis and Optimization of CUDA K-means Algorithm,” in Proc. 49th Int. Conf. on Parallel Processing (ICPP), 2020. DOI: 10.1145/3404397.3404426. Source: github.com/krulis-martin/cuda-kmeans.
  8. C. Lutz, S. Breß, T. Rabl, S. Zeuch, and V. Markl, “Efficient and Scalable k-Means on GPUs,” Datenbank-Spektrum, vol. 18, no. 3, pp. 157-169, 2018.
  9. J. Nelson and R. Palmieri, “Don't Forget About Synchronization! A Case Study of K-Means on GPU,” in Proc. 10th Int. Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), 2019, pp. 11-20.
  10. UCI Machine Learning Repository, “KDD Cup 1999 Data,” kdd.ics.uci.edu/databases/kddcup99.
  11. J. A. Blackard and D. J. Dean, “Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables,” Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131-151, 1999. Dataset: archive.ics.uci.edu/dataset/31/covertype.
  12. P. Baldi, P. Sadowski, and D. Whiteson, “Searching for Exotic Particles in High-Energy Physics with Deep Learning,” Nature Communications, vol. 5, no. 1, 2014. Dataset: archive.ics.uci.edu/dataset/280/higgs.
  13. U.S. Census Bureau, “US Census Data (1990),” UCI Machine Learning Repository. archive.ics.uci.edu/dataset/116/us+census+data+1990.
  14. J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543. Data: nlp.stanford.edu/projects/glove.
  15. D. Arthur and S. Vassilvitskii, “k-means++: The Advantages of Careful Seeding,” in Proc. 18th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA), 2007, pp. 1027-1035. (Future-work candidate for better initialization.)
  16. H. Jégou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011. ANN benchmarks: corpus-texmex.irisa.fr.
  17. H. Nguyen, “Modernizing K-means Clustering in Rodinia v4.0 for Contemporary GPUs,” Capstone / graduation report (in preparation), Dept. of Computer Science, University of Virginia, 2026 (advisor: K. Skadron).