K-means Clustering

Rodinia Website • Rodinia v4.0 K-means (GitHub) • Rodinia K-means Dataset

1. Intro & related work

(a) K-means Overview

GPU k-means clustering: given a set of n data points in a d-dimensional vector space, partition them into k groups such that each point is assigned to the cluster whose centroid minimizes the Euclidean distance. The algorithm described by MacQueen^[1] and Lloyd^[2] alternates two steps: an assignment step (nearest-centroid search for every point) and an update step (recomputing each centroid as the mean of its assigned points). In Rodinia, k-means appears as a data-mining / embarrassingly parallel workload. Its computational pattern - pairwise distances, per-point reduction to a minimum, and per-cluster accumulation - generalizes to many other data-analysis primitives including nearest-neighbor search and vector quantization.

(b) Original paper and its implementation

Algorithm: MacQueen^[1] coined the name "k-means" and described the iterative assignment-update scheme we still use today (equivalent to the formulation by Lloyd^[2] often cited interchangeably).

Original implementation: The Rodinia k-means source derives directly from Wei-keng Liao's Parallel K-Means Data Clustering package at Northwestern University^[3]. Copyright headers and author attributions in cluster.c, kmeans_clustering.c, and kmeans.h all credit Wei-keng Liao / Jay Pisharath / Northwestern ECE; the Northwestern release provides the sequential, OpenMP, and MPI variants. Rodinia later wrapped this core with a CUDA port and added a GPU kernel for the assignment step; the Rodinia CUDA adaptation is described in Che et al.^[4].

Rodinia CUDA implementation: The CUDA k-means in the suite uses a one-thread-per-point design for the assignment step with the data transposed to a column-major (SoA) layout for coalesced access. Cluster centers are placed in __constant__ memory (limited to 32 clusters × 34 features). The update step (centroid accumulation and division) is performed on the CPU in a sequential loop after each iteration. Input is a single text file with one object per line - the parser treats the first whitespace-separated token on each line as an object ID and discards it, then reads the remaining N tokens as feature values; a binary format (header of npoints, nfeatures, then raw floats) is also supported via the -b flag. Output is the best cluster count, final centroids (optional), and optional RMSE.

Suite context: Rodinia^[5]^[6] (rodinia.cs.virginia.edu). Wiki lineage: K-means (archived).

2. Datasets update

(a) Legacy datasets (old)

The Rodinia distribution^[5] ships with small sample inputs: a 100-object toy file and the kdd_cup network-intrusion sample (KDD Cup 1999^[10]) with 34 features. These are quick for smoke tests but are far too small to saturate a modern GPU and they fix the dimensionality at 34, so they are not sufficient for benchmarking the assignment step across varying d (register pressure, shared-memory tile size) or the update step across varying k (atomic collision frequency).

(b) Synthetic datasets (datagen)

The Rodinia datagen utility generates uniform random data with a numeric ID prefix per line (which the kmeans parser discards). It takes nObjects, optional nFeatures (default 34), and an optional -f switch for floats in [0.0, 1.0] versus the default integers in [0, 255]. The value range is a pure magnitude difference - both are parsed by atof() and stored as float, so GPU kernel performance is identical between the two modes.

We provide three generator scripts that sweep problem size at fixed dimensionality, covering a broad parameter space for evaluating the optimized implementation:

gen_d32.sh: d=32 at n = 100K, 500K, 1M, 5M (12 MB to 610 MB).
gen_d128.sh: d=128 at n = 100K, 500K, 1M, 5M (49 MB to 2.4 GB).
gen_d1024.sh: d=1024 at n = 100K, 500K, 1M, 5M (390 MB to 19.5 GB). The 5M case requires ~24 GB GPU memory.

(c) Real-world datasets

KDD Cup 1999^[10] - Network intrusion records with 34 numeric features. The legacy kdd_cup sample ships with Rodinia.
Covertype^[11] - US Forest Service cartographic variables, 581,012 samples × 54 features. Converted via convert_covtype.py.
HIGGS^[12] - Particle-physics Monte-Carlo simulation, 11,000,000 samples × 28 features (after removing the class label). Converted via convert_higgs.py.
USCensus1990^[13] - US census microdata, 2,458,285 records × 68 features (after removing caseid). Converted via convert_uscensus.py.
GloVe word embeddings^[14] - Pre-trained 50d/100d/200d/300d vectors over 400K to 2.2M words. Works directly with k-means (the word token is naturally consumed by the parser as the ID column); convert_glove.py is optional and produces a numeric-ID variant plus a word-mapping file.
SIFT / GIST^[16] - ANN benchmark vectors (SIFT 128d, GIST 960d) in .fvecs binary format. Converted via convert_fvecs.py with optional --binary output for the Rodinia binary input path.

3. Workloads updated

(a) Original version drawbacks

The Rodinia CUDA k-means^[4] (built on the Northwestern parallel k-means^[3]) has several design choices that were reasonable for early-era GPUs and small benchmark inputs but limit it on modern architectures and realistic datasets:

CPU-side reduction: centroid accumulation and division happen on the CPU in a sequential O(n · d) loop per iteration, which becomes the bottleneck when the GPU finishes the assignment step quickly.
Hardcoded constant-memory size: __constant__ float c_clusters[32*34] caps the problem at 32 clusters × 34 features; larger runs silently break with no fallback to main memory.
Fixed-size line buffer: the ASCII parser uses char line[1024], which overflows at d > ~100 features (SIFT 128d already crashes the stock code). We enlarged the buffer and made fgets use sizeof(line).
RMSE type bug: in cluster.c the variable collecting the return value of rms_err() was declared int, silently truncating the fractional RMSE. Fixed to float.
RMSE OpenMP race: the #pragma omp parallel for in rmse.c accumulates sum_euclid without a reduction(+:sum_euclid) clause, so multiple threads race on the shared variable and values are lost. Fixed with a proper reduction clause.
Deprecated APIs: texture references, the cudaMemcpyToSymbol string overload, and cudaThreadSynchronize() are all removed or deprecated in CUDA 10+ and broken by CUDA 12.x, so the stock code no longer compiles.
Hardcoded cudaSetDevice(1): fails on single-GPU systems (typical of laptops and most workstations).
No error checking: no cudaError_t checks on any CUDA API call, making failures silent.
Redundant host-device transfers: cluster centers are copied H-to-D and D-to-H every iteration, even though the GPU just produced them.
Assignment step loads features repeatedly: each thread loads each feature value from global memory once per cluster, for a total of n · k · d global loads per iteration with no register caching.

See the Rodinia K-means Report (in preparation)^[17] for a detailed breakdown.

(b) Related Work

Suite context comes from Che et al.^[5]^[6]^[4] and the upstream Northwestern parallel k-means^[3]. For the optimization ideas applied in kmeans_opt, the primary source is Kruliš and Kratochvíl^[7], who provide a comprehensive analysis of CUDA k-means and dissect both the assignment step (shared-memory tiling and register caching strategies) and the update step (atomic-reduction variants including their recommended AtomicFine layout). They publish an open-source reference implementation (github.com/krulis-martin/cuda-kmeans) that outperforms prior state-of-the-art by 16×-1000× across a range of (n, k, d) configurations on GTX 980 and V100 SXM2; our kmeans_opt adopts its FusedFixed assignment strategy, the "Fixed" register caching pattern, and the atomic centroid-update design. Lutz et al.^[8] independently motivate single-pass fused kernels to eliminate double loading of the dataset into GPU caches - the direct justification for our fused assignment + update kernel. Nelson and Palmieri^[9] analyze synchronization strategies and atomic-instruction tradeoffs on GPUs, supporting our choice of atomicAdd-based centroid reduction over lock- or barrier-based alternatives.

(c) Algorithm updates

Keep the legacy k-means path (cuda/kmeans/kmeans_3.1/) as a backward-compatible baseline, fixed only for CUDA 12.x API changes.
Add an optimized path (cuda/kmeans/kmeans_opt/) based on Kruliš and Kratochvíl^[7]:
- Fused assignment + update kernel: single kernel launch handles nearest-centroid search and atomic accumulation for the centroid update, eliminating the intermediate membership round-trip through global memory.
- Shared-memory tiling of cluster centers: each thread block cooperatively loads a tile of W into shared memory; removes the __constant__ memory cap and supports arbitrary k and d.
- Register caching ("Fixed" strategy): each thread keeps R_CLUSTERS partial distance sums in registers and loads each feature value from memory once per group of R_CLUSTERS clusters (default 16, paper-optimal 32). Reduces global-memory traffic by a factor of R_CLUSTERS.
- GPU-side centroid accumulation via atomicAdd: eliminates the CPU-side sequential reduction loop. A separate-kernel path offers the paper's AtomicFine variant (one thread per feature value, 2D grid) as an alternative.
- GPU-side divide kernel: new centroids are computed on the GPU, so clusters stay on device across iterations - only the convergence counter (delta) is transferred.
Infrastructure: proper CUDA_CHECK error handling on all CUDA API calls; runtime device selection; configurable block sizes for assignment (256, register-heavy) vs. update/divide (1024, simple arithmetic); split allocation so feature data is transposed once even when sweeping k.
Report kernel-only and end-to-end GPU time explicitly for direct comparison with the original.

(d) Expected behavior

Same input + same initial centroids → converge to a very close RMSE in a similar number of iterations. Bit-exact centroid equality is not expected because GPU atomic accumulation is non-deterministic in floating-point summation order.
The optimized version provides the largest speedup when k is moderate-to-large (k >= 64) where register caching amortizes feature loads, and when d <= 128 where cluster tiles fit comfortably in shared memory.
For small problems (n < 100K, k = 5), GPU launch overhead dominates and speedups are modest. For very large k · d where tiles overflow shared memory, performance gracefully degrades as the tile size is reduced.
Correctness can be verified with the verify.py script, which runs both versions with -o -r and matches centroids by nearest-neighbor plus compares RMSE within a tolerance.

(e) Future algorithmic and CUDA features

Candidate directions not yet implemented in kmeans_opt:

Incremental / delta update step: after the first iteration, most points stop changing clusters. If the kernel tracks which points did switch clusters, the centroid sums can be updated incrementally - subtract the switcher from its old cluster sum and add it to the new cluster sum - instead of re-accumulating over all n points every iteration. Stable points can be kept in a shared-memory "hot" buffer so they are loaded once per run rather than per iteration. This is distinct from the triangle-inequality filtering of YinYang-style k-means and targets the update step (where brute-force atomic reduction is already fast) rather than the assignment step.
Persistent kernel / cooperative groups: launch a single long-running kernel that loops until convergence and uses grid-wide synchronization (cudaLaunchCooperativeKernel) instead of relaunching per iteration, eliminating launch overhead for small n.
Tensor-core distance computation: express the n × k distance matrix as ||x||² + ||w||² - 2·x·w^T and use wmma or cuBLAS Sgemm for the dot-product term, especially attractive for high d and large k.
FP16 / BF16 accumulation: reduce memory bandwidth and allow larger k · d tiles to fit in shared memory, at the cost of reduced numerical precision.
AtomicShm with privatized copies: the paper^[7] reports ~2× speedup over plain global atomics by keeping a private copy of W in shared memory and flushing once per block. Not yet applied because our fused kernel currently uses the shared memory budget for cluster-center tiles; a two-phase variant could reuse the same buffer.
k-means++ seeding^[15]: replace the deterministic "first k points" initialization with a variance-weighted seeding that gives better starting centroids and measurably lower iteration counts.

4. Parameters used and Makefile

(a) Old version parameters

CLI: ./kmeans -i <input_file> [-m max_k] [-n min_k] [-t threshold] [-l nloops] [-b] [-r] [-o]. Defaults are k=5, threshold=0.001, nloops=1. Use -b for binary input, -r to compute RMSE, -o to print centroid coordinates. Internal constants are 16×16 = 256 threads per block and a __constant__ cluster buffer sized for k <= 32, d <= 34.

(b) Optimized run parameters

CLI (identical to legacy for drop-in compatibility): ./kmeans_opt -i <input_file> [options]. Makefile sets GPU (name tag) and ARCH (e.g. sm_80, sm_90). Additional knobs control the register-caching width and block sizes. Implementation uses shared-memory tiled assignment with register caching, a fused or separate update path, and a GPU-side divide kernel (see cuda/kmeans/kmeans_opt/ in the public repository).

(c) Makefile changes and notes

The optimized k-means Makefile adds explicit GPU selection, modern compile flags, and tuning knobs for the caching strategies described in Kruliš and Kratochvíl^[7].

New knobs: GPU (output name tag), ARCH (e.g. sm_80 for A100, sm_90 for H100), rclusters=16|32 (register caching width; paper-optimal is 32), separate=1 (use separate assignment + AtomicFine update kernels instead of the fused kernel), blocksize=256 and blocksize_update=1024 (per-kernel block-size overrides), dbg=1 (debug build).
New flags: optimized build settings such as -O3, -lineinfo, -use_fast_math, -std=c++11, and an explicit -arch=$(ARCH).
Notes: match ARCH to your GPU. If rclusters is set too high (e.g. 64) register spilling may reduce occupancy - 16 or 32 is recommended. If the shared-memory tile of cluster centers exceeds sharedMemPerBlock (48 KB on most architectures), the host code automatically reduces tile_k to fit.

Example runs

# Legacy k-means
cd cuda/kmeans/kmeans_3.1
make
./kmeans -o -i ../../../data/kmeans/kdd_cup

# Optimized k-means (example: A100 with paper-optimal R_CLUSTERS=32)
cd ../kmeans_opt
make GPU=a100 ARCH=sm_80 rclusters=32
./kmeans_opt -i ../../../data/kmeans/kdd_cup -m 64 -n 64 -r

# Optimized k-means with separate kernels (for profiling)
make clean && make GPU=a100 ARCH=sm_80 separate=1

# Correctness check between the two versions
python3 verify.py ../kmeans_3.1/kmeans ./kmeans_opt ../../../data/kmeans/100 -m 5 -n 5 -r

5. Data analysis

Full methodology, hardware setup, and plots: Rodinia_v4_0_Kmeans_Report.pdf

View prelim results (A100)

Results figures and discussion: To Be Documented.

6. Conclusions & code

Conclusions: To Be Documented.

View reproducible commands

Reproducible commands: To Be Documented.

References

J. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” in Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281-297. (The paper that introduces the name "k-means"; cited by the Northwestern parallel k-means^[3] as the algorithmic basis.)
S. P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. on Information Theory, vol. 28, no. 2, pp. 129-137, 1982. DOI: 10.1109/TIT.1982.1056489.
W.-k. Liao, “Parallel K-Means Data Clustering,” Dept. of Electrical Engineering and Computer Science, Northwestern University, 2005. Software and documentation: users.eecs.northwestern.edu/~wkliao/Kmeans. (Sequential / OpenMP / MPI implementation upstream of the Rodinia CUDA port.)
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, “A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA,” Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1370-1380, 2008. (Rodinia's CUDA adaptation of the Northwestern k-means.)
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous Computing,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2009, pp. 44-54.
S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, “A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads,” in Proc. IEEE Int. Symp. on Workload Characterization (IISWC), 2010.
M. Kruliš and M. Kratochvíl, “Detailed Analysis and Optimization of CUDA K-means Algorithm,” in Proc. 49th Int. Conf. on Parallel Processing (ICPP), 2020. DOI: 10.1145/3404397.3404426. Source: github.com/krulis-martin/cuda-kmeans.
C. Lutz, S. Breß, T. Rabl, S. Zeuch, and V. Markl, “Efficient and Scalable k-Means on GPUs,” Datenbank-Spektrum, vol. 18, no. 3, pp. 157-169, 2018.
J. Nelson and R. Palmieri, “Don't Forget About Synchronization! A Case Study of K-Means on GPU,” in Proc. 10th Int. Workshop on Programming Models and Applications for Multicores and Manycores (PMAM), 2019, pp. 11-20.
UCI Machine Learning Repository, “KDD Cup 1999 Data,” kdd.ics.uci.edu/databases/kddcup99.
J. A. Blackard and D. J. Dean, “Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables,” Computers and Electronics in Agriculture, vol. 24, no. 3, pp. 131-151, 1999. Dataset: archive.ics.uci.edu/dataset/31/covertype.
P. Baldi, P. Sadowski, and D. Whiteson, “Searching for Exotic Particles in High-Energy Physics with Deep Learning,” Nature Communications, vol. 5, no. 1, 2014. Dataset: archive.ics.uci.edu/dataset/280/higgs.
U.S. Census Bureau, “US Census Data (1990),” UCI Machine Learning Repository. archive.ics.uci.edu/dataset/116/us+census+data+1990.
J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” in Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543. Data: nlp.stanford.edu/projects/glove.
D. Arthur and S. Vassilvitskii, “k-means++: The Advantages of Careful Seeding,” in Proc. 18th Annual ACM-SIAM Symp. on Discrete Algorithms (SODA), 2007, pp. 1027-1035. (Future-work candidate for better initialization.)
H. Jégou, M. Douze, and C. Schmid, “Product Quantization for Nearest Neighbor Search,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, 2011. ANN benchmarks: corpus-texmex.irisa.fr.
H. Nguyen, “Modernizing K-means Clustering in Rodinia v4.0 for Contemporary GPUs,” Capstone / graduation report (in preparation), Dept. of Computer Science, University of Virginia, 2026 (advisor: K. Skadron).