This page describes clustering algorithms in MLlib. The guide for clustering in the RDD-based API also has relevant information about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

k-means is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ method called k-means||.

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.

### Input Columns

Param name | Type(s) | Default | Description
---|---|---|---
featuresCol | Vector | "features" | Feature vector

### Output Columns

Param name | Type(s) | Default | Description
---|---|---|---
predictionCol | Int | "prediction" | Predicted cluster center

**Examples**

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`, and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by `EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**

## Bisecting k-means

Bisecting k-means is a kind of hierarchical clustering using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.

**Examples**

## Gaussian Mixture Model (GMM)

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions, each with its own probability. The `spark.ml` implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base model.

### Input Columns

Param name | Type(s) | Default | Description
---|---|---|---
featuresCol | Vector | "features" | Feature vector

### Output Columns

Param name | Type(s) | Default | Description
---|---|---|---
predictionCol | Int | "prediction" | Predicted cluster center
probabilityCol | Vector | "probability" | Probability of each cluster

**Examples**

## Power Iteration Clustering (PIC)

Power Iteration Clustering (PIC) is a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

`spark.ml`'s `PowerIterationClustering` implementation takes the following parameters:

- `k`: the number of clusters to create
- `initMode`: param for the initialization algorithm
- `maxIter`: param for the maximum number of iterations
- `srcCol`: name of the input column for source vertex IDs
- `dstCol`: name of the input column for destination vertex IDs
- `weightCol`: param for the weight column name

**Examples**