# Improved K-means Clustering Model

No preview image

### 1 collaborator

Tânia Duarte (Author)

### Tags

(This model has yet to be categorized with any tags)
Visible to everyone | Changeable by the author
Model was written in NetLogo 6.0.1 • Viewed 226 times • Downloaded 9 times • Run 0 times

## WHAT IS IT?

Often, when we have data about a large set of objects, we want to identify groups of similar objects within that set. For example, if we have many news articles, we may want to identify groups of articles that all have the same topic. In statistics, this task is called “cluster analysis”, or “clustering”.

This model shows the k-means clustering algorithm. a simple, but often effective approach to clustering. In this model, the k-means clustering algorithm is used to identify clusters of points on a plane. In the general case, you can represent your data objects as vectors of numbers, where each number represents a feature of the object.

The model is initialized by creating a specific number of clusters (NUM-CLUSTERS). In a real-world application, this number of clusters would be unknown. The algorithm is designed to find the clusters, and the user selects a guess, NUM-CENTROIDS, which tells the algorithm, how many clusters to search for. You can search for different numbers of clusters till you find one that best characterizes your data.

The result of the algorithm is a set of NUM-CENTROIDS points, (each point is called a centroid), each of which is located at the average position of a corresponding cluster. The “k” in k-mean clustering refers to the guess at how many centroids to search for.

The purpose of the model is to allow you to see how the number of data points, clusters, and centroids interact, and how the algorithm can often find many different sets of clusters for the same dataset.

## HOW IT WORKS

The algorithm works by finding the average position of the elements in NUM-CENTROIDS different clusters. It starts by simply guessing an average position for each of the NUM-CENTROIDS clusters, and then improves these guesses as it goes on.

When the model is set up, two things happen: First, NUM-DATA-POINTS data points are generated and shown in the view. Second, NUM-CENTROIDS centroids are generated and moved to a randomly selected data point. This randomly selected point is the initial guess for each centroid.

Each tick, the model performs two steps:

1. ASSIGN POINTS: all data points assign themselves to their closest centroid, taking on its color.

2. UPDATE CENTROIDS: all centroids move to the average position of the points assigned to it.

By iterating these steps a few times, the model is usually able to identify clusters in the dataset.

## HOW TO USE IT

First choose how many clusters you want to create and with how many data points, using respectively the NUM-CLUSTERS slider and the NUM-DATA-POINTS slider. Using the NUM-CENTROIDS slider, you can decide how many centroids you will create, which is how many clusters to search for. With real data, you won’t necessarily know how many naturally occurring clusters there are. By making NUM-CLUSTERS and NUM-CENTROIDS different, you can see what the algorithm does when it’s looking for too many or not enough clusters. Finally, click SETUP to generate your dataset and your centroids.

At this point you can choose to call each of the two steps manually using the ASSIGN POINTS, and UPDATE CENTROIDS buttons. This lets you see exactly how each of the steps work.

Alternatively you can press the FIND CLUSTERS (GO) button.

The model stops when the points assigned to the centroids don’t change. When this happens, we say that the centroids have “converged”.

RESET CENTROIDS sets the position of the centroids to random points in your data set, so that you can see if the clustering algorithm finds different sets of clusters depending on its initial guesses.

## THINGS TO NOTICE

You will often get very different results, depending on the initial guesses for the centroids’ location. Try and explore each data set several times, both by resetting the centroids, and by changing the number of centroids you place in the dataset.

## THINGS TO TRY

K-means clustering won't necessarily find the best solution. In fact, there are many solutions that it can converge to. Each of these solutions will be locally optimal. In other words, you can't find a better solution by moving the centroids by a small amount. However, better solutions may still exist.

For each dataset, you will probably be able to converge on many different solutions. You can try this by clicking the RESET CENTROIDS button after your model has already converged on one set, and running it again. This becomes particularly obvious if NUM-CLUSTERS and NUM-CENTROIDS are not equal.

## EXTENDING THE MODEL

The dataset currently is distributed in clusters across a 2-d space. It could be interesting to add the ability to either import datasets and find clusters in them, or somehow allow the user to create their own datasets.

Currently, only the data points that are identified as belonging to a cluster change their color as the algorithm runs. Another interesting way to illustrate clusters might be to change the color of patches that are closest to the centroids. This would create a Voronoi diagram of the clusters (see under 'Related models' if you are curious about Voronoi diagrams).

A clustering algorithm requires a measure to assess bow well it has clustered. This model uses the square deviation of the Euclidean distance. This measure has some good properties, but it assigns better scores to higher numbers of centroids. Can you think of other measures that improve the clustering?

## NETLOGO FEATURES

_CREATE-TEMPORARY-PLOT-PEN_: The model plots the number of data points for each of the centroids. However, because this number might change during runtime (i.e. if you let the model run, change the number of clusters, and then run the model again), the model uses temporary pens that are created and die with the centroids.

## RELATED MODELS

* Voronoi

* Emergent Voronoi

* MaterialSim Grain Growth

* Fur

* Honeycomb

* Scatter

* Hotelling's Law

## REFERENCES

For more information on k-means clustering, see https://en.wikipedia.org/wiki/K-means_clustering. (There are also many other sites on this topic on the web.)

## HOW TO CITE

If you mention this model or the NetLogo software in a publication, we ask that you include the citations below.

For the model itself:

* Hjorth, A., Head, B. and Wilensky, U. (2014). NetLogo K-Means Clustering model. http://ccl.northwestern.edu/netlogo/models/K-MeansClustering. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Please cite the NetLogo software as:

* Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Copyright 2014 Uri Wilensky.

![CC BY-NC-SA 3.0](http://ccl.northwestern.edu/images/creativecommons/byncsa.png)

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

## Comments and Questions

Click to Run Model

```breed [datapoints datapoint]
breed [centroids centroid]

globals [any-centroids-moved?]

to setup
clear-all
set-default-shape datapoints "circle"
set-default-shape centroids "x"
generate-datapoints
reset-centroids
end

to setup2
clear-all
set-default-shape datapoints "circle"
set-default-shape centroids "x"
generate-datapoints
reset-centroids3
reset-ticks
end

to generate-datapoints
repeat num-datapoints [
let center-x random-xcor / 1.5
let center-y random-ycor / 1.5
create-datapoints num-datapoints [
setxy center-x center-y
set heading random 360
]]
end

to reset-centroids
set any-centroids-moved? true
ask datapoints [ set color grey ]

let colors base-colors
create-centroids num-centroids [
move-to one-of datapoints
set size 5
set color last colors + 1
set colors butlast colors
]
;;clear-all-plots
reset-ticks
end

to reset-centroids2
set any-centroids-moved? true
ask datapoints [ set color grey ]

let colors base-colors
create-centroids num-centroids [
move-to one-of datapoints
set size 5
set color last colors + 1
set colors butlast colors
]
end

to reset-centroids3
set any-centroids-moved? true
ask datapoints [ set color grey ]

let colors base-colors
create-centroids 1 [
move-to one-of datapoints
set size 5
set color last colors + 1
set colors butlast colors
]
end

to go
if not any-centroids-moved? [stop]
set any-centroids-moved? false
assign-clusters
update-clusters
tick
end

to go2
set num-centroids (ticks + 1)
reset-centroids2
while[any-centroids-moved?][
set any-centroids-moved? false
assign-clusters
update-clusters2
]
tick
end

to assign-clusters
ask datapoints [set color ([color] of (closest-centroid) - 2)]
end

to update-clusters
let movement-threshold 0.1
let my-points datapoints with [ shade-of? color [ color ] of myself ]
if any? my-points [
let new-xcor mean [ xcor ] of my-points
let new-ycor mean [ ycor ] of my-points
if distancexy new-xcor new-ycor > movement-threshold [
set any-centroids-moved? true
]
setxy new-xcor new-ycor
]
]
update-plots
end

to update-clusters2
let movement-threshold 0.1
let my-points datapoints with [ shade-of? color [ color ] of myself ]
if any? my-points [
let new-xcor mean [ xcor ] of my-points
let new-ycor mean [ ycor ] of my-points
if distancexy new-xcor new-ycor > movement-threshold [
set any-centroids-moved? true
]
setxy new-xcor new-ycor
]
]
end

to-report closest-centroid
report min-one-of centroids [ distance myself ]
end

to-report square-deviation
report sum [ (distance myself) ^ 2 ] of datapoints with [ closest-centroid = myself ]
end

to-report mean-square-deviation
report mean [ square-deviation ] of centroids
end

; Copyright 2014 Uri Wilensky.
; See Info tab for full copyright and license.
```

There is only one version of this model, created 7 months ago by Tânia Duarte.

## Attached files

No files

This model does not have any ancestors.

This model does not have any descendants.