Improved K-means Clustering Model

No preview image

1 collaborator

WHAT IS IT?

Often, when we have data about a large set of objects, we want to identify groups of similar objects within that set. For example, if we have many news articles, we may want to identify groups of articles that all have the same topic. In statistics, this task is called “cluster analysis”, or “clustering”.

This model shows the k-means clustering algorithm. a simple, but often effective approach to clustering. In this model, the k-means clustering algorithm is used to identify clusters of points on a plane. In the general case, you can represent your data objects as vectors of numbers, where each number represents a feature of the object.

The model is initialized by creating a specific number of clusters (NUM-CLUSTERS). In a real-world application, this number of clusters would be unknown. The algorithm is designed to find the clusters, and the user selects a guess, NUM-CENTROIDS, which tells the algorithm, how many clusters to search for. You can search for different numbers of clusters till you find one that best characterizes your data.

The result of the algorithm is a set of NUM-CENTROIDS points, (each point is called a centroid), each of which is located at the average position of a corresponding cluster. The “k” in k-mean clustering refers to the guess at how many centroids to search for.

The purpose of the model is to allow you to see how the number of data points, clusters, and centroids interact, and how the algorithm can often find many different sets of clusters for the same dataset.

HOW IT WORKS

The algorithm works by finding the average position of the elements in NUM-CENTROIDS different clusters. It starts by simply guessing an average position for each of the NUM-CENTROIDS clusters, and then improves these guesses as it goes on.

When the model is set up, two things happen: First, NUM-DATA-POINTS data points are generated and shown in the view. Second, NUM-CENTROIDS centroids are generated and moved to a randomly selected data point. This randomly selected point is the initial guess for each centroid.

Each tick, the model performs two steps: 1. ASSIGN POINTS: all data points assign themselves to their closest centroid, taking on its color. 2. UPDATE CENTROIDS: all centroids move to the average position of the points assigned to it.

By iterating these steps a few times, the model is usually able to identify clusters in the dataset.

HOW TO USE IT

First choose how many clusters you want to create and with how many data points, using respectively the NUM-CLUSTERS slider and the NUM-DATA-POINTS slider. Using the NUM-CENTROIDS slider, you can decide how many centroids you will create, which is how many clusters to search for. With real data, you won’t necessarily know how many naturally occurring clusters there are. By making NUM-CLUSTERS and NUM-CENTROIDS different, you can see what the algorithm does when it’s looking for too many or not enough clusters. Finally, click SETUP to generate your dataset and your centroids.

At this point you can choose to call each of the two steps manually using the ASSIGN POINTS, and UPDATE CENTROIDS buttons. This lets you see exactly how each of the steps work.

Alternatively you can press the FIND CLUSTERS (GO) button.

The model stops when the points assigned to the centroids don’t change. When this happens, we say that the centroids have “converged”.

RESET CENTROIDS sets the position of the centroids to random points in your data set, so that you can see if the clustering algorithm finds different sets of clusters depending on its initial guesses.

THINGS TO NOTICE

You will often get very different results, depending on the initial guesses for the centroids’ location. Try and explore each data set several times, both by resetting the centroids, and by changing the number of centroids you place in the dataset.

THINGS TO TRY

K-means clustering won't necessarily find the best solution. In fact, there are many solutions that it can converge to. Each of these solutions will be locally optimal. In other words, you can't find a better solution by moving the centroids by a small amount. However, better solutions may still exist.

For each dataset, you will probably be able to converge on many different solutions. You can try this by clicking the RESET CENTROIDS button after your model has already converged on one set, and running it again. This becomes particularly obvious if NUM-CLUSTERS and NUM-CENTROIDS are not equal.

EXTENDING THE MODEL

The dataset currently is distributed in clusters across a 2-d space. It could be interesting to add the ability to either import datasets and find clusters in them, or somehow allow the user to create their own datasets.

Currently, only the data points that are identified as belonging to a cluster change their color as the algorithm runs. Another interesting way to illustrate clusters might be to change the color of patches that are closest to the centroids. This would create a Voronoi diagram of the clusters (see under 'Related models' if you are curious about Voronoi diagrams).

A clustering algorithm requires a measure to assess bow well it has clustered. This model uses the square deviation of the Euclidean distance. This measure has some good properties, but it assigns better scores to higher numbers of centroids. Can you think of other measures that improve the clustering?

NETLOGO FEATURES

CREATE-TEMPORARY-PLOT-PEN: The model plots the number of data points for each of the centroids. However, because this number might change during runtime (i.e. if you let the model run, change the number of clusters, and then run the model again), the model uses temporary pens that are created and die with the centroids.

RELATED MODELS

Voronoi
Emergent Voronoi
MaterialSim Grain Growth
Fur
Honeycomb
Scatter
Hotelling's Law

REFERENCES

For more information on k-means clustering, see https://en.wikipedia.org/wiki/K-means_clustering. (There are also many other sites on this topic on the web.)

HOW TO CITE

If you mention this model or the NetLogo software in a publication, we ask that you include the citations below.

For the model itself:

Hjorth, A., Head, B. and Wilensky, U. (2014). NetLogo K-Means Clustering model. http://ccl.northwestern.edu/netlogo/models/K-MeansClustering. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Please cite the NetLogo software as:

Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

COPYRIGHT AND LICENSE

CC BY-NC-SA 3.0

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Commercial licenses are also available. To inquire about commercial licenses, please contact Uri Wilensky at uri@northwestern.edu.

Comments and Questions

Please start the discussion about this model! (You'll first need to log in.)

Click to Run Model

breed [datapoints datapoint]
breed [centroids centroid]

globals [any-centroids-moved?]

to setup
  clear-all
  set-default-shape datapoints "circle"
  set-default-shape centroids "x"
  generate-datapoints
  reset-centroids
end 

to setup2
  clear-all
  set-default-shape datapoints "circle"
  set-default-shape centroids "x"
  generate-datapoints
  reset-centroids3
  reset-ticks
end 

to generate-datapoints
  repeat num-datapoints [
    let center-x random-xcor / 1.5
    let center-y random-ycor / 1.5
    create-datapoints num-datapoints [
      setxy center-x center-y
      set heading random 360
        ]]
end 

to reset-centroids
  set any-centroids-moved? true
  ask datapoints [ set color grey ]

  let colors base-colors
  ask centroids [die]
  create-centroids num-centroids [
    move-to one-of datapoints
    set size 5
    set color last colors + 1
    set colors butlast colors
  ]
  ;;clear-all-plots
  reset-ticks
end 

to reset-centroids2
  set any-centroids-moved? true
  ask datapoints [ set color grey ]

  let colors base-colors
  ask centroids [die]
  create-centroids num-centroids [
    move-to one-of datapoints
    set size 5
    set color last colors + 1
    set colors butlast colors
  ]
end 

to reset-centroids3
  set any-centroids-moved? true
  ask datapoints [ set color grey ]

  let colors base-colors
  ask centroids [die]
  create-centroids 1 [
    move-to one-of datapoints
    set size 5
    set color last colors + 1
    set colors butlast colors
  ]
end 

to go
  if not any-centroids-moved? [stop]
  set any-centroids-moved? false
  assign-clusters
  update-clusters
  tick
end 

to go2
  set num-centroids (ticks + 1)
  reset-centroids2
  while[any-centroids-moved?][
    set any-centroids-moved? false
    assign-clusters
    update-clusters2
  ]
  tick
end 

to assign-clusters
  ask datapoints [set color ([color] of (closest-centroid) - 2)]
end 

to update-clusters
  let movement-threshold 0.1
  ask centroids [
    let my-points datapoints with [ shade-of? color [ color ] of myself ]
    if any? my-points [
      let new-xcor mean [ xcor ] of my-points
      let new-ycor mean [ ycor ] of my-points
      if distancexy new-xcor new-ycor > movement-threshold [
        set any-centroids-moved? true
      ]
      setxy new-xcor new-ycor
    ]
  ]
  update-plots
end 

to update-clusters2
  let movement-threshold 0.1
  ask centroids [
    let my-points datapoints with [ shade-of? color [ color ] of myself ]
    if any? my-points [
      let new-xcor mean [ xcor ] of my-points
      let new-ycor mean [ ycor ] of my-points
      if distancexy new-xcor new-ycor > movement-threshold [
        set any-centroids-moved? true
      ]
      setxy new-xcor new-ycor
    ]
  ]
end 

to-report closest-centroid
  report min-one-of centroids [ distance myself ]
end 

to-report square-deviation
  report sum [ (distance myself) ^ 2 ] of datapoints with [ closest-centroid = myself ]
end 

to-report mean-square-deviation
  report mean [ square-deviation ] of centroids
end 



; Copyright 2014 Uri Wilensky.
; See Info tab for full copyright and license.

There is only one version of this model, created over 7 years ago by Tânia Duarte.

Attached files

No files

This model does not have any ancestors.

This model does not have any descendants.

NetLogo