Word Counting

1 collaborator

Martin Dobiasch (Author)

Tags

mapreduce 

Tagged by Martin Dobiasch almost 11 years ago

Visible to everyone | Changeable by the author
Model was written in NetLogo 5.0.5 • Viewed 404 times • Downloaded 36 times • Run 0 times

WHAT IS IT?

This example is intended as the first example when MapReduce is introduced. If students are not familiar with multi-threading or parallel programming, they can first be asked to write a word-counting program in a style they already know, such as a plain iterative program (see the iterative word count model).

What students should take from the presentation is that, with MapReduce, a simple task like counting all words in a set of files can be done with just a few lines of code.
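For comparison, an iterative version that students might write first could look like the sketch below. It is not the code of the iterative word count model; the names split-words and count-words-in-file are hypothetical, words are split at spaces only, and the result is kept as a plain list of [word count] pairs so that no extension is needed.

```netlogo
;; Hypothetical helper: split a line into words at spaces
to-report split-words [line]
  let result []
  let current ""
  let i 0
  while [i < length line] [
    let ch item i line
    ifelse ch = " "
      [ if current != "" [ set result lput current result  set current "" ] ]
      [ set current (word current ch) ]
    set i i + 1
  ]
  ;; Do not lose the last word of the line
  if current != "" [ set result lput current result ]
  report result
end

;; Hypothetical reporter: count the words of one file, line by line.
;; Reports a list of [word count] pairs.
to-report count-words-in-file [file-name]
  let counts []
  file-open file-name
  while [not file-at-end?] [
    foreach split-words file-read-line [
      let w ?
      ;; position reports false if the word has not been seen yet
      let idx position w (map [first ?] counts)
      ifelse is-number? idx
        [ set counts replace-item idx counts (list w (1 + last (item idx counts))) ]
        [ set counts lput (list w 1) counts ]
    ]
  ]
  file-close
  report counts
end
```

Looping over several files, and doing so on several computers, would still have to be written by hand on top of this, which is the part the MapReduce version leaves to the framework.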

This model also demonstrates how to modify the configuration of MapReduce (here, the value separator).

HOW IT WORKS

The model starts a simple MapReduce job that counts the words of the documents in a given directory. The mapper is read-file. The reducer is word-count.

HOW TO USE IT

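First, one instance of the model needs to run as the server: press the server button (this starts a HubNet activity). On the other computers, load the same model (not the HubNet client) and press node. Note that the node procedure as written connects to 127.0.0.1 on port 9173, so the address has to be changed to the server's address when the nodes run on other machines. Once enough nodes have joined the activity, select the data-set you want to use and press "Count Words".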

THINGS TO TRY

  • Try processing different data-sets by changing the selection.
  • Try computing the result with different numbers of computers in the network.

EXTENDING THE MODEL

Limiting Input

Later in the course it can be shown that, with MapReduce, limiting the input to “.txt” files takes a single command in the job configuration. This simple restriction to files with the “.txt” extension can also be implemented in plain NetLogo with the substring reporter and a simple if. However, once wildcards are needed (for example, only files starting with the letter “a” and ending with “.txt”, i.e. “a*.txt”), this is no longer trivial in plain NetLogo, while it remains a single command for the MapReduce framework.
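The plain-NetLogo check mentioned above could be sketched as follows. The reporter name txt-file? is hypothetical and not part of the model; the comparison is case-sensitive, so “.TXT” would not match.

```netlogo
;; Hypothetical helper: report true if file-name ends with ".txt"
to-report txt-file? [file-name]
  let n length file-name
  ;; Names shorter than the extension cannot match
  if n < 4 [ report false ]
  report (substring file-name (n - 4) n) = ".txt"
end
```

A caller would then wrap the processing of each file in something like if txt-file? file-name [ ... ]. A pattern such as “a*.txt” would need an additional check on the first character, and general wildcards would need a hand-written matcher.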

Filtering

The basic algorithm as written treats “Gutenberg,” and “Gutenberg” as different words. For presenting MapReduce this is not a problem. However, the students can be encouraged to remove characters such as quotation marks, “,” and “;” from the words in the mapper. The task could be set up so that the students have to run the program, inspect the output, find out which words are not counted correctly, and modify the program accordingly. That way the students gain a deeper insight into how the program works. Moreover, since they have to modify the program themselves, they will come to see it as their own program. Thus, this enhancement can be used to get students enthusiastic about MapReduce.
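One possible shape of this enhancement is sketched below. The names strip-punctuation and read-file-filtered are hypothetical, and the set of removed characters (“,” and “;” only) is an assumption that students would extend after inspecting the output. The only extension call used is mapreduce:emit, exactly as in the original mapper.

```netlogo
;; Hypothetical helper: remove unwanted characters from a word
to-report strip-punctuation [raw-word]
  ;; Assumed character set; add further characters as needed
  let unwanted ",;"
  let cleaned ""
  let i 0
  while [i < length raw-word] [
    let ch item i raw-word
    ;; Keep only characters that are not in the unwanted set
    if not member? ch unwanted [ set cleaned (word cleaned ch) ]
    set i i + 1
  ]
  report cleaned
end

;; Hypothetical variant of the mapper that emits cleaned words
to read-file-filtered [file-name words]
  foreach words [
    let cleaned strip-punctuation ?
    ;; Skip tokens that consisted of punctuation only
    if cleaned != "" [ mapreduce:emit cleaned "1" ]
  ]
end
```

To try it, the mapper name passed to mapreduce:mapreduce in count-words would be changed from "read-file" to "read-file-filtered". With this variant, “Gutenberg,” and “Gutenberg” are emitted under the same key and therefore counted together.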

NETLOGO FEATURES

This model demonstrates how the MapReduce extension can be used.

In particular, it shows the code needed to make the framework distribute the work among several computers.

RELATED MODELS

Check out the other MapReduce models, especially the other WordCount versions.

extensions [mapreduce]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; Mapper
;;;; Receives the words of one line of a file and emits each word
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

to read-file [file-name words]
  ;;; Loop over all words
  foreach words
  [
    ;;; emit: key = the word, value = "1" (one occurrence)
    mapreduce:emit ? "1"
  ]
end 

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; Reducer
;;;; Sum up the occurrences of a word
;;;;  key    the word
;;;;  accum  current count
;;;;  value  next value (a string, hence read-from-string)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

to-report word-count [key accum value]
  report accum + read-from-string value
end 

to server
  mapreduce:acceptworkers
end 

to node
  mapreduce:node.connect "127.0.0.1" 9173
end 

to count-words
  let words []
  let runit true
  
  reset-ticks
  
  ;;; Tell MapReduce that a line has words, separated by spaces
  mapreduce:config.valueseparator " "
  ;;; Start the MapReduce computation
  let res mapreduce:mapreduce "read-file" "word-count" 0 data-set
  ;;; Wait for the computation to finish and display the progress
  while [mapreduce:running?] [
    every 0.5 [
      print mapreduce:map-progress
      print mapreduce:reduce-progress
      ; plot 1
      tick
    ]
  ]
  tick
  print "done"

  ;;; Print the result
  show mapreduce:result res
  tick
end 

There is only one version of this model, created almost 11 years ago by Martin Dobiasch.

Attached files

File Type Description Last updated
datasets.tar.gz data Some sample data sets for Word Counting almost 11 years ago, by Martin Dobiasch Download

This model does not have any ancestors.

This model does not have any descendants.