Word Counting
WHAT IS IT?
This example is meant to be the first one shown when MapReduce is introduced. If the students are not familiar with multi-threading or parallel programming, they can first be asked to write a word-counting program in a style they already know, such as a plain iterative program (see the iterative word count model).
The main point for students to take away is that, with MapReduce, a simple task such as counting all the words in a set of files takes only a few lines of code.
This model also demonstrates how to modify the configuration of MapReduce.
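For comparison, the iterative baseline mentioned above could look roughly like the following. This is only a sketch and not the code of the iterative word count model: the reporter name count-words-iteratively is hypothetical, and it assumes the words have already been read into a list of strings. It uses the table extension that ships with NetLogo and the same "?" syntax as the model's code.

```netlogo
extensions [table]

;; Hypothetical reporter (not part of this model):
;; counts how often each string occurs in the list "words".
to-report count-words-iteratively [words]
  let counts table:make
  foreach words [
    ifelse table:has-key? counts ?
      [ table:put counts ? ((table:get counts ?) + 1) ]  ;; seen before: add one
      [ table:put counts ? 1 ]                           ;; first occurrence
  ]
  ;; report the result as a list of [word count] pairs
  report table:to-list counts
end
```

It can be tried from the Command Center with, for example, show count-words-iteratively ["a" "b" "a"]. Everything here runs on one computer, which is the contrast the MapReduce version is meant to highlight.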
HOW IT WORKS
The model starts a simple MapReduce job that counts the words in the documents in a given directory. The mapper is read-file. The reducer is word-count.
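To make the reduce step more concrete, the sketch below applies the same update rule as the model's reducer procedure (word-count in the code listing) to all values emitted for one word. The reporter fold-counts is a hypothetical helper for illustration only. How the extension actually schedules these calls is not shown here, and starting from 0 is an assumption based on the 0 passed in the job call.

```netlogo
;; Hypothetical helper, not part of the model or the extension:
;; adds up the string values ("1", "1", ...) emitted for a single word,
;; one value at a time, just as the reducer does with its accumulator.
to-report fold-counts [emitted]
  let acum 0   ;; assumed starting value
  foreach emitted [
    set acum acum + read-from-string ?
  ]
  report acum
end
```

For a word that was emitted three times, fold-counts ["1" "1" "1"] reports 3.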
HOW TO USE IT
First, one copy of the model has to run as the server: press the server button (this starts a HubNet activity). On each of the other computers, load the same model (not the HubNet client) and press node. Once enough nodes have joined the activity, select the data-set you want to use and press "Count Words".
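Note that the node procedure in the code listing connects to 127.0.0.1, which is the local machine. A node on another computer presumably needs the address of the server instead. A minimal sketch of such a variant, assuming mapreduce:node.connect accepts any host string in the position where the listing passes "127.0.0.1" (the procedure name node-remote is hypothetical):

```netlogo
;; Hypothetical variant of the model's node procedure:
;; asks the user for the server address instead of using 127.0.0.1.
to node-remote
  let host user-input "Address of the computer running the server:"
  ;; same call and port as in the model's node procedure
  mapreduce:node.connect host 9173
end
```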
THINGS TO TRY
- Try computing different data-sets by changing the selection.
- Try computing the result with different numbers of computers in the network.
EXTENDING THE MODEL
Limiting Input
Later in the course it can be shown that, with MapReduce, restricting the input to “.txt” files takes a single command in the job configuration. This simple restriction to the “.txt” extension can also be done in plain NetLogo with the substring reporter and a simple if. However, once wildcards are needed (for example, only files that start with the letter “a” and end with “.txt”, i.e. “a*.txt”), this is no longer trivial in NetLogo, while it remains a single command in the MapReduce framework.
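The plain NetLogo check described above could be sketched as follows. The reporter name txt-file? is hypothetical, and the code only covers the simple extension test, not general wildcards.

```netlogo
;; Hypothetical reporter: true if file-name ends in ".txt"
to-report txt-file? [file-name]
  let n length file-name
  ;; names shorter than the extension itself cannot match
  if n < 4 [ report false ]
  ;; compare the last four characters with ".txt"
  report (substring file-name (n - 4) n) = ".txt"
end
```

A caller would then wrap the processing of each file in if txt-file? file-name [ ... ]. Each additional pattern needs more hand-written string handling of this kind, which is the point of the comparison with the job configuration.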
Filtering
The basic algorithm treats “Gutenberg,” and “Gutenberg” as different words. For presenting MapReduce this is not a problem. However, the students can be encouraged to remove characters such as quotation marks, “,” and “;” from the words in the mapper. The task could be set up so that the students have to run the program, inspect the output, find out which words are not counted correctly, and modify the program accordingly. That way the students gain a deeper insight into how the program works. Moreover, since they have to modify the program themselves, they will come to see it as their own program. This makes the enhancement a good way to get students enthusiastic about MapReduce.
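One possible shape of this exercise is sketched below. The names clean-word and read-file-clean are hypothetical, the list of characters is only an example, and the code is meant to be added to this model, which already loads the mapreduce extension. The call to mapreduce:emit is used exactly as in the model's own mapper.

```netlogo
;; Hypothetical helper: removes a few punctuation characters from a word.
to-report clean-word [raw-word]
  let result raw-word
  foreach ["," ";" "\""] [
    ;; remove reports a copy of the string without the given character
    set result remove ? result
  ]
  report result
end

;; Hypothetical variant of the mapper that emits cleaned words only.
to read-file-clean [file-name words]
  foreach words [
    let w clean-word ?
    ;; skip entries that consisted of punctuation only
    if w != "" [ mapreduce:emit w "1" ]
  ]
end
```

With this helper, clean-word "Gutenberg," reports "Gutenberg", so both spellings end up under the same key. To use it, the job would have to be started with "read-file-clean" as the mapper name instead of "read-file".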
NETLOGO FEATURES
This model demonstrates how the MapReduce extension can be used.
In particular, it shows the code needed to make the framework distribute the work across several computers.
RELATED MODELS
Check out the other MapReduce models, especially the other WordCount versions.
extensions [mapreduce]

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; Mapper
;;;; Read the line of a file, split it into words, emit word
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
to read-file [file-name words]
  ;;; Loop over all words
  foreach words [
    ;;; Emit the word as key with the value "1"
    mapreduce:emit ? "1"
  ]
end

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;;; Reducer
;;;; Sum up the occurrences of a word
;;;; key    the word
;;;; acum   current count
;;;; value  next value
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
to-report word-count [key acum value]
  report acum + read-from-string value
end

to server
  mapreduce:acceptworkers
end

to node
  mapreduce:node.connect "127.0.0.1" 9173
end

to count-words
  let words []
  let runit true
  reset-ticks

  ;;; Tell MapReduce that a line has words, separated by spaces
  mapreduce:config.valueseparator " "

  ;;; Start the MapReduce computation
  let res mapreduce:mapreduce "read-file" "word-count" 0 data-set

  ;;; Wait for the computation to finish and display the progress
  while [mapreduce:running?] [
    every 0.5 [
      print mapreduce:map-progress
      print mapreduce:reduce-progress
      ; plot 1
      tick
    ]
  ]

  tick
  print "done"

  ;;; Print the result
  show mapreduce:result res
  tick
end
There is only one version of this model, created almost 11 years ago by Martin Dobiasch.
Attached files
File | Type | Description | Last updated
---|---|---|---
datasets.tar.gz | data | Some sample data sets for Word Counting | almost 11 years ago, by Martin Dobiasch
This model does not have any ancestors.
This model does not have any descendants.