SUTO6100_Text Information Content

WHAT IS IT?

This is a tool to calculate the Shannon information content (H) of any text.

Shannon information content is usually expressed as the average number of bits needed to store or communicate one symbol of a message. It quantifies the uncertainty involved in predicting the value of a random variable, such as the next symbol in the message.
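As a worked illustration (not part of the model itself), here is the standard definition of H for outcomes with probabilities p_i, applied to two coins. The coin probabilities are assumed values chosen for the example; they show that a more predictable source carries fewer bits per symbol.

```latex
H = -\sum_{i} p_i \log_2 p_i

\text{Fair coin } (p = \tfrac12,\ \tfrac12):\quad H = -\left(\tfrac12 \log_2 \tfrac12 + \tfrac12 \log_2 \tfrac12\right) = 1 \text{ bit}

\text{Biased coin } (p = 0.9,\ 0.1):\quad H = -\left(0.9 \log_2 0.9 + 0.1 \log_2 0.1\right) \approx 0.469 \text{ bits}
```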

HOW IT WORKS

This tool accepts text as input and counts how often each word occurs, building a "frequency table". The parsing is quite crude: the text is split on spaces, one trailing punctuation mark is stripped from each word, and case is preserved, so "The" is considered to be a different word from "the". Counting stops once the limit Max-Word-Count (set to 1000 in setup) is reached. After the frequencies are counted, a second table is built, called the "probability table". It holds the probability of getting each word if you were given a single word at random from the text. The probability of a word is calculated by dividing its frequency count by the total number of words in the message.
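The two tables can be sketched in a few lines of Python. This is an illustrative sketch, not the model's NetLogo code: the names split_words, build_tables, and TRAILING_PUNCTUATION are hypothetical, and the sketch splits on any whitespace and omits the word limit and first-character filter that the model applies.

```python
from collections import Counter

# Assumption: the same trailing marks the model strips from the end of a word.
TRAILING_PUNCTUATION = ".,?!;:"


def split_words(text):
    """Split on whitespace and strip at most one trailing punctuation mark per word."""
    words = []
    for token in text.split():
        if token[-1] in TRAILING_PUNCTUATION:
            token = token[:-1]
        if token:  # skip tokens that were punctuation only
            words.append(token)
    return words


def build_tables(text):
    """Return (frequency table, probability table) for the words in text."""
    words = split_words(text)
    freq_table = Counter(words)  # word -> number of occurrences (case-sensitive)
    total = len(words)
    prob_table = {word: count / total for word, count in freq_table.items()}
    return freq_table, prob_table


freq_table, prob_table = build_tables("The cat saw the dog. The dog ran.")
print(freq_table["The"], freq_table["the"])  # "The" and "the" are counted separately
print(prob_table["dog"])                     # 2 occurrences out of 8 words
```

Because every word's count is divided by the same total, the probabilities in the second table always sum to 1.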

The probabilities from the probability table are used to build the sum that is H. For each distinct word, its probability p is multiplied by the base-2 logarithm of p; these products are summed over all distinct words, and the sum is multiplied by negative 1. In other words, H is the sum of -p log2 p over all distinct words in the message, measured in bits per word.
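The sum can be sketched in Python as follows. This is a minimal sketch rather than the model's own code; the function name shannon_entropy and the two example probability tables are made up for illustration.

```python
import math


def shannon_entropy(prob_table):
    """Return H = sum of -p * log2(p) over all entries, in bits per symbol."""
    return -sum(p * math.log2(p) for p in prob_table.values() if p > 0)


# Hypothetical probability tables: four equally likely words, then a skewed mix.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.5, "b": 0.25, "c": 0.25}

print(shannon_entropy(uniform))  # 2.0
print(shannon_entropy(skewed))   # 1.5
```

Four equally likely words need 2 bits each, while the skewed table needs only 1.5 bits on average because one word is more predictable than the others.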

HOW TO USE IT

Load text and press "Go". Alternatively, use the buttons on the right to load sample text.

THINGS TO NOTICE

Notice the use of the table extension. This data structure makes it much easier to organize the tables of information that we create to store frequency and probability data. Each table is simply a set of ordered pairs Key ---> Value. To find the value of a key, use the table:get primitive.

THINGS TO TRY/EXTENDING THE MODEL

Improvements to the parser could certainly be made. Also, more interesting ways to visualize the entropy of different contexts would be valuable.
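One possible parser improvement, sketched in Python under the assumption that case and surrounding punctuation should be ignored. The function name tokenize and the regular expression are illustrative choices, not part of the model.

```python
import re

# Runs of lowercase letters and digits, allowing apostrophes inside a word.
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and return its words without surrounding punctuation."""
    return WORD_PATTERN.findall(text.lower())


print(tokenize('"The" dog saw the cat; the cat ran.'))
# ['the', 'dog', 'saw', 'the', 'cat', 'the', 'cat', 'ran']
```

Merging "The" and "the" reduces the number of distinct words, which generally lowers the measured H for the same text. This pattern only recognizes ASCII letters and digits.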

CREDITS AND REFERENCES

This model is part of the Information Theory series of the Complexity Explorer project.

Main Author: John Balwit

Contributions from: Melanie Mitchell

NetLogo: Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

HOW TO CITE

If you use this model, please cite it as: "Text Information Content" model, Complexity Explorer project, http://complexityexplorer.org

COPYRIGHT AND LICENSE

This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike International license ( http://creativecommons.org/licenses/ ). It states that you may copy, distribute, and transmit the work provided that you give attribution to ComplexityExplorer.org and that your use is for non-commercial purposes.

extensions [table]

globals [txt freq-table probabilty-table word-count Max-Word-Count]

to startup
  ca
  set input ""
end 

to setup
  ca
  set Max-Word-Count 1000


  set txt input
  build-frequency-table list-of-words
  build-probability-table
  sort-list
  plot-frequencies
  list-most-frequent-words
end 

to-report list-of-words
  let $txt txt
  set $txt word $txt " "  ; add space  for loop termination
  let words []  ; list of values
  while [not empty? $txt]
  [ let n position " " $txt
    ;show word "n: " n
    let $item substring $txt 0 n  ; extract item
    if not empty? $item [if member? last $item ".,?!;:" [set $item butlast $item ] ] ; strip trailing punctuation
    ;carefully [set $item read-from-string $item ][ ] ; convert if number
    carefully [if member? first $item " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890" [set words lput $item words]][]  ; append to list, ignore CR/LFs
    set $txt substring $txt (n + 1) length $txt  ; remove $item and space
  ]
  report words
end 

to build-frequency-table [#word]
  set freq-table table:make
  set probabilty-table table:make
  set word-count 0
  foreach #word [ ?1 ->
    if word-count >= Max-Word-Count [stop]  ;; check the limit first so word-count matches the number of words tabulated
    set word-count word-count + 1  ;; find total count of words
    ifelse table:has-key? freq-table ?1  [let i table:get freq-table ?1 table:put freq-table ?1 i + 1 ] [table:put freq-table ?1 1]
    ]
end 

to build-probability-table
  foreach table:keys freq-table [ ?1 -> table:put probabilty-table ?1 table:get freq-table ?1 * (1 / word-count) ]

  ;print freq-table
  ;print probabilty-table
end 

to-report H
  let sum-plogp 0
  foreach table:keys probabilty-table
   [ ?1 ->
     let p table:get probabilty-table ?1
     set sum-plogp  sum-plogp  + -1 * p * log p 2
   ]
   report sum-plogp
end 

to plot-frequencies
  foreach table:keys freq-table [ ?1 -> plot table:get freq-table ?1 ]
end 

to sort-list
  let freq-list []
  foreach table:keys freq-table [ ?1 -> set freq-list lput list ?1 table:get freq-table ?1  freq-list  ] ;builds a list version of the table.
  set freq-table table:from-list sort-by [ [?1 ?2] -> last ?1 > last ?2 ] freq-list ;sort list by frequency counts and recreates table.
end 

to list-most-frequent-words
  let i 0
  foreach table:keys freq-table [ ?1 -> output-print (word table:get freq-table ?1 " x \"" ?1 "\"") set i i + 1 if i > 17 [output-print "... etc.. " stop] ]
end

There is only one version of this model, created 3 days ago by Jalayer Khalilzadeh.
