SUTO6100_Text Information Content
WHAT IS IT?
This is a tool to calculate the Shannon information content (H) of any text.
Shannon information content is usually expressed as the average number of bits needed to store or communicate one symbol in a message. This measure quantifies the uncertainty involved in predicting the value of a future event (or random variable).
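For a source whose symbols occur with probabilities p(x), this quantity is H = - Σ p(x) log2 p(x), summed over all possible symbols. For example, a fair coin (two outcomes, each with probability 1/2) has H = 1 bit, while a heavily biased coin has H close to 0 bits.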
HOW IT WORKS
This tool accepts text as input and calculates the frequency with which each word occurs, building a "frequency table". The parsing is quite crude: "The" is considered to be a different word from "the". After calculating frequencies, a second table, the "probability table", is built. This table holds the probability of encountering any particular word if you were to draw a single word at random from the text. The probability for each word is calculated by dividing that word's frequency count by the total number of words in the message.
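For example (this text is not one of the model's built-in samples), the six-word message "the cat sat on the mat" gives the frequency table: the -> 2, cat -> 1, sat -> 1, on -> 1, mat -> 1, and the probability table: the -> 2/6 ≈ 0.33, with each of the other four words at 1/6 ≈ 0.17.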
The probabilities from the probability table are used to build the sum that is H: for each distinct word, its probability p is multiplied by the base-2 logarithm of p, these products are summed, and the sum is multiplied by negative 1. In other words, H is the sum of -p log2 p over all distinct words in the message.
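Continuing the small example above, H = -( (2/6) log2(2/6) + 4 × (1/6) log2(1/6) ) ≈ 0.53 + 1.72 ≈ 2.25, so on average about 2.25 bits are needed to encode each word of that message.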
HOW TO USE IT
Load text and press "Go". Alternatively, use the buttons on the right to load sample text.
THINGS TO NOTICE
Notice the use of the table extension. This data structure makes it much easier to organize the tables we create to store the frequency and probability data. Each table is simply a set of ordered pairs Key ---> Value. To look up the value associated with a key, use the table:get primitive, as in the sketch below.
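Here is a minimal sketch of the table primitives used in this model; the procedure name and keys below are illustrative, not taken from the model itself:

extensions [table]

to demo-table
  let counts table:make            ; create an empty table
  table:put counts "the" 2         ; store a Key -> Value pair
  table:put counts "cat" 1
  print table:get counts "the"     ; look up the value for a key: prints 2
  print table:keys counts          ; report the list of keys in the table
end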
THINGS TO TRY/EXTENDING THE MODEL
Improvements to the parser could certainly be made. Also, more interesting ways to visualize the entropy of different contexts would be valuable.
CREDITS AND REFERENCES
This model is part of the Information Theory series of the Complexity Explorer project.
Main Author: John Balwit
Contributions from: Melanie Mitchell
NetLogo: Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.
HOW TO CITE
If you use this model, please cite it as: "Text Information Content" model, Complexity Explorer project, http://complexityexplorer.org
COPYRIGHT AND LICENSE
Copyright 2016 Santa Fe Institute.
This model is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike International license ( http://creativecommons.org/licenses/ ). This means that you may copy, distribute, and transmit the work, provided that you give attribution to ComplexityExplorer.org and that your use is for non-commercial purposes.
extensions [table]

globals [txt freq-table probability-table word-count Max-Word-Count]

to startup
  ca
  set input ""
end

to setup
  ca
  set Max-Word-Count 1000
  set txt input
  build-frequency-table list-of-words
  build-probability-table
  sort-list
  plot-frequencies
  list-most-frequent-words
end

to-report list-of-words
  let $txt txt
  set $txt word $txt " "                  ; add space for loop termination
  let words []                            ; list of values
  while [not empty? $txt]
  [
    let n position " " $txt
    ;show word "n: " n
    let $item substring $txt 0 n          ; extract item
    if not empty? $item
      [ if member? last $item ".,?!;:" [ set $item butlast $item ] ]   ; strip trailing punctuation
    ;carefully [set $item read-from-string $item] []                   ; convert if number
    carefully
      [ if member? first $item " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890"
          [ set words lput $item words ] ]                             ; append to list, ignore cr/lfs
      []
    set $txt substring $txt (n + 1) length $txt                        ; remove $item and space
  ]
  report words
end

to build-frequency-table [#word]
  set freq-table table:make
  set probability-table table:make
  set word-count 0
  foreach #word [ ?1 ->
    set word-count word-count + 1                    ;; find total count of words
    if word-count >= Max-Word-Count [ stop ]
    ifelse table:has-key? freq-table ?1
      [ let i table:get freq-table ?1
        table:put freq-table ?1 (i + 1) ]
      [ table:put freq-table ?1 1 ]
  ]
end

to build-probability-table
  foreach table:keys freq-table [ ?1 ->
    table:put probability-table ?1 (table:get freq-table ?1 * (1 / word-count))
  ]
  ;print freq-table
  ;print probability-table
end

to-report H
  let sum-plogp 0
  foreach table:keys probability-table [ ?1 ->
    let p table:get probability-table ?1
    set sum-plogp sum-plogp + (-1 * p * log p 2)
  ]
  report sum-plogp
end

to plot-frequencies
  foreach table:keys freq-table [ ?1 ->
    plot table:get freq-table ?1
  ]
end

to sort-list
  let freq-list []
  foreach table:keys freq-table [ ?1 ->
    set freq-list lput (list ?1 table:get freq-table ?1) freq-list     ; builds a list version of the table
  ]
  set freq-table table:from-list sort-by [ [?1 ?2] -> last ?1 > last ?2 ] freq-list   ; sorts by frequency counts and recreates the table
end

to list-most-frequent-words
  let i 0
  foreach table:keys freq-table [ ?1 ->
    output-print (word table:get freq-table ?1 " x \"" ?1 "\"")
    set i i + 1
    if i > 17 [ output-print "... etc.. " stop ]
  ]
end