Raymii.org
Word occurrence counter and analyzer
Published: 07-03-2013 | Author: Remy van Elst
With these commands you can analyze a text file. They count the occurrences of every word and print the statistics. This is useful for song lyrics, books, notes and anything else. It helps me analyze my writing style: which words I use most often, where my spelling errors are, and so on. It is also a nice way to win an argument about a Dragonforce song. This article uses lyrics as the example, but the approach applies to any text file.
Get the Lyrics (text)
First get the lyrics, or whatever text you want to analyze, into a text file. I've heard nano, vi(m) and emacs are quite good with text. In this example I will use a song by Dragonforce. It does not matter which one, because they're all full of the same words.
My lyrics file is named: df1.txt
Sanitize them
The tools we are going to use do not like all those commas, colons, exclamation marks and other non-alphanumeric characters. So sanitize the file like this:
cat df1.txt | tr -cd '[:alnum:] [:space:]' > df1san.txt
This pumps the file through the tr command. With these arguments (-c complements the set, -d deletes), tr strips every character that is not a-zA-Z0-9 or whitespace (spaces, tabs and newlines). Exactly what we want.
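To see what the filter does before running it on a whole file, here is a minimal sketch on a single line. The sample text and the function name sanitize are made up for illustration; they are not from the lyrics file.

```shell
# Hypothetical sample input, not taken from df1.txt.
sample='Fire, fire! We ride: through the night... 2 times?'

# Same filter as in the article, wrapped in a function.
# -c complements the set, -d deletes: every character that is NOT
# alphanumeric or whitespace is removed.
sanitize() {
    tr -cd '[:alnum:] [:space:]'
}

printf '%s\n' "$sample" | sanitize
# prints: Fire fire We ride through the night 2 times
```

The newline at the end of the line survives because it is part of [:space:].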
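Analyze it
Now we do the magic. GNU sed lowercases everything and puts each word on its own line, sort and uniq -c count the occurrences, sort -nr orders them by count, and head keeps the top 20:
sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20
remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20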
72 the
32
25 and
22 of
20 in
17 we
16 on
14 our
13 a
8 were
8 lost
8 for
7 will
7 still
7 light
6 to
6 so
6 fire
6 far
5 through
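The count with no word next to it (32 in the output above) most likely comes from empty lines and repeated spaces, which turn into empty "words" once every space becomes a newline. A sketch of a variant that avoids this, assuming GNU tr: it squeezes every run of whitespace into a single newline and does the lowercasing with tr instead of sed. The function name wordfreq and its arguments are made up for this example.

```shell
# wordfreq FILE [N]: print the N most frequent words in FILE (default 20).
# The function name and its arguments are made up for this sketch.
wordfreq() {
    # 1. delete everything except letters, digits and whitespace
    # 2. fold uppercase to lowercase
    # 3. turn every run of whitespace into a single newline
    # 4. drop the empty line left over if the file starts with whitespace
    # 5. count identical words and list the most frequent first
    tr -cd '[:alnum:][:space:]' < "$1" |
        tr '[:upper:]' '[:lower:]' |
        tr -s '[:space:]' '\n' |
        grep -v '^$' |
        sort | uniq -c | sort -nr |
        head -n "${2:-20}"
}
```

Called as wordfreq df1.txt 20, this should give the same kind of list as above, minus the empty entry.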
Other Example
Here it is run on my class notes (in Dutch) about blood and the immune system:
remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt
remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20
195
108 de
80 een
72 van
65 het
51 in
46 is
40 en
24 zijn
24 op
24 afweer
22 die
20 vraag
20 deze
19 worden
18 kan
17 bij
16 dit
15 er
14 of
After stripping out the words that are not useful:
remy@vps8:~$ cat afwres.txt | head -n 10
24 afweer
14 cellen
11 bacterin
9 waar
9 reactie
9 antigeen
8 specifieke
7 milieu
7 lymfocyten
7 lichaam
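How afwres.txt was produced is not shown here. One possible way, sketched under assumptions: keep a hand-written list of words to ignore, one per line, in a file (the name stopwords.txt below is hypothetical) and drop those words with grep before counting. The function name filter_words is also made up.

```shell
# filter_words STOPFILE: read one word per line on stdin and drop every
# word that is listed in STOPFILE (also one word per line).
# Function and file names are illustrative, not from the article.
filter_words() {
    # -v invert the match, -x match whole lines only,
    # -F treat patterns as fixed strings, -f read patterns from a file
    grep -vxFf "$1"
}

# Possible usage (assumes stopwords.txt exists):
#   tr -s '[:space:]' '\n' < afweersan.txt | tr '[:upper:]' '[:lower:]' |
#       filter_words stopwords.txt | sort | uniq -c | sort -nr > afwres.txt
```

The -x flag matters: without it, ignoring "de" would also throw away "deze".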
Fabian Scherschel's NaNoWriMo 2011 book: Nightwatch
Git tree of the book & NaNoWriMo page. The book is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
1020 the
454 he
421 and
418 of
357 to
347 had
297 a
267 was
257 his
241 that
216 in
132 it
130 marc
112 him
108 as
105 this
105 they
93 with
90 but
82 were
82 from
82 been
82 at
74 on
70 would
68 for
68 could
56 their
56 be
53 out
51 into
50 man
49 all
48 there
48 so
48 by
47 looked
46 not
44 up
44 them
44 like
Analyzing IP and log files
Today I found another useful application for this command: analyzing IP addresses. First I grepped my entire lighttpd log file:
grep -Eo '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' access.log | tr '[:space:]' '\n' | grep -v '^\s*$' | sort | uniq -c | sort -bnr
(grep -o prints only the matching IP address, not the whole line it appears on. The dots are escaped so that they match a literal dot instead of any character.)
That gives this nice list (the list is made up, these are not real IP addresses):
2 83.64.150.248
2 94.0.74.75
2 94.142.55.252
2 95.237.133.3
2 98.225.130.26
3 108.100.28.45
3 213.93.70.87
5 81.30.145.69
348 66.228.43.247
467 173.255.236.50
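A quick way to check the pattern before pointing it at a real log is to feed it a few invented lines. The sketch below assumes a grep that supports -o (GNU and BSD grep do). The sample log lines, which use documentation-only addresses, and the function name count_ips are made up. Note that the pattern only checks the shape of an address (four groups of one to three digits); it does not check that each group is 255 or lower.

```shell
# count_ips: read log lines on stdin, print each IPv4-looking string
# with the number of times it occurs, most frequent first.
# The function name is illustrative.
count_ips() {
    grep -Eo '[[:digit:]]{1,3}(\.[[:digit:]]{1,3}){3}' |
        sort | uniq -c | sort -nr
}

# Made-up log lines, not real traffic.
printf '%s\n' \
    '203.0.113.7 - - "GET / HTTP/1.1" 200 512' \
    '198.51.100.23 - - "GET /a HTTP/1.1" 200 128' \
    '203.0.113.7 - - "GET /b HTTP/1.1" 404 0' |
    count_ips
```

On these three lines it reports 203.0.113.7 twice and 198.51.100.23 once; the "1.1" in HTTP/1.1 is not counted because it does not have four groups.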
Thanks to the wonderful community at Stack Exchange.
Tags: articles , awk , bash , log , lyrics , notes , sed , tr , word