Wednesday, June 27, 2007

Words in a text

The method GetWordsFromString(string s) shown below takes some text as input and returns the words used in it, sorted by the number of occurrences of each word, most frequent first. Words are compared case-insensitively, and only words of three or more characters are counted.

    // Requires: using System.Collections.Generic;
    //           using System.Text.RegularExpressions;

    // Compares words case-insensitively, so "The" and "the" count as one word.
    public class WordsComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y)
        {
            return x.ToLowerInvariant() == y.ToLowerInvariant();
        }

        public int GetHashCode(string obj)
        {
            return obj.ToLowerInvariant().GetHashCode();
        }
    }

    private static Dictionary<string, int> GetWordsFromString(string s)
    {
        Dictionary<string, int> resultDictionary =
            new Dictionary<string, int>(new WordsComparer());

        // Count every run of three or more word characters.
        Regex wordsRegEx = new Regex(@"\w{3,}");
        MatchCollection matches = wordsRegEx.Matches(s);
        foreach (Match match in matches)
        {
            if (!resultDictionary.ContainsKey(match.Value))
            {
                resultDictionary.Add(match.Value, 1);
            }
            else
            {
                resultDictionary[match.Value]++;
            }
        }

        // Copy the pairs into a list so they can be sorted.
        List<KeyValuePair<string, int>> sortedList =
            new List<KeyValuePair<string, int>>();
        foreach (string key in resultDictionary.Keys)
        {
            sortedList.Add(new KeyValuePair<string, int>(key, resultDictionary[key]));
        }

        // Sort by count, descending (o2 is compared to o1).
        sortedList.Sort
        (
            delegate (KeyValuePair<string, int> o1, KeyValuePair<string, int> o2)
            {
                return o2.Value.CompareTo(o1.Value);
            }
        );

        // Refill the dictionary in sorted order. Note that Dictionary does not
        // document its enumeration order; this relies on a freshly cleared
        // dictionary enumerating in insertion order.
        resultDictionary.Clear();
        foreach (KeyValuePair<string, int> kvp in sortedList)
        {
            resultDictionary.Add(kvp.Key, kvp.Value);
        }
        return resultDictionary;
    }

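
The method above returns a Dictionary and relies on it enumerating in insertion order, which the Dictionary documentation does not guarantee. A minimal sketch of a variant that returns an ordered list instead might look like the following. The names WordCounter and GetSortedWordCounts are hypothetical, and it swaps the custom comparer for the built-in StringComparer.OrdinalIgnoreCase and the manual sort for LINQ, so it assumes a framework version with System.Linq and may treat a few unusual characters differently from ToLowerInvariant().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Hypothetical helper (not from the original post): returns the
    // word/count pairs as a list ordered by descending count.
    public static List<KeyValuePair<string, int>> GetSortedWordCounts(string text)
    {
        // Case-insensitive keys; the first spelling seen is the one kept.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // Same pattern as in the post: runs of three or more word characters.
        foreach (Match match in Regex.Matches(text, @"\w{3,}"))
        {
            int current;
            counts.TryGetValue(match.Value, out current); // current is 0 if absent
            counts[match.Value] = current + 1;
        }

        // A list has a well-defined order, unlike a dictionary.
        return counts.OrderByDescending(pair => pair.Value).ToList();
    }
}
```

With this sketch, the top 100 words could be printed by looping over WordCounter.GetSortedWordCounts(text).Take(100) and writing each pair's Key and Value.
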
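For example, the top 100 words of Tolkien's "The Lord of the Rings" are listed below. Because the comparison is case-insensitive, each word appears in the casing of its first occurrence in the text: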

the 33644
and 22049
that 6766
was 6542
they 5214
You 5074
but 5044
his 4833
said 4229
not 4108
FOR 4021
with 3320
had 3245
were 2859
there 2784
have 2691
ALL 2557
him 2474
from 2296
them 2245
now 2205
their 2126
Frodo 1997
are 1818
then 1797
will 1791
out 1645
This 1596
Great 1388
came 1379
what 1357
Sam 1273
Long 1261
could 1235
come 1227
into 1214
more 1207
would 1204
down 1192
one 1172
Gandalf 1167
like 1160
When 1160
your 1155
again 1115
before 1101
some 1093
been 1064
back 1006
Many 1005
away 979
still 970
men 946
Last 903
upon 890
far 884
than 883
about 878
see 876
only 860
did 837
over 834
HERE 833
yet 823
Dark 817
its 809
time 805
has 796
ARAGORN 795
Old 785
well 772
can 761
way 758
went 754
any 743
even 729
must 712
may 712
seemed 712
where 707
our 704
shall 698
know 683
Pippin 676
which 670
looked 637
who 634
little 629
eyes 625
very 619
Hobbits 619
after 616
light 611
while 599
merry 584
Road 574
King 569
through 566
Ring 564
other 564

Tags: Sorting Dictionary by value
