
Question:
I have some code in which I compare a large body of data, say the source of a web page, against some words in a file. What is the best algorithm to use?
There are two scenarios:
1. I have a large number of words to compare against the source. A normal string search algorithm would have to take a word, compare it against the data, take the next word, compare it against the data, and so on until all the words are done.
2. I have only a couple of words in the file, so a normal string search would be fine, but I still need to reduce the time as much as possible.

What algorithm is best? I know about the Boyer-Moore and Rabin-Karp search algorithms. Although Boyer-Moore seems fast, I would also like the names of other algorithms and a comparison between them.
Answer1: In both cases, I think you probably want to construct a patricia trie (also called a radix tree). Most importantly, lookup time is O(k), where k is the length of the key being looked up (at most the length of the longest string in the trie).
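To make the O(k) lookup concrete, here is a minimal plain-trie sketch in Python. Note that it does not do the path compression that distinguishes a true patricia/radix tree (merging chains of single-child nodes into one edge), and the sample words are made up for illustration:

```python
class TrieNode:
    # A plain trie node; a patricia/radix tree would additionally
    # compress single-child chains into one substring-labelled edge.
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        # O(k) where k = len(word): one dictionary hop per character,
        # independent of how many words are stored in the trie.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

# Build the trie from the words file, then test words from the page source.
trie = Trie()
for w in ["apple", "app", "banana"]:   # placeholder word list
    trie.insert(w)
print(trie.contains("app"))     # True
print(trie.contains("applet"))  # False
```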
Answer2: Note that Boyer-Moore is for searching one piece of text (possibly *several* words) within a larger text.
If all you want is to identify individual words, then it is much easier to:
- put each *searched-for* word in a dictionary structure (whatever it is), and
- look up each word of the text in the dictionary.

This notably means that you can read the text as a stream and need not hold it all in memory at once (which works well with the typical case of a file cursor).
As for the structure of the dictionary, I would recommend a simple hash table, as in the sketch below. It does well memory-wise compared to tree structures.
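A minimal sketch of this streaming approach in Python, using the built-in set as the hash table. The file names `words.txt` and `page.html` and the crude regex tokenizer are assumptions for illustration:

```python
import re

# Hypothetical input files: one searched-for word per line in words.txt,
# and the page source to scan in page.html.
with open("words.txt") as f:
    searched = {line.strip().lower() for line in f if line.strip()}

found = set()
with open("page.html") as page:
    # Stream the page line by line; the whole text never sits in memory.
    for line in page:
        # Crude tokenizer: treat runs of letters/digits as words.
        for word in re.findall(r"[A-Za-z0-9]+", line.lower()):
            if word in searched:   # average O(1) hash lookup
                found.add(word)

print(found)   # the searched-for words that occur in the page
```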