
Best string search algorithm around

Question:

I have code that compares a large body of data, say the source of a web page, against a list of words stored in a file. What is the best algorithm to use?

There can be 2 scenarios:

<ol><li>

I have a large number of words to compare against the source. With a normal string search algorithm, I would have to take a word, search the data for it, take the next word and search again, and so on until every word has been checked.

</li> <li>

I have only a couple of words in the file, so a normal string search would be OK, but I still need to reduce the time as much as possible.

</li> </ol>

What algorithm is best? I know about Boyer-Moore and also Rabin-Karp search algorithms. Although Boyer-Moore search seems fast, I would also like names of other algorithms and their comparisons.

Answer1:

In both cases, I think you probably want to construct a Patricia trie (also called a radix tree). Most importantly, lookup time would be O(k), where k is the maximum length of a string in the trie.
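As a rough illustration of the idea, here is a minimal sketch in Python. It uses a plain (uncompressed) trie rather than a true Patricia trie, purely to keep the example short; the names `TrieNode`, `build_trie` and `find_all` are made up for this example. Each lookup from a given start position walks at most k characters, where k is the length of the longest stored word.

```python
class TrieNode:
    """One node of a plain character trie (not path-compressed)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a stored word ends at this node


def build_trie(words):
    """Insert every word into a fresh trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def find_all(text, root):
    """Return (start_index, word) for every stored word occurring in text."""
    matches = []
    for start in range(len(text)):
        node = root
        # Walk the trie for as long as the text keeps matching some word prefix.
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break
            if node.is_word:
                matches.append((start, text[start:end + 1]))
    return matches


trie = build_trie(["she", "sell", "shells"])
print(find_all("she sells shells", trie))
```

All words are checked in a single pass over the text, so the cost no longer grows with the number of words, only with the text length times k.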

Answer2:

Note that Boyer-Moore is meant for searching for a text (possibly <em>several</em> words long) within another text.

If all you want is identifying some individual words, then it's much easier to:

<ul><li>put each <em>searched</em> word in a dictionary structure (of whatever kind)</li> <li>look up each word of the text in the dictionary</li> </ul>

Most notably, this means that you can read the text as a stream and need not hold it all in memory at once (which works well with the typical example of a file cursor).

As for the structure of the dictionary, I would recommend a simple hash table. It works well memory-wise compared to tree structures.
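A minimal sketch of this approach in Python, using the built-in `set` as the hash table. The function name `find_words`, the `\w+` tokenization and the case-insensitive comparison are assumptions made for the example, not part of the question.

```python
import io
import re

# Assumption: a "word" is a run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+")


def find_words(stream, searched_words):
    """Return the searched words that occur in the text read from stream.

    The stream is consumed line by line, so the whole text is never
    held in memory at once.
    """
    # Assumption: matching is case-insensitive, so everything is lowercased.
    wanted = {w.lower() for w in searched_words}
    found = set()
    for line in stream:
        for token in WORD_RE.findall(line):
            token = token.lower()
            if token in wanted:  # average O(1) hash lookup
                found.add(token)
    return found


# io.StringIO stands in for an open file here.
text = io.StringIO("The quick brown fox\njumps over the lazy dog\n")
print(sorted(find_words(text, ["fox", "cat", "DOG"])))
```

Any object that yields lines works as the stream, so an open file handle can be passed in directly.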
