53701

fast string search to find string array element which matches the givven pattern

Question:

I have an array of constant strings which I iterate through to find an index of element which is a string that contains a search pattern. Which search algorithm should I choose to improve the speed of finding this element? I am not limited in time before running the application for preparing the look up tables if any are necessary.

I corrected a question - I am not doing exact string match - I am searching for pattern inside the element, which is in an array

array of strings: [0] Red fox jumps over the fence [1] Blue table [2] Red flowers on the fence

I need to find an element which contains word 'table' - in this case its element 1

I do like 50000 iterations of a set of 30 array which could contain up to 30000 strings of not less than 128 characters. Now I am using good-old strstr brute force which is too slow...

Ok, posting a part of my function, the first strstr - looks up in an uncut array of lines if there are any occurrences, then the brute search follows. I know I can speed this part, but I am not doing optimization on this approach...

// sheets[i].buffer - contains a page of a text which is split into lines // fullfunccall - is the pattern // sheets[i].cells[k] - are the pointers to lines in a buffer for( i=0; i<sheetcount; i++) { if( i!= currsheet && sheets[i].name && sheets[i].name[0] != '\0') { if( strstr(sheets[i].buffer, fullfunccall )) { usedexternally = 1; int foundatleastone = 0; for( k=0; k<sheets[i].numcells; k++ ) { strncpy_s(testline, MAX_LINE_SIZE, sheets[i].cells[k]->line, sheets[i].cells[k]->linesize); testline[sheets[i].cells[k]->linesize] = '\0'; if( strstr(testline, fullfunccall )) { dependency_num++; if( dependency_num >= MAX_CELL_DEPENDENCIES-1) { printf("allocation for sheet cell dependencies is insuficcient\n"); return; } sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num+1; foundatleastone++; sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k]; } } if( foundatleastone == 0 ) { printf("error locating dependency for external func: %s\n", fullfunccall ); return; } } }; }

Answer1:

You wrote that your 'haystack' (the set of strings to search through) is roughly 30000 strings with approx. 200 characters each. You also wrote that the 'needle' (the term to search for) is either a string of 5 or 20 characters.

Based on this, you could precompute a hashtable which maps any 5-character subsequence to the string(s) in the haystack it occurs in. For 30000 strings (200 characters each) there are at most 30000 * (200 - 5) = 5.850.000 different 5-character substrings. If you hash each of it to a 16bit checksum you'd need a minimum 11MB of memory (for the hash keys) plus some pointers pointing to the string(s) in which the substring occurs.

For instance, given a simplfied haystack of

static const char *haystack[] = { "12345", "123456", "23456", "345678" };

you precompute a hash map which maps any possible 5-character string such that

12345 => haystack[0], haystack[1] 23456 => haystack[1], haystack[2] 34567 => haystack[3] 45678 => haystack[4]

With this, you could take the first five characters of your given key (either 5 or 20 characters long), hash it and then do a normal strstr through all the strings to which the key is mapped by the hash.

Answer2:

For each sheet that you are treating, you could build a suffix array as described in <a href="http://www.cs.bell-labs.com/cm/cs/pearls/sec152.html" rel="nofollow">this article</a>. Before you start your search, read the sheet, find the line beginnings (as integer indices into the sheet buffer), create the suffix array and sort it as described in the article.

Now, if you are looking for the lines in which a pattern, say "table", occurs, you can search for the next entry after "table" and the next entry after "tablf", which is the first non-match, where you have moved the right-most letter, odometer-style.

If both indices are the same, there are no matches. If they are different, you'll get a list of pointers into the sheet:

"tab. And now ..." ---------------------------------------------------------------- "table and ..." 0x0100ab30 "table water for ..." 0x0100132b "tablet computer ..." 0x01000208 ---------------------------------------------------------------- "tabloid reporter ..."

This will give you a list of pointers from which, by subtracting the base pointer of the sheet buffer, you can get the integer offsets. Comparison with the line beginnings will give you the line numbers that correspond to these pointers. (The line numbers are sorted, so you can do binary search here.)

The memory overhead is an array of pointers that has the same size as the sheet buffer, so for 30,000 strings of 200 chars, that will be about 48MB on a 64-bit machine. (The overhead of the line indices is negligible.)

Sorting the array will take long, but it is done only once for each sheet.

<strong>Edit</strong>: The idea seems to work well. I have implemented it and can scan a dictionary of about 130,000 words on a <a href="http://www.gutenberg.org/cache/epub/1661/pg1661.txt" rel="nofollow">text file of nearly 600k</a> in less then one second:

#include <stdlib.h> #include <stdio.h> #include <string.h> #define die(...) exit((fprintf(stderr, "Fatal: " __VA_ARGS__), \ putc(10, stderr), 1)) typedef struct Sheet Sheet; struct Sheet { size_t size; /* Number of chars */ char *buf; /* Null-terminated char buffer */ char **ptr; /* Pointers into char buffer */ size_t nline; /* number of lines */ int *line; /* array of offset of line beginnings */ size_t naux; /* size of scratch array */ char **aux; /* scratch array */ }; /* * Count occurrence of c in zero-terminated string p. */ size_t strcount(const char *p, int c) { size_t n = 0; for (;;) { p = strchr(p, c); if (p == NULL) return n; p++; n++; } return 0; } /* * String comparison via pointers to strings. */ int pstrcmp(const void *a, const void *b) { const char *const *aa = a; const char *const *bb = b; return strcmp(*aa, *bb); } /* * Pointer comparison. */ int ptrcmp(const void *a, const void *b) { const char *const *aa = a; const char *const *bb = b; if (*aa == *bb) return 0; return (*aa < *bb) ? -1 : 1; } /* * Create and prepare a sheet, i.e. a text file to search. */ Sheet *sheet_new(const char *fn) { Sheet *sheet; FILE *f = fopen(fn, "r"); size_t n; int last; char *p; char **pp; if (f == NULL) die("Couldn't open %s", fn); sheet = malloc(sizeof(*sheet)); if (sheet == NULL) die("Allocation failed"); fseek(f, 0, SEEK_END); sheet->size = ftell(f); fseek(f, 0, SEEK_SET); sheet->buf = malloc(sheet->size + 1); sheet->ptr = malloc(sheet->size * sizeof(*sheet->ptr)); if (sheet->buf == NULL) die("Allocation failed"); if (sheet->ptr == NULL) die("Allocation failed"); fread(sheet->buf, 1, sheet->size, f); sheet->buf[sheet->size] = '\0'; fclose(f); sheet->nline = strcount(sheet->buf, '\n'); sheet->line = malloc(sheet->nline * sizeof(*sheet->line)); sheet->aux = NULL; sheet->naux = 0; n = 0; last = 0; p = sheet->buf; pp = sheet->ptr; while (*p) { *pp++ = p; if (*p == '\n') { sheet->line[n++] = last; last = p - sheet->buf + 1; } p++; } qsort(sheet->ptr, sheet->size, sizeof(*sheet->ptr), pstrcmp); return sheet; } /* * Clean up sheet. */ void sheet_delete(Sheet *sheet) { free(sheet->buf); free(sheet->ptr); free(sheet->line); free(sheet->aux); free(sheet); } /* * Binary range search for string pointers. */ static char **pstr_bsearch(const char *key, char **arr, size_t high) { size_t low = 0; while (low < high) { size_t mid = (low + high) / 2; int diff = strcmp(key, arr[mid]); if (diff < 0) high = mid; else low = mid + 1; } return arr + low; } /* * Binary range search for line offsets. */ static const int *int_bsearch(int key, const int *arr, size_t high) { size_t low = 0; while (low < high) { size_t mid = (low + high) / 2; int diff = key - arr[mid]; if (diff < 0) high = mid; else low = mid + 1; } if (low < 1) return NULL; return arr + low - 1; } /* * Find occurrences of the string key in the sheet. Returns the * number of lines in which the key occurs and assigns up to * max lines to the line array. (If max is 0, line may be NULL.) */ int sheet_find(Sheet *sheet, char *key, int line[], int max) { char **begin, **end; int n = 0; size_t i, m; size_t last; begin = pstr_bsearch(key, sheet->ptr, sheet->size); if (begin == NULL) return 0; key[strlen(key) - 1]++; end = pstr_bsearch(key, sheet->ptr, sheet->size); key[strlen(key) - 1]--; if (end == NULL) return 0; if (end == begin) return 0; m = end - begin; if (m > sheet->naux) { if (sheet->naux == 0) sheet->naux = 0x100; while (sheet->naux < m) sheet->naux *= 2; sheet->aux = realloc(sheet->aux, sheet->naux * sizeof(*sheet->aux)); if (sheet->aux == NULL) die("Re-allocation failed"); } memcpy(sheet->aux, begin, m * sizeof(*begin)); qsort(sheet->aux, m, sizeof(*begin), ptrcmp); last = 0; for (i = 0; i < m; i++) { int offset = sheet->aux[i] - sheet->buf; const int *p; p = int_bsearch(offset, sheet->line + last, sheet->nline - last); if (p) { if (n < max) line[n] = p - sheet->line; last = p - sheet->line + 1; n++; } } return n; } /* * Example client code */ int main(int argc, char **argv) { Sheet *sheet; FILE *f; if (argc != 3) die("Usage: %s patterns corpus", *argv); sheet = sheet_new(argv[2]); f = fopen(argv[1], "r"); if (f == NULL) die("Can't open %s.", argv[1]); for (;;) { char str[80]; int line[50]; int i, n; if (fgets(str, sizeof(str), f) == NULL) break; strtok(str, "\n"); n = sheet_find(sheet, str, line, 50); printf("%8d %s\n", n, str); if (n > 50) n = 50; for (i = 0; i < n; i++) printf(" [%d] %d\n", i, line[i] + 1); } fclose(f); sheet_delete(sheet); return 0; }

The implementation has its rough edges, but it works. I'm not especially fond of the scratch array and the additional sorting on the found pointer range, but it turns out that even sorting the large suffix array doesn't take too long.

You can extend this solution to more sheets, if you like.

Answer3:

I believe the most practical would be DFA as it reads every character of input at most once - more precisely it reads every input char once and stops as soon as pattern will not match definitely (if set up properly). With DFA you can also check against multiple patterns simultaneously.

Two good (but different) implementations of DFA algorithms well tested in practice are

<ul><li><a href="https://github.com/yandex/pire" rel="nofollow">PIRE</a></li> <li><a href="http://www.complang.org/ragel/" rel="nofollow">Ragel</a></li> </ul>

It's not possible say which fits your task unless you provide more on that.

edit: DFA stays for "Deterministic Finite Automata"

edit: as you indicated your patterns are exact substrings the most common solution is <a href="http://www.geeksforgeeks.org/searching-for-patterns-set-2-kmp-algorithm/" rel="nofollow">KMP</a> algorithm (Knuth-Morris-Pratt)

Recommend

  • Better way to find the powers of 2
  • Saving iterator from python's zip
  • How to check if list contains another list in same order
  • What is the unit of ImageFont.textsize() returned values?
  • fast string search to find string array element which matches the givven pattern
  • Extract Last Word in String with Swift
  • billing agreement token lifetime
  • Length of the longest sorted subsequence
  • Stopping cut off text in variable height div with css and javascript that has elements as children
  • Load frequent subsequences from TXT
  • Xolvio Cucumber - Getting errors in console yet all tests are passing
  • General contract for object comparision : equals() and hashCode()
  • Recursive function not behaving correctly
  • Passing a hashtable from C# to Powershell
  • Cassandra: What is a subcolumn
  • How can I count unique terms in a plaintext file case-insensitively?
  • Using HTML/CSS for UI in XNA?
  • C function strchr - How to calculate the position of the character?
  • Problem with Django using Apache2 (mod_wsgi), Occassionally is “unable to import from module” for no
  • Image map in Flex
  • UIAlertController button function not working
  • Trying to get the char code of ENTER key
  • Can't delete or rename original file after resizing
  • Android Google Maps API v2 start navigation
  • Does it make sense to call System.gc() and Thread.sleep() when working on Bitmaps?
  • MongoDb aggregation
  • Is there a way to do normal logging with EureakLog?
  • preg_replace Double Spaces to tab (\\t) at the beginning of a line
  • Use of this Javascript
  • Allowing both email and username for authentication
  • C++ Partial template specialization - design simplification
  • Jenkins: How To Build multiple projects from a TFS repository?
  • How to get next/previous record number?
  • Python: how to group similar lists together in a list of lists?
  • using HTMLImports.whenReady not working in chrome