What is the optimal way to choose a set of features for excluding items based on a bitmask when matc

Question:

Suppose I have a large, static set of objects, and I have an object that I want to match against all of them according to a complicated set of criteria that entails an expensive test.

Suppose also that it's possible to identify a large set of features that can be used to exclude potential matches, thereby avoiding the expensive test. If a feature is present in the object I am testing, then I can exclude any objects in the set that don't have this feature. In other words, the presence of the feature is necessary but not sufficient for the test to pass.

In that case, I can precompute a bitmask for each object in the set indicating whether each feature is present or absent in the object. I can also compute it for the object that I want to test, and then loop through the array like this (pseudo-code):

    objectMask = computeObjectMask(myObject)
    for (each testObject in objectSet) {
        if ((testObject.mask & objectMask) != objectMask) {
            // early out: some features are in objectMask
            // but not in testObject.mask, so the test can't pass
        } else if (doComplicatedTest(testObject, myObject)) {
            // found a match!
        }
    }
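For anyone who wants to experiment, here is a minimal runnable sketch of the same loop in Python. The feature list, the word list, and the `is_subsequence` test are illustrative assumptions (they are not part of my real application); the stand-in for the expensive test is the "same letters in the same order" criterion mentioned under possible applications.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Callable[[str], bool]


def compute_mask(obj: str, features: Sequence[Feature]) -> int:
    """Set bit i when feature i is present in obj."""
    mask = 0
    for bit, feature in enumerate(features):
        if feature(obj):
            mask |= 1 << bit
    return mask


def is_subsequence(candidate: str, query: str) -> bool:
    """Stand-in for the expensive test: query's letters occur in candidate, in order."""
    remaining = iter(candidate)
    return all(ch in remaining for ch in query)


def find_matches(
    query: str,
    object_set: Sequence[str],
    masks: Sequence[int],
    features: Sequence[Feature],
) -> Tuple[List[str], int]:
    """Return the matches and the number of expensive tests actually run."""
    query_mask = compute_mask(query, features)
    matches, tests_run = [], 0
    for candidate, mask in zip(object_set, masks):
        if (mask & query_mask) != query_mask:
            continue  # early out: query has a feature the candidate lacks
        tests_run += 1
        if is_subsequence(candidate, query):
            matches.append(candidate)
    return matches, tests_run


# Hypothetical features: "contains the literal character c" for a few letters.
features = [lambda s, c=c: c in s for c in "bdp"]
words = ["banana", "bandana", "cabana", "apple"]
masks = [compute_mask(w, features) for w in words]  # precomputed once

matches, tests_run = find_matches("bnn", words, masks, features)
```

Here "apple" is rejected by the mask alone, so only three of the four words reach the expensive test.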

So my question is: given a limited bitmask size, a large list of possible features, and a table of the frequencies of each feature in the object set (plus access to the object set itself, if you want to compute correlations between features and so on), what algorithm can I use to choose the optimal set of features to include in my bitmask, so as to maximize the number of early outs and minimize the number of expensive tests?

If I just choose the top x most common features, then the chance of a feature being in both masks is higher, so it seems like the number of early outs would be reduced. However, if I choose the x least common features, then objectMask might frequently be zero, meaning no early outs are possible. It seems pretty easy to experiment and come up with a set of middling-frequency features that gives good performance, but I'm interested in whether there is a theoretically best way of doing it.
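To make the "middling frequency" intuition concrete, here is a back-of-the-envelope calculation. It rests on assumptions that real features will not satisfy exactly: feature i appears with the same probability p_i in myObject and in each testObject, the two objects are independent, and the k chosen features are mutually independent.

```latex
\begin{align*}
P(\text{early out from feature } i)
  &= P(i \in \text{myObject}) \cdot P(i \notin \text{testObject})
   = p_i (1 - p_i), \\
\max_{p_i} \; p_i (1 - p_i) &= \tfrac{1}{4} \quad \text{at } p_i = \tfrac{1}{2}, \\
P(\text{no early out})
  &= \prod_{i=1}^{k} \bigl(1 - p_i (1 - p_i)\bigr)
   \;\ge\; \left(\tfrac{3}{4}\right)^{k}.
\end{align*}
```

Under these assumptions the ideal feature is present in about half the objects, and a 32-bit mask of such features would let through roughly one pair in ten thousand. Correlated features will do worse than this, which is why the frequencies alone probably aren't enough to pick the set.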

Note: the frequency of each feature is assumed to be the same in the set of possible myObjects as in the objectSet, although I'd be interested to know how to handle the case where it isn't. I'd also be interested to know if there is an algorithm for finding the best feature set given a large sample of candidate objects that are to be matched against the set.
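One way to frame the sample-based version, offered as a heuristic sketch and not as a known optimum: every (query, object) pair is an item, each feature "covers" the pairs it would exclude, and choosing the mask becomes a maximum-coverage problem. The usual greedy rule for maximum coverage (repeatedly take the feature that excludes the most pairs still surviving) is known to reach at least a 1 - 1/e fraction of the best achievable coverage. The function name and the set-of-feature-ids representation below are assumptions for illustration, and the brute-force pair enumeration is only practical on samples.

```python
from typing import List, Sequence, Set


def greedy_select(
    query_feats: Sequence[Set[int]],
    object_feats: Sequence[Set[int]],
    num_features: int,
    mask_bits: int,
) -> List[int]:
    """Greedily pick up to mask_bits feature ids that exclude the most pairs."""
    # Every (query, object) pair starts out needing the expensive test.
    surviving = {
        (qi, oi)
        for qi in range(len(query_feats))
        for oi in range(len(object_feats))
    }
    chosen: List[int] = []
    for _ in range(mask_bits):
        best_feature, best_excluded = None, set()
        for f in range(num_features):
            if f in chosen:
                continue
            # Feature f excludes a pair when the query has it and the object lacks it.
            excluded = {
                (qi, oi)
                for (qi, oi) in surviving
                if f in query_feats[qi] and f not in object_feats[oi]
            }
            if len(excluded) > len(best_excluded):
                best_feature, best_excluded = f, excluded
        if best_feature is None:
            break  # no remaining feature excludes anything new
        chosen.append(best_feature)
        surviving -= best_excluded
    return chosen


# Tiny made-up sample: 4 candidate features, 3 set objects, 3 sample queries.
object_feats = [{0, 1}, {0, 2}, {0, 1, 2}]
query_feats = [{0, 1}, {2}, {0, 3}]
selected = greedy_select(query_feats, object_feats, num_features=4, mask_bits=2)
```

Because it works directly on sampled pairs, this approach does not need the query and set frequencies to agree, and it automatically discounts a feature that is redundant with one already chosen. Note that feature 0, present in every object, is never selected because it can never cause an early out.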

Possible applications: matching an input string against a large number of regexes, matching a string against a large dictionary of words using a criterion such as "must contain the same letters in the same order, but possibly with extra characters inserted anywhere in the word", etc. Example features: "contains the literal character D", "contains the character F followed by the character G later in the string", etc. Obviously the set of possible features will be highly dependent on the specific application.

Answer1:

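You can try the Aho-Corasick algorithm. It is a multi-pattern matcher: it finds every occurrence of a whole set of literal patterns in a single pass over the input. Basically, it's a finite state machine built from a trie of the patterns, with failure links computed by a breadth-first traversal of the trie.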
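A compact sketch of the idea in Python, for illustration only. It assumes the patterns are plain literal strings, so it applies to the literal-substring case and not to arbitrary regexes or to the feature-selection question itself. The function names are made up for this example.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple


def build_automaton(patterns: Sequence[str]):
    """Build the trie, then add failure links with a breadth-first pass."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    fail: List[int] = [0]              # longest proper suffix that is also a trie path
    output: List[List[int]] = [[]]     # indices of patterns ending at each state

    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(idx)

    queue = deque(goto[0].values())    # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text: str, patterns: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, output = build_automaton(patterns)
    hits, state = [], 0
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in output[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits


hits = find_all("ushers", ["he", "she", "his", "hers"])
```

In practice you would build the automaton once for the static pattern set and reuse it for every input string; `find_all` rebuilds it on each call only to keep the example short.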
