I have a Hibernate-annotated class TestClass that contains a List<String> field that I am indexing with Lucene. Consider the following example: "Foo Bar" and "Bar Snafu" are two entries in the List for a particular record. Now, if a user searches on TestClass for "Foo Snafu", the record will be found. I am guessing this is because the token Foo and the token Snafu are both tokens in the List<String> for this record.
Is there a way I can prevent this from happening?
The real-world example is a Court case that has a List of Plaintiffs and Defendants. Say there are two people being prosecuted on the case, Joe Lewis Bob and Robert Clay Smith. These users are stored in the Court case record in a List of Defendants. This List of Defendants is indexed with Lucene. Now if a user searches for either of the two defendants mentioned earlier, the case will be found. But the case will also be found if a user searches for
Lewis Smith, or
<strong>Update:</strong> It was mentioned in the Lucene IRC channel that I could possibly use a multi-valued field.
<strong>Update 2:</strong> It was mentioned in the Solr IRC channel that I could use the positionIncrementGap setting in schema.xml to accomplish this with Solr. Apparently if I use a phrase query (with or without slop) then "the increment gap ensures that different values in the same field won't cause an unintended match".
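To make the increment-gap idea concrete, here is a minimal plain-Java sketch of how I understand it. It is a toy model of token positions, not Lucene or Solr code: the class name PositionGapSketch and its methods are made up for illustration, tokens are split on whitespace only, and the phrase check is in-order only (real sloppy phrase matching is more permissive).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of token positions in one multi-valued field (hypothetical, not a Lucene API).
final class PositionGapSketch {

    // Maps each term to the positions it occupies, leaving `gap` unused
    // positions between consecutive list values.
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int next = 0;
        for (String value : values) {
            if (value.trim().isEmpty()) {
                continue; // nothing to tokenize
            }
            for (String term : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                positions.computeIfAbsent(term, t -> new ArrayList<>()).add(next++);
            }
            next += gap; // hole between this value and the next one
        }
        return positions;
    }

    // True if the phrase terms occur in order with at most `slop`
    // skipped positions in total between them.
    static boolean matches(Map<String, List<Integer>> index, String phrase, int slop) {
        String[] terms = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int start : index.getOrDefault(terms[0], new ArrayList<>())) {
            if (extend(index, terms, 1, start, slop)) {
                return true;
            }
        }
        return false;
    }

    private static boolean extend(Map<String, List<Integer>> index, String[] terms,
                                  int i, int prev, int slopLeft) {
        if (i == terms.length) {
            return true; // every term has been placed
        }
        for (int pos : index.getOrDefault(terms[i], new ArrayList<>())) {
            int skipped = pos - prev - 1;
            if (skipped >= 0 && skipped <= slopLeft
                    && extend(index, terms, i + 1, pos, slopLeft - skipped)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> defendants = new ArrayList<>();
        defendants.add("Joe Lewis Bob");
        defendants.add("Robert Clay Smith");

        // No gap: "Lewis Smith" with slop 5 matches across the two names.
        System.out.println(matches(index(defendants, 0), "Lewis Smith", 5));         // true
        // Gap of 100: the same query can no longer bridge the two values.
        System.out.println(matches(index(defendants, 100), "Lewis Smith", 5));       // false
        // A phrase that stays inside one value still matches.
        System.out.println(matches(index(defendants, 100), "Robert Clay Smith", 0)); // true
    }
}
```

If this model is right, the gap only helps when the gap is larger than any slop I allow, and only for phrase queries; a plain boolean query on the two terms would still match.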
Lucene appends successive additions to the same field in the same document to the end of what it already has in the field.
If you want to treat each member of the List as an entirely separate entity, you should index them in different fields. You could just append each member's list index to the field name you are already using. I don't have complete information on your needs, of course, but doing something like this is probably the better solution.
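As a rough illustration of the separate-fields idea, here is a plain-Java sketch rather than actual Lucene code. The class PerItemFieldSketch, the "defendant" base name, and the "_" separator are all assumptions for the example; it only models the rule that every query term has to be found within a single field.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of "one field per list member" (hypothetical, not a Lucene API).
final class PerItemFieldSketch {

    // Builds field names such as defendant_0, defendant_1, each holding
    // the lowercased whitespace tokens of one list member.
    static Map<String, Set<String>> toFields(String baseName, List<String> items) {
        Map<String, Set<String>> fields = new LinkedHashMap<>();
        for (int i = 0; i < items.size(); i++) {
            String[] terms = items.get(i).trim().toLowerCase(Locale.ROOT).split("\\s+");
            fields.put(baseName + "_" + i, new HashSet<>(Arrays.asList(terms)));
        }
        return fields;
    }

    // True only if a single field contains every term of the query.
    static boolean matchesWithinOneField(Map<String, Set<String>> fields, String query) {
        List<String> wanted = Arrays.asList(query.trim().toLowerCase(Locale.ROOT).split("\\s+"));
        for (Set<String> terms : fields.values()) {
            if (terms.containsAll(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> fields =
                toFields("defendant", Arrays.asList("Joe Lewis Bob", "Robert Clay Smith"));
        System.out.println(fields.keySet());                              // [defendant_0, defendant_1]
        System.out.println(matchesWithinOneField(fields, "Lewis Smith")); // false
        System.out.println(matchesWithinOneField(fields, "Clay Smith"));  // true
    }
}
```

The trade-off is that a search then has to be run against every numbered field, so the number of fields to query grows with the longest list.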
If you just want to search for the precise text "Foo Snafu", you can use a <a href="http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/PhraseQuery.html" rel="nofollow">PhraseQuery</a>. If you want to be sure your phrase query doesn't cross from one list item to the next (i.e., if you had "Bar Foo" and "Snafu Bar" in the index), you could insert some form of delimiting term between each member when writing to the index.