I am using Solr to index a field that contains both Chinese and English text. I also need to use the NGramTokenizerFactory tokenizer for searching.
Below is the field type I currently define for the field:
<pre class="lang-xml prettyprint-override">
<fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
</pre>
I have to set minGramSize="1" to allow searching for a single Chinese character. However, this is entirely unsuitable for searching English words.
For example, if I search for "see", it returns "s", "se", "ee", "see", and "e".
Could anyone tell me the best way to index a field that contains both Chinese and English?
Answer1:
I'm sure that this isn't the answer you were hoping for, but it's the one that will actually solve the problem: don't use <em>a single field</em> to contain <em>both</em> Chinese and English.
Have one field for English and one field for Chinese, and index each piece of content into the field that matches its language. If you don't know the language at indexing time, you can use the <a href="https://wiki.apache.org/solr/LanguageDetection" rel="nofollow">Language Detection</a> feature in an update processor to let Solr decide which field to put the content into.
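As a rough sketch of that setup, assuming the langid contrib jars are loaded and that your input field is called `content` (a hypothetical name, as are `text_en_simple`, `text_zh_ngram`, `content_en`, `content_zh` and `language_s`), the schema and update chain could look something like the following. The `langid.*` parameter names are the ones I recall from the Solr language detection docs, and the `zh-cn`/`zh-tw` codes are what I'd expect the LangDetect implementation to report for Chinese, so please verify both against your Solr version.

```xml
<!-- schema.xml: one field type per language (all names here are hypothetical) -->

<!-- English: word-based tokenization, no n-grams -->
<fieldType name="text_en_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Chinese: character n-grams, reusing the gram sizes from the question -->
<fieldType name="text_zh_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
</fieldType>

<field name="content_en" type="text_en_simple" indexed="true" stored="true"/>
<field name="content_zh" type="text_zh_ngram" indexed="true" stored="true"/>

<!-- solrconfig.xml: detect the language of "content" and map it to content_<lang> -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <!-- field(s) to run detection on -->
      <str name="langid.fl">content</str>
      <!-- where to store the detected language code -->
      <str name="langid.langField">language_s</str>
      <!-- rename content to content_<lang> based on the detected language -->
      <bool name="langid.map">true</bool>
      <!-- assumption: fold both Chinese variants into a single "_zh" suffix -->
      <str name="langid.map.lcmap">zh-cn:zh zh-tw:zh</str>
      <!-- only accept these languages, fall back to English otherwise -->
      <str name="langid.whitelist">en,zh-cn,zh-tw</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain then has to be selected for your update requests, for example through the `update.chain` request parameter.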
Searching is then done across both fields (depending on your query handler, possibly using qf), allowing tokens in each language to be processed separately against each field (so that English words don't get ngram-ed).
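For instance, with the edismax query parser you could list both fields in `qf` as handler defaults. This is only a sketch: `content_en` and `content_zh` are hypothetical field names standing in for your English and Chinese fields, and the handler name is whatever you already use.

```xml
<!-- solrconfig.xml: search both language fields with one query -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- each field analyzes the query with its own query-time analyzer -->
    <str name="qf">content_en content_zh</str>
  </lst>
</requestHandler>
```

The same thing can be done per request by passing `defType=edismax` and `qf=content_en content_zh` as query parameters instead of setting defaults.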
If you have both English and Chinese in the same document, process the document to separate the Chinese and English parts (for example, iterate over each paragraph and detect its language) before indexing them into different fields.
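To illustrate the end result, here is what an update message might look like after your own preprocessing has split a mixed document by paragraph. The field names `content_en` and `content_zh` are hypothetical, the sample text is made up, and both fields would need `multiValued="true"` in the schema to accept more than one paragraph each.

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- paragraphs detected as English -->
    <field name="content_en">This paragraph was detected as English.</field>
    <field name="content_en">So was this one.</field>
    <!-- paragraphs detected as Chinese -->
    <field name="content_zh">这一段被检测为中文。</field>
  </doc>
</add>
```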