
How to define a field type for a field that contains both Chinese and English

Question:

I am using Solr to index a field that can contain both Chinese and English text. I also need to use the NGramTokenizerFactory tokenizer for searching.

Below is the current field type I defined for the field:

<pre class="lang-xml prettyprint-override"><fieldType name="text_general2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType></pre>

I have to set minGramSize="1" so that a single Chinese character can be searched. However, this setting is inappropriate for searching English words, because every English word is also broken into fragments as short as one letter.

For example, if I search for "see", it returns "s", "se", "ee", "see", "e".

Could anyone tell me the best way to index a field that contains both Chinese and English?

Answer1:

I'm sure that this isn't the answer you were hoping for, but it's the answer that will actually solve it: Don't use <em>a single field</em> to contain <em>both</em> Chinese and English.

Have one field for English and one field for Chinese, and index each piece of content to the field that matches its language. If you don't know the language at indexing time, you can use the <a href="https://wiki.apache.org/solr/LanguageDetection" rel="nofollow">Language Detection</a> feature in an update processor to let Solr decide which field to put the content into.
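As a rough illustration of the two-field layout, the schema fragment below is a minimal sketch rather than a tested configuration. The names text_en_simple, text_zh_ngram, content_en and content_zh are hypothetical and do not come from the question. The Chinese type simply reuses the n-gram settings from the question, while the English type assumes that word-level tokenization with solr.StandardTokenizerFactory is sufficient for your English content.

```xml
<!-- English: tokenize on word boundaries, so "see" stays one token.
     The type name text_en_simple is hypothetical. -->
<fieldType name="text_en_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Chinese: n-grams starting at one character, as in the question.
     The type name text_zh_ngram is hypothetical. -->
<fieldType name="text_zh_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
</fieldType>

<!-- One field per language; both field names are hypothetical. -->
<field name="content_en" type="text_en_simple" indexed="true" stored="true"/>
<field name="content_zh" type="text_zh_ngram" indexed="true" stored="true"/>
```

With a layout like this, minGramSize="1" only affects the Chinese field, so English words are no longer split into single letters.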

Searching is then done across both fields (depending on your query handler, possibly using qf). Each field applies its own analysis chain to the query, so tokens are processed separately for each language and English words don't get n-grammed.
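One way to search both fields at once is to set qf in the defaults of a request handler that uses the edismax query parser. The fragment below is a sketch under assumptions: content_en and content_zh are hypothetical field names standing in for your English and Chinese fields, and the /select handler name is only an example.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax runs the user's query against every field listed in qf,
         using each field's own query analyzer. -->
    <str name="defType">edismax</str>
    <!-- Hypothetical field names: replace with your own fields. -->
    <str name="qf">content_en content_zh</str>
  </lst>
</requestHandler>
```

The same parameters can also be passed per request (for example defType=edismax&qf=content_en content_zh) instead of being fixed in the handler defaults.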

If you have both English and Chinese in the same document, process the document to separate the Chinese and English parts (for example, iterate over each paragraph and detect its language) before indexing them to different fields.
