
Solr score keyword detection rate

Question:

I'm using Solr 6.1.

I'm tuning the scoring now, but I have an issue with the scores.

I search for GCS, and qf is set to: title^100 content^70 text^50.

All three fields are of type text_general.
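For context, the request looks roughly like this (a sketch: the host, collection name, and parser choice are assumptions; qf is URL-encoded):

    http://localhost:8983/solr/mycollection/select?q=GCS&defType=edismax&qf=title%5E100%20content%5E70%20text%5E50&debugQuery=true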

The first result has a score of 1050.8486 and the other 853.08655.

But the first one's content field is very short, while the other one's content field is very long.

I just don't understand why the first one scores higher.

The debugQuery output for the two results is below:

1002.8741 = sum of:
  1002.8741 = max of:
    1002.8741 = weight(title:GCS in 1275) [], result of:
      1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        1.177973 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          4.0 = fieldLength
    928.3479 = weight(content:GCS in 1275) [], result of:
      928.3479 = score(doc=1275,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        1.8474623 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          16.0 = fieldLength

---

811.1335 = sum of:
  811.1335 = max of:
    127.21202 = weight(text:GCS in 9400) [], result of:
      127.21202 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        7.464645 = idf(docFreq=78, docCount=137000)
        0.3408388 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          44.69738 = avgFieldLength
          256.0 = fieldLength
    811.1335 = weight(title:GCS in 9400) [], result of:
      811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.513557 = idf(docFreq=27, docCount=137000)
        0.9527551 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.75 = parameter b
          6.3423285 = avgFieldLength
          7.111111 = fieldLength
    174.06395 = weight(content:GCS in 9400) [], result of:
      174.06395 = score(doc=9400,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        7.1785564 = idf(docFreq=104, docCount=137000)
        0.34639663 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.75 = parameter b
          176.37256 = avgFieldLength
          7281.778 = fieldLength

===========================================================================

I have another question: when I use shards, omitNorms doesn't seem to work. Why? I still see short content scoring higher than long content, and the schema is the same on both collections.
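My schema disables norms roughly like this (the exact field definitions here are a sketch, inferred from the field names and the "norms omitted" lines in the debug output below):

    <field name="title"   type="text_general" indexed="true" stored="true" omitNorms="true"/>
    <field name="content" type="text_general" indexed="true" stored="true" omitNorms="true"/>
    <field name="text"    type="text_general" indexed="true" stored="true" omitNorms="true"/>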

The first result is from collection A and has short content; the other is from collection B and has long content:

1158.9161 = sum of:
  1158.9161 = max of:
    1158.9161 = weight(title:boeing in 52601) [], result of:
      1158.9161 = score(doc=52601,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        11.589161 = idf(docFreq=5, docCount=593568)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1085.6042 = weight(content:boeing in 52601) [], result of:
      1085.6042 = score(doc=52601,freq=2.0 = termFreq=2.0), product of:
        70.0 = boost
        11.279006 = idf(docFreq=7, docCount=593568)
        1.375 = tfNorm, computed from:
          2.0 = termFreq=2.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)

---

1060.8777 = sum of:
  1060.8777 = max of:
    433.1234 = weight(text:boeing in 39406) [], result of:
      433.1234 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        50.0 = boost
        8.662468 = idf(docFreq=112, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    884.746 = weight(title:boeing in 39406) [], result of:
      884.746 = score(doc=39406,freq=1.0 = termFreq=1.0), product of:
        100.0 = boost
        8.84746 = idf(docFreq=93, docCount=650450)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    1060.8777 = weight(content:boeing in 39406) [], result of:
      1060.8777 = score(doc=39406,freq=7.0 = termFreq=7.0), product of:
        70.0 = boost
        8.069756 = idf(docFreq=203, docCount=650450)
        1.8780489 = tfNorm, computed from:
          7.0 = termFreq=7.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)

Answer1:

The underlying similarity Solr 6.1 uses is BM25 [1].

This means the field value length, compared to the average length of that field across the index, matters. More specifically: you are using dismax, which by default keeps only the maximum of the per-field scores. So, exploring the maximums:

First document maximum (note fieldLength = 4.0):

1002.8741 = weight(title:GCS in 1275) [], result of:
  1002.8741 = score(doc=1275,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    1.177973 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      4.0 = fieldLength

Second document maximum (note fieldLength = 7.111111):

811.1335 = weight(title:GCS in 9400) [], result of:
  811.1335 = score(doc=9400,freq=1.0 = termFreq=1.0), product of:
    100.0 = boost
    8.513557 = idf(docFreq=27, docCount=137000)
    0.9527551 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.3423285 = avgFieldLength
      7.111111 = fieldLength
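Both tfNorm values come from Lucene's BM25 tf normalization; plugging the numbers from the two excerpts into it shows that fieldLength alone drives the gap:

\[
\mathrm{tfNorm} = \frac{tf \cdot (k_1 + 1)}{tf + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{fieldLength}}{\mathrm{avgFieldLength}}\right)}
\]

% first document title, fieldLength = 4.0
\[
\frac{1 \cdot 2.2}{1 + 1.2 \cdot \left(0.25 + 0.75 \cdot \frac{4.0}{6.3423285}\right)} \approx 1.17797
\]

% second document title, fieldLength = 7.111111
\[
\frac{1 \cdot 2.2}{1 + 1.2 \cdot \left(0.25 + 0.75 \cdot \frac{7.111111}{6.3423285}\right)} \approx 0.95276
\]

The longer title inflates the denominator, so its tfNorm, and with it the whole product, shrinks.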

So the shorter title of the first document makes it the winner. You can play with dismax/edismax parameters to take the other fields into consideration as well, not only the maximum [2].
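A minimal sketch of that tuning, assuming the handler defaults live in solrconfig.xml: the tie parameter blends the non-maximum field scores into the final score (final = max + tie × sum of the other field scores; tie=0.0 is the current pure-max behaviour, tie=1.0 sums all fields):

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title^100 content^70 text^50</str>
        <!-- tie > 0 lets content/text contribute even when title is the max -->
        <str name="tie">0.1</str>
      </lst>
    </requestHandler>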

Regards

[1] http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

[2] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thetie_TieBreaker_Parameter
