Inverse Document Frequency Formula

<h3>Question</h3>

I'm having trouble manually calculating tf-idf values. Python's scikit-learn keeps producing values different from what I'd expect.

I keep reading that

idf(term) = log(# of docs / # of docs with term)

If so, won't you get a divide-by-zero error if there are no docs with the term?

To solve that problem, I read that you do

log(# of docs / (# of docs with term + 1))

But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.

What am I not getting?


<h3>Answer1:</h3>

The trick you describe is called Laplace smoothing (also known as additive or add-one smoothing). It is supposed to add the same summand to the other part of the fraction as well: the numerator in your case, or the denominator in the original case.

In other words, you should add 1 to the total number of docs:

log((# of docs + 1) / (# of docs with term + 1))
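
As a rough illustration, the formula above can be sketched in plain Python. The function name `smoothed_idf` and the corpus size of 4 are made up for this example, and the natural log is assumed:

```python
import math


def smoothed_idf(n_docs, n_docs_with_term):
    """Add-one smoothed idf: log((N + 1) / (df + 1)), natural log assumed."""
    return math.log((n_docs + 1) / (n_docs_with_term + 1))


n_docs = 4  # hypothetical corpus size

# Term in no documents: no division by zero, the value is finite and positive.
print(smoothed_idf(n_docs, 0))

# Term in every document: log(5 / 5) is exactly 0.0, not negative.
print(smoothed_idf(n_docs, n_docs))
```

With 1 added on both sides, the ratio is never below 1, so the result is never negative. As far as I know, scikit-learn's documented default (`smooth_idf=True`) uses this same ratio, then adds 1 to the result and L2-normalizes each row. That is one likely reason hand-computed values differ from its output.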

Btw, it is often better to use a smaller summand, especially for a small corpus:

log((# of docs + a) / (# of docs with term + a)),

where a = 0.001 or something like that.
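
A minimal sketch of this generalized version follows. The name `additive_idf` and the 3-document corpus are hypothetical, chosen only to show how the choice of `a` changes the result:

```python
import math


def additive_idf(n_docs, n_docs_with_term, a=0.001):
    """Additively smoothed idf: log((N + a) / (df + a)), with a > 0."""
    if a <= 0:
        raise ValueError("a must be positive to avoid division by zero")
    return math.log((n_docs + a) / (n_docs_with_term + a))


n_docs = 3  # hypothetical small corpus
unsmoothed = math.log(n_docs / 1)  # plain idf for a term found in 1 document

# A small summand stays closer to the unsmoothed value than a = 1 does.
print(unsmoothed, additive_idf(n_docs, 1, a=0.001), additive_idf(n_docs, 1, a=1.0))

# A term found in every document still yields exactly 0.0.
print(additive_idf(n_docs, n_docs))
```

The trade-off is that a very small `a` gives terms that appear in no documents a very large idf, so the right value depends on how such terms should be weighted.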

Source: https://stackoverflow.com/questions/32279651/inverse-document-frequency-formula
