# Inverse Document Frequency Formula

<h3>Question</h3>

I'm having trouble manually calculating tf-idf values. scikit-learn in Python keeps giving me different values than I'd expect. Is this the right formula for idf?

```idf(term) = log(# of docs / # of docs with term)```

If so, won't you get a divide-by-zero error if no docs contain the term?

To solve that problem, I read that you do

```log(# of docs / (# of docs with term + 1))```

But then if the term is in every document, you get log(n / (n + 1)), which is negative, and that doesn't really make sense to me.

What am I not getting?

<h3>Answer</h3>

The trick you describe is called Laplace smoothing (also known as additive or add-one smoothing), and it is supposed to add the same summand to the other part of the fraction as well: the numerator in your case, or the denominator in the original case.

In other words, you should add 1 to the total number of docs:

```log((# of docs + 1) / (# of docs with term + 1))```
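
To make the difference concrete, here is a minimal Python sketch of the three variants discussed so far, using only the standard library. The function names are made up for illustration and do not come from any library; `n_docs` is the number of docs and `df` is the number of docs with the term.

```python
import math

def idf_plain(n_docs, df):
    """log(n / df): raises ZeroDivisionError when df == 0."""
    return math.log(n_docs / df)

def idf_denominator_only(n_docs, df):
    """log(n / (df + 1)): defined for df == 0, but negative when df == n."""
    return math.log(n_docs / (df + 1))

def idf_add_one(n_docs, df):
    """log((n + 1) / (df + 1)): defined for df == 0, never negative for df <= n."""
    return math.log((n_docs + 1) / (df + 1))

n = 4
print(idf_denominator_only(n, 4))  # term in every doc: negative value
print(idf_add_one(n, 4))           # term in every doc: 0.0
print(idf_add_one(n, 0))           # term in no doc: still defined
```

With 1 added on both sides, a term that appears in every document gets an idf of exactly zero instead of a negative value, and a term that appears nowhere no longer causes a division by zero.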

Btw, it is often better to use a smaller summand, especially with a small corpus:

`log((# of docs + a) / (# of docs with term + a))`,

where a = 0.001 or something like that.
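
A rough sketch of the generalized version, again in plain Python with illustrative names of my own. The second function is an assumption about why scikit-learn's numbers differ from a hand calculation: as far as I understand its documentation, `TfidfTransformer` with the default `smooth_idf=True` uses the natural log, adds 1 to the idf itself, and then L2-normalizes each row by default (the normalization is not reproduced here).

```python
import math

def idf_smoothed(n_docs, df, a=1.0):
    """log((n + a) / (df + a)) with a configurable summand a > 0."""
    return math.log((n_docs + a) / (df + a))

def idf_sklearn_style(n_docs, df):
    """Assumed scikit-learn default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0

# Small corpus: 3 docs, term appears in 1 of them.
n, df = 3, 1
print(idf_smoothed(n, df, a=1.0))    # add-one: pulled noticeably toward zero
print(idf_smoothed(n, df, a=0.001))  # small summand: close to the unsmoothed value
print(math.log(n / df))              # unsmoothed value, for comparison
print(idf_sklearn_style(n, df))      # add-one value shifted up by 1
```

On a corpus this small, `a = 1` changes the result a lot, while `a = 0.001` stays close to the unsmoothed value and still avoids the division by zero.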