I have two lists of words, each ordered by number of occurrences.
The orderings were generated by counting each word in two files sampled at different points in time.
I would like to calculate Spearman's rank correlation to see how well the order in the first file is preserved in the second file.
File a: 1) is 2) went 3) work
File b: 1) is 2) work 3) went
Because the ordering is different, I would not get a score of 1.0, but I would still expect a score suggesting that these two samples are rather similar.
My problem is missing values. A word in file A might not exist in file B. Can I use Spearman's rank correlation in this case? Or would another correlation measure be better suited?
When it comes to ranks, your application does not need to have missing values at all. When a word occurs in one file but not in the other, you can give it the last rank in the other file (or an equal, tied last rank if several words are missing).
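As a rough illustration of the tied-last-rank idea, here is a minimal Python sketch using only the standard library. It assumes each file has already been reduced to a word-to-count dictionary; the function names `average_ranks` and `spearman_with_missing` are made up for this example. A word missing from one file is given a count of 0 there, so all missing words end up sharing the last (average) rank, and Spearman's coefficient is then computed as the Pearson correlation of the two rank lists.

```python
from statistics import mean


def average_ranks(scores):
    """Rank words from highest count (rank 1) down; ties share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the whole run of words tied with ordered[i].
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for word in ordered[i:j + 1]:
            ranks[word] = shared
        i = j + 1
    return ranks


def spearman_with_missing(counts_a, counts_b):
    """Spearman's rho over the union vocabulary; missing words count as 0.

    Assumes every word that is present has a count above 0, so the missing
    words are tied for last place. Raises ZeroDivisionError if all words
    in one file are tied, because the ranks then have no variance.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    ranks_a = average_ranks({w: counts_a.get(w, 0) for w in vocab})
    ranks_b = average_ranks({w: counts_b.get(w, 0) for w in vocab})
    xs = [ranks_a[w] for w in vocab]
    ys = [ranks_b[w] for w in vocab]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# The three-word example from the question (the counts are invented,
# only their order matters): "went" and "work" swap places.
file_a = {"is": 3, "went": 2, "work": 1}
file_b = {"is": 3, "work": 2, "went": 1}
print(spearman_with_missing(file_a, file_b))  # 0.5
```

With only three words, a single swap already drops the coefficient to 0.5; with a realistic vocabulary, one swapped pair has a much smaller effect.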
However, I am not sure how a large number of missing values (that is, many tied last ranks) would affect the Spearman value. You might instead consider a standard correlation/regression on the relative frequencies rather than the Spearman coefficient.
Say file x has m=113 words and file y has n=234. We can create a table of relative word frequencies like so:
word          x        y
is            5/113    23/234
the           4/113    45/234
a             4/113    17/234
farnarkling   1/113    0/234
elbow         0/113    2/234
...
===============================
TOTAL         113/113  234/234
You would then calculate:
word          x        y        u=x*y      v=x*x
is            5/113    23/234   115/26442  25/12769
the           4/113    45/234   180/26442  16/12769
a             4/113    17/234   68/26442   16/12769
farnarkling   1/113    0/234    0/26442    1/12769
elbow         0/113    2/234    0/26442    0/12769
...
========================================================
TOTAL         113/113  234/234  s=(sum of u)  t=(sum of v)
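Your answer is given by s/t, which is the slope of a least-squares line through the origin for y regressed on x. Because x and y are relative frequencies, a value close to 1 implies a good correspondence. (If you used raw counts instead of relative frequencies, the value to compare against would be n/m.)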
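The table calculation can be sketched in Python as follows. This is only an illustration under assumptions: the input is taken to be two word-to-count dictionaries, the function name `relative_frequency_slope` is invented here, and the small counts in the usage lines are made up rather than taken from the table. `fractions.Fraction` keeps the arithmetic exact, matching the fractions shown above. Note that when both files have identical relative frequencies, every y equals the corresponding x, so s equals t and the ratio is exactly 1.

```python
from fractions import Fraction


def relative_frequency_slope(counts_x, counts_y):
    """Return s/t, where s = sum(x*y) and t = sum(x*x) over all words.

    x and y are the relative frequencies of each word in the two files;
    a word missing from one file has relative frequency 0 there.
    Both dictionaries must be non-empty with positive totals.
    """
    m = sum(counts_x.values())  # total words in file x
    n = sum(counts_y.values())  # total words in file y
    s = Fraction(0)
    t = Fraction(0)
    for word in set(counts_x) | set(counts_y):
        x = Fraction(counts_x.get(word, 0), m)
        y = Fraction(counts_y.get(word, 0), n)
        s += x * y  # column u in the table
        t += x * x  # column v in the table
    return s / t


# Hypothetical counts: file y is file x with every count doubled, so the
# relative frequencies match and the ratio is exactly 1.
x_counts = {"is": 2, "went": 1, "work": 1}
y_counts = {"is": 4, "went": 2, "work": 2}
print(relative_frequency_slope(x_counts, y_counts))  # 1
```

The ratio alone does not say how tightly the points follow the line, so it may be worth looking at a correlation coefficient or a scatter plot of y against x as well.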
Some possibly useful links are: