Multiple Spelling Results in a Dataframe

Question:

I have some data containing spelling errors. I'm correcting them and scoring how close the spelling is using the following code:

    import pandas as pd
    import difflib

    Li_A = ["potato", "tomato", "squash", "apple", "pear"]

    Q = {'one': pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"],
                          index=['a', 'b', 'c', 'd', 'e']),
         'two': pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"],
                          index=['a', 'b', 'c', 'd', 'e'])}

    df_Q = pd.DataFrame(Q)

    # Define the function that Corrects & Scores the Spelling
    def Spelling(ask):
        a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
        # List comprehension for all values of a
        b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
        return pd.Series(a + b)

    # Apply the function that Corrects & Scores the Spelling
    df_A = df_Q['one'].apply(Spelling)

    # Get the column names on the A dataframe
    c = len(df_A.columns) // 2
    df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                   ['Score_{}'.format(y) for y in range(c)]

    # Join the Q & A dataframes
    df_QA = df_Q.join(df_A)

This gives the result:

    df_QA
          one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
    a  potat0  po1ato     potato     tomato       pear      apple     squash
    b  toma3o  2omato     tomato     potato       pear      apple     squash
    c  s5uash  squ0sh     squash       pear      apple     tomato     potato
    d   ap8le   2pple      apple       pear     tomato     squash     potato
    e    pea7    p3ar       pear     potato      apple     tomato     squash

        Score_0   Score_1   Score_2   Score_3   Score_4
    a  0.833333  0.500000  0.400000  0.181818  0.166667
    b  0.833333  0.333333  0.200000  0.181818  0.166667
    c  0.833333  0.200000  0.181818  0.166667  0.166667
    d  0.800000  0.222222  0.181818  0.181818  0.181818
    e  0.750000  0.400000  0.444444  0.200000  0.200000

For row "e", "potato" is in column Spelling_1 and "apple" is in column Spelling_2. However, "apple" got a higher score (0.444444) than "potato" (0.400000). This is the wrong way round for my application.

How do I get the higher-scoring results to be consistently to the left, please?

<strong>Edit 1</strong>: I tried some simpler code:

    import difflib

    Li_A = ["potato", "tomato", "squash", "apple", "pear"]
    Q = "pea7"
    A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

& got the same result:

A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried a simpler scoring code:

    import difflib

    S1 = difflib.SequenceMatcher(None, "pea7", "potato")
    R1 = S1.ratio()
    S2 = difflib.SequenceMatcher(None, "pea7", "apple")
    R2 = S2.ratio()

& again I got the same result:

    R1: 0.4
    R2: 0.444

<strong>Edit 2</strong>: I tried it with fuzzywuzzy and got the same result again, since fuzzywuzzy depends on difflib:

    from fuzzywuzzy import fuzz

    R1 = fuzz.ratio("pea7", "potato")
    R2 = fuzz.ratio("pea7", "apple")

Answer 1:

SequenceMatcher is correctly calculating the ratio using the method described by Ratcliff and Metzener (1988). That is, given the number of characters found in common (CC) and the total number of characters in the two strings (CT):

    ratio = 2 * CC / CT
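As a quick check of that formula, here is a minimal sketch that recomputes the score for "pea7" against "potato" by hand. The variable names (`cc`, `ct`, `manual_ratio`) are my own, and counting CC by summing the sizes of `get_matching_blocks()` is my reading of how the documented ratio is built, not something stated in the question.

```python
import difflib

a, b = "pea7", "potato"
matcher = difflib.SequenceMatcher(None, a, b)

# CC: characters covered by the matching blocks.
# The last block returned is a zero-length sentinel, so it adds nothing.
cc = sum(block.size for block in matcher.get_matching_blocks())

# CT: total number of characters in both strings.
ct = len(a) + len(b)

manual_ratio = 2 * cc / ct

print(cc, ct)            # 2 10  ("p" and "a" are matched)
print(manual_ratio)      # 0.4
print(matcher.ratio())   # 0.4, the R1 value from the question
```

So the 0.4 reported for "potato" follows directly from two matched characters out of ten in total.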

So the scores themselves are right, and it looks like the issue is with the order returned by get_close_matches.
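One likely explanation, offered as an assumption based on my reading of CPython's difflib source (check your own version): get_close_matches scores each candidate with the candidate as the first sequence and the query as the second, while the question's scoring code passes them the other way round. The difflib documentation warns that the result of ratio() may depend on the order of the arguments, so the two orderings can disagree. The sketch below compares both argument orders and then ranks candidates with a hypothetical helper, `ranked_matches`, that sorts by the same score it reports.

```python
import difflib

candidates = ["potato", "tomato", "squash", "apple", "pear"]
query = "pea7"

# Argument order used by the question's scoring code: query first.
forward = {c: difflib.SequenceMatcher(None, query, c).ratio() for c in candidates}

# Argument order get_close_matches appears to use (assumption): candidate first.
reverse = {c: difflib.SequenceMatcher(None, c, query).ratio() for c in candidates}


def ranked_matches(word, possibilities, n=5, cutoff=0.1):
    """Hypothetical helper: score every candidate once, then sort by that score."""
    scored = [(difflib.SequenceMatcher(None, word, p).ratio(), p)
              for p in possibilities]
    scored = [(s, p) for s, p in scored if s >= cutoff]
    # Stable sort on the score only, so ties keep their original list order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


ranked = ranked_matches(query, candidates)

print(round(forward["apple"], 3), round(forward["potato"], 3))  # 0.444 0.4
print(round(reverse["apple"], 3), round(reverse["potato"], 3))  # 0.222 0.4
print([p for _, p in ranked])
```

In the question's Spelling function, building the returned Series from `[p for _, p in ranked] + [s for s, _ in ranked]` instead of calling get_close_matches should keep the higher scores to the left, because the ordering and the displayed scores then come from the same calculation.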
