Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

    Code | 14 | 17 | 19 | ...
    w1   |  0 |  5 |  3 | ...
    w2   |  2 |  5 |  4 | ...
    w3   |  0 |  0 |  5 | ...

The Code corresponds to a particular location in a rectangular grid, and the ws are different words. I would like to apply a cosine similarity measure between each pair of columns, but only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.

The desired output would be something like:

        | [14,17]   | [14,19]   | [14,...]   | [17,19]   | ...
    Sim | cs(14,17) | cs(14,19) | cs(14,...) | cs(17,19) | ...

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)

Answer1:

To apply the cosine metric to each pair from two collections of inputs, you could use scipy.spatial.distance.cdist. This will be much faster than a double Python loop.
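For comparison, the double loop that cdist replaces might look roughly like the sketch below. The small frame mirrors the table in the question, and the name `loop_result` is purely illustrative. It is a baseline for checking results, not a recommended approach.

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Illustrative frame mirroring the table in the question.
df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})

sums = df.sum(axis=0)
loop_result = {}
for a in df.columns:
    # Only start a pair from a column whose sum exceeds 5.
    if sums[a] <= 5:
        continue
    for b in df.columns:
        if a == b:
            continue
        # scipy's `cosine` returns the cosine distance, i.e. 1 - similarity.
        loop_result[(a, b)] = cosine(df[a].to_numpy(), df[b].to_numpy())

for pair, dist in loop_result.items():
    print(pair, round(dist, 6))
```

Each iteration makes a separate Python-level call, which is the overhead a single vectorized cdist call avoids.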

Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:

    import pandas as pd

    df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
    mask = df.sum(axis=0) > 5
    df2 = df.loc[:, mask]

Then all the cosine distances (1 minus the cosine similarity) can be computed with one call to cdist:

    import scipy.spatial.distance as SSD

    values = SSD.cdist(df2.T, df.T, metric='cosine')
    # array([[  2.92893219e-01,   1.11022302e-16,   3.00000000e-01],
    #        [  4.34314575e-01,   3.00000000e-01,   1.11022302e-16]])
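If similarities rather than distances are wanted, subtracting from 1 converts them. A minimal sketch, reusing the example frame; the names `distances` and `sim` are illustrative and not part of the original answer:

```python
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]

# metric='cosine' gives distances; similarity = 1 - distance.
distances = SSD.cdist(df2.T, df.T, metric='cosine')
sim = pd.DataFrame(1.0 - distances, index=df2.columns, columns=df.columns)

print(sim.round(6))
```

Rows are the filtered columns (sum greater than 5) and columns are all columns of df, so a column paired with itself shows a similarity of 1.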

The values can be wrapped in a new DataFrame and reshaped:

    result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
    result = result.stack()

Putting it all together,

    import pandas as pd
    import scipy.spatial.distance as SSD

    df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})

    # keep only the columns whose sum is greater than 5
    mask = df.sum(axis=0) > 5
    df2 = df.loc[:, mask]

    # cosine distance between every filtered column and every column
    values = SSD.cdist(df2.T, df.T, metric='cosine')
    result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
    result = result.stack()

    # drop the pairs that compare a column with itself
    mask = result.index.get_level_values(0) != result.index.get_level_values(1)
    result = result.loc[mask]
    print(result)

yields the Series

    17  14    0.292893
        19    0.300000
    19  14    0.434315
        17    0.300000
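The Series above lists both (17, 19) and (19, 17), whereas the question asks for one row with each pair appearing once. One possible way to get that layout is sketched below, assuming "one of the columns" means at least one column of the pair has a sum greater than 5. The names `sim`, `pairs` and `wide` are illustrative.

```python
from itertools import combinations

import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
sums = df.sum(axis=0)

# Square matrix of cosine similarities (1 - cosine distance).
sim = pd.DataFrame(1.0 - SSD.cdist(df.T, df.T, metric='cosine'),
                   index=df.columns, columns=df.columns)

# Keep each unordered pair once, and only if at least one column sums to > 5.
pairs = [(a, b) for a, b in combinations(df.columns, 2)
         if sums[a] > 5 or sums[b] > 5]

# One row labelled 'Sim', one column per pair, as in the desired output.
wide = pd.DataFrame(
    [[sim.loc[a, b] for a, b in pairs]],
    index=['Sim'],
    columns=[f'[{a},{b}]' for a, b in pairs],
)
print(wide.round(6))
```

This computes the full square matrix first, which is simple but does more work than necessary when few columns pass the filter. Restricting the first argument of cdist to the filtered columns, as in the answer, avoids that.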
