As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:

```
Code | 14 | 17 | 19 | ...
w1 | 0 | 5 | 3 | ...
w2 | 2 | 5 | 4 | ...
w3 | 0 | 0 | 5 | ...
```

The Code corresponds to a given location in a rectangular grid, and the ws are different words. I would like to apply a cosine similarity measure between each pair of columns, but only <strong>(EDITED!)</strong> <strong>if the sum of the items in one of the columns of the pair is greater than 5</strong>.

The desired output would be something like:

```
| [14,17] | [14,19] | [14,...] | [17,19] | ...
Sim |cs(14,17) |cs(14,19) |cs(14,...) |cs(17,19)..| ...
```

cs is the result of the cosine similarity for each pair of columns. Is there any suitable method to do this?

Any help would be appreciated :-)

### Answer1:

To apply the cosine metric to each pair from two collections of inputs, you
could use `scipy.spatial.distance.cdist`. This will be much, much faster than
using a double Python loop.

Let one collection be all the columns of `df`. Let the other collection be only those columns where the sum is greater than 5:

```
import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
```

Then all the pairwise values can be computed with one call to `cdist`. Note that `metric='cosine'` returns the cosine distance, which is 1 minus the cosine similarity:

```
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
```
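Since the question asks for similarities rather than distances, here is a minimal sketch on the same toy data that converts the `cdist` output and cross-checks one entry against the definition of cosine similarity. The names `similarity` and `manual` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
import scipy.spatial.distance as SSD

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
df2 = df.loc[:, df.sum(axis=0) > 5]

# cdist returns cosine *distance*; subtract from 1 to get similarity
distance = SSD.cdist(df2.T, df.T, metric='cosine')
similarity = 1 - distance

# Cross-check the (17, 14) entry directly from the definition
u = df['17'].to_numpy(dtype=float)
v = df['14'].to_numpy(dtype=float)
manual = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(np.isclose(similarity[0, 0], manual))
```

The same `1 - values` conversion can be applied before wrapping the array in a DataFrame if similarities are what you want in the final result.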

The values can be wrapped in a new DataFrame and reshaped:

```
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
```

<hr>

Putting everything together, with one extra step that drops each column's comparison with itself:
```
import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
```

yields the Series of cosine distances

```
17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
```
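The Series above lists the pair (17, 19) in both orders, whereas the question's desired output shows each pair once. A rough sketch of one way to get that layout, assuming a plain loop over `itertools.combinations` is acceptable for the number of columns involved (for many columns the `cdist` approach is likely to be faster). The names `sums` and `sims` are illustrative.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import cosine

df = pd.DataFrame({'14': [0, 2, 0], '17': [5, 5, 0], '19': [3, 4, 5]})
sums = df.sum(axis=0)

sims = {}
for a, b in combinations(df.columns, 2):
    # Keep the pair if at least one of its columns sums to more than 5
    if sums[a] > 5 or sums[b] > 5:
        u = df[a].to_numpy(dtype=float)
        v = df[b].to_numpy(dtype=float)
        # scipy's cosine() is a distance, so convert to similarity
        sims[(a, b)] = 1 - cosine(u, v)

# One row labelled 'Sim', one column per unordered pair
result = pd.Series(sims, name='Sim').to_frame().T
print(result)
```

Cosine similarity is undefined for a column of all zeros, so such columns would need to be filtered out beforehand.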
