Merge after groupby

Question:

I'm having trouble using pd.merge after groupby. Here's my hypothetical:

    import pandas as pd
    from pandas import DataFrame
    import numpy as np

    df1 = DataFrame({'key': [1,1,2,2,3,3],
                     'var11': np.random.randn(6),
                     'var12': np.random.randn(6)})
    df2 = DataFrame({'key': [1,2,3],
                     'var21': np.random.randn(3),
                     'var22': np.random.randn(3)})

    # group var11 in df1 by key
    grouped = df1['var11'].groupby(df1['key'])

    # calculate the mean of var11 by key
    grouped = grouped.mean()

    print(grouped)
    key
    1    1.399430
    2    0.568216
    3   -0.612843
    dtype: float64

    print(grouped.index)
    Int64Index([1, 2, 3], dtype='int64')

    print(df2)
       key     var21     var22
    0    1 -0.381078  0.224325
    1    2  0.836719 -0.565498
    2    3  0.323412 -1.616901

    df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)

At this point, I get IndexError: list index out of range.

When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?

Answer1:

I think you should just do this:

    df2 = pd.merge(df2, pd.DataFrame(grouped, columns=['mean']),
                   left_on='key', right_index=True)
    print(df2)
       key     var21     var22      mean
    0    1  0.324476  0.701254  0.400313
    1    2 -1.270500  0.055383 -0.293691
    2    3  0.804864  0.566747  0.628787

    [3 rows x 4 columns]

The reason it didn't work is that grouped is a Series, not a DataFrame, so it has to be converted to a DataFrame before it is passed to pd.merge.
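For anyone on a more recent pandas, here is a minimal sketch of the same idea using Series.to_frame to do the conversion. The fixed numbers (instead of np.random.randn) and the names `means` and `merged` are my own assumptions, chosen only so the result is easy to check by hand; they are not from the question.

```python
import pandas as pd

# Small fixed data instead of random values (an assumption, so the means are easy to verify).
df1 = pd.DataFrame({'key': [1, 1, 2, 2, 3, 3],
                    'var11': [1.0, 3.0, 2.0, 4.0, 5.0, 7.0]})
df2 = pd.DataFrame({'key': [1, 2, 3],
                    'var21': [0.1, 0.2, 0.3]})

# Mean of var11 per key; the result is a Series indexed by key.
grouped = df1.groupby('key')['var11'].mean()

# Turn the Series into a one-column DataFrame called 'mean'.
means = grouped.to_frame(name='mean')

# Match df2's 'key' column against the index of the means frame.
merged = pd.merge(df2, means, left_on='key', right_index=True)
print(merged)
```

Naming the column explicitly with to_frame(name=...) means you don't depend on whatever name the grouped Series happens to carry.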
