
Why do groupby operations behave differently?

<h3>Question</h3>

When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.

Say we have a dataframe with the following information:

<pre class="lang-py prettyprint-override">    Name   Type  ID
0  Book1  ebook   1
1  Book2  paper   2
2  Book3  paper   3
3  Book1  ebook   1
4  Book2  paper   2</pre>

If we do

<pre class="lang-py prettyprint-override">df.groupby(["Name", "Type"]).sum()</pre>

we get a DataFrame:

<pre class="lang-py prettyprint-override">             ID
Name  Type
Book1 ebook   2
Book2 paper   4
Book3 paper   3</pre>

which contains a MultiIndex built from the columns used in the groupby:

<pre class="lang-py prettyprint-override">MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])</pre>

and one column called ID.

But if I call size() instead, the result is a Series:

<pre class="lang-py prettyprint-override">Name   Type
Book1  ebook    2
Book2  paper    2
Book3  paper    1
dtype: int64</pre>

Finally, if I call pct_change(), I get only the resulting DataFrame column, indexed like the original rows:

<pre class="lang-py prettyprint-override">    ID
0  NaN
1  NaN
2  NaN
3  0.0
4  0.0</pre>

TL;DR: I want to know why some functions return a Series while others return a DataFrame. This confused me when combining different operations on the same DataFrame.


<h3>Answer1:</h3>

The outputs are different because the operations are different, and the operation is what mostly controls what is returned. Think of the array equivalent: the data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input.

<pre class="lang-py prettyprint-override">import numpy as np

np.array([1, 2, 3]).sum()
# 6

np.array([1, 2, 3]).cumsum()
# array([1, 3, 6], dtype=int32)</pre>

The same thing goes for operations on a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the rows of the DataFrame to the groups. Since this step doesn't compute anything yet, there's no reason why the same groupby followed by a different operation needs to return the same type of output (see above).

gp = df.groupby(["Name", "Type"]) # Haven't done any aggregations yet...
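
To make the "mapping" idea concrete, here is a minimal sketch that rebuilds the question's DataFrame and inspects the groupby object before anything is aggregated. The variable names are only illustrative; `groups` and `ngroups` are existing attributes of pandas groupby objects.

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])

# Number of distinct (Name, Type) combinations.
print(gp.ngroups)

# Each group key maps to the row labels that belong to that group.
for key, row_labels in gp.groups.items():
    print(key, list(row_labels))
```

Nothing has been summed or counted at this point; the object only records which rows belong to which key.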

The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.

gp  # <pandas.core.groupby.generic.DataFrameGroupBy object>

<hr />

So what happens when you aggregate?

With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses each group to a single value, the return will be a DataFrame whose index holds the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating the DataFrame output.

<pre class="lang-py prettyprint-override">gp.sum()
#              ID
# Name  Type
# Book1 ebook   2
# Book2 paper   4
# Book3 paper   3</pre>
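
As a small illustration of the "another numeric column" point, the sketch below adds a made-up `Pages` column (hypothetical, not part of the original data) and sums again. Both numeric columns come back, which is why a DataFrame is the natural return type.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

# "Pages" is a hypothetical extra numeric column, added only for illustration.
df2 = df.assign(Pages=[100, 200, 300, 100, 200])

summed = df2.groupby(["Name", "Type"]).sum()

print(type(summed).__name__)   # the container type of the result
print(list(summed.columns))    # every non-key numeric column is aggregated
print(summed)
```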

On the other hand if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again with the index of unique group keys.

<pre class="lang-py prettyprint-override">df.groupby(["Name", "Type"])['ID'].sum()
# |-------- SeriesGroupBy ---------|
# Name   Type
# Book1  ebook    2
# Book2  paper    4
# Book3  paper    3
# Name: ID, dtype: int64</pre>
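
The two cases can be compared side by side. This is a sketch using the question's data; it also shows that selecting with a list (`[["ID"]]`) keeps a DataFrameGroupBy and therefore a DataFrame result.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

frame_gp = df.groupby(["Name", "Type"])          # DataFrameGroupBy
series_gp = df.groupby(["Name", "Type"])["ID"]   # SeriesGroupBy

frame_result = frame_gp.sum()
series_result = series_gp.sum()

# Selecting with a list of columns keeps the DataFrame flavour.
kept_frame = df.groupby(["Name", "Type"])[["ID"]].sum()

print(type(frame_gp).__name__, "->", type(frame_result).__name__)
print(type(series_gp).__name__, "->", type(series_result).__name__)
print("[['ID']] ->", type(kept_frame).__name__)

# Both results are indexed by the same unique group keys.
print(frame_result.index.equals(series_result.index))
```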

For operations that return one value per row rather than one per group (like cumsum, pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys, because that would make little sense: typically you'd want to do a calculation within each group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you grouped. This makes creating these columns very simple, as pandas handles all of the alignment:

<pre class="lang-py prettyprint-override">df['ID_pct_change'] = gp.pct_change()
#     Name   Type  ID  ID_pct_change
# 0  Book1  ebook   1            NaN
# 1  Book2  paper   2            NaN
# 2  Book3  paper   3            NaN
# 3  Book1  ebook   1            0.0  # Calculated from row 0 and aligned.
# 4  Book2  paper   2            0.0</pre>

<hr />

But what about size? That one is a bit <em>weird</em>. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so whether you call it on a DataFrameGroupBy or a SeriesGroupBy object is irrelevant. As a result, pandas returns a Series either way (at least with the default as_index=True). Again, being a group-level aggregation that returns a scalar, it makes sense to have the return indexed by the unique group keys.

<pre class="lang-py prettyprint-override">gp.size()
# Name   Type
# Book1  ebook    2
# Book2  paper    2
# Book3  paper    1
# dtype: int64</pre>

<hr />
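
To illustrate why size does not depend on the selected column or on missing values, the sketch below uses a variant of the question's data in which one ID is replaced by NaN (an assumption made purely for the example) and compares size with count.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, but row 1 has a missing ID (illustrative).
df_nan = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1.0, np.nan, 3.0, 1.0, 2.0],
})

gp_nan = df_nan.groupby(["Name", "Type"])

sizes = gp_nan.size()                 # rows per group, missing values included
sizes_from_column = gp_nan["ID"].size()
counts = gp_nan["ID"].count()         # non-missing values per group

print(sizes)
print(counts)
```

Here `size` reports two rows for the Book2/paper group while `count` reports one, because only `count` skips the missing ID.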

Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input:

<pre class="lang-py prettyprint-override">gp.transform('sum')
#    ID
# 0   2  # Row 0 is Book1 ebook which has a group sum of 2
# 1   4
# 2   3
# 3   2  # Row 3 is also Book1 ebook which has a group sum of 2
# 4   4</pre>
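
The two routes mentioned above (merging the aggregated values back versus using transform) can be sketched as follows. The column name `ID_sum` is made up for the example; both routes should produce the same per-row values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])

# Route 1: transform returns a result indexed like df, so it assigns directly.
by_transform = df.assign(ID_sum=gp["ID"].transform("sum"))

# Route 2: aggregate to one row per group, then merge back on the keys.
group_sums = gp["ID"].sum().rename("ID_sum").reset_index()
by_merge = df.merge(group_sums, on=["Name", "Type"], how="left")

print(by_transform)
print(by_merge)
```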
<h3>Answer2:</h3>

From the documentation for size:

<blockquote> Returns: Series. Number of rows in each group. </blockquote>

<hr />

For sum, since you did not select a single column before summing, it returns a DataFrame, with the groupby keys as the index rather than as columns. Selecting one column first returns a Series:

df.groupby(["Name", "Type"])['ID'].sum() # returns a Series

<hr />

Functions like diff and pct_change are not aggregations: they return values with the same index as the original DataFrame. count, mean and sum are aggregations: they return one value per group, with the groupby keys as the index.
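
A quick way to check which family a method belongs to is to compare the index of its result with the index of the original DataFrame. This is only a sketch over a handful of methods, not an exhaustive classification.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])["ID"]

# True means the result is indexed like the original rows;
# False means it is indexed by the group keys.
matches = {}
for name in ["sum", "mean", "count", "diff", "pct_change", "cumsum"]:
    result = getattr(gp, name)()
    matches[name] = result.index.equals(df.index)
    print(f"{name:<10} indexed like original rows: {matches[name]}")
```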

Source: https://stackoverflow.com/questions/61810108/why-does-groupby-operations-behave-differently
