When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a dataframe with the following information:
```
    Name   Type  ID
0  Book1  ebook   1
1  Book2  paper   2
2  Book3  paper   3
3  Book1  ebook   1
4  Book2  paper   2
```
if we do

```python
df.groupby(["Name", "Type"]).sum()
```

we get a DataFrame

```
              ID
Name  Type
Book1 ebook    2
Book2 paper    4
Book3 paper    3
```
which contains a MultiIndex with the columns used in the groupby:

```
MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])
```

and one column called `ID`.
but if I apply size(), the result is a Series:

```
Name   Type
Book1  ebook    2
Book2  paper    2
Book3  paper    1
dtype: int64
```
And finally, if I do pct_change(), we get a DataFrame containing only the resulting column:

```
    ID
0  NaN
1  NaN
2  NaN
3  0.0
4  0.0
```
TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when dealing with different operations on the same DataFrame.
The outputs are different because the aggregations are different, and those mostly control what is returned. Think of the array equivalent: the data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input.

```python
import numpy as np

np.array([1, 2, 3]).sum()
#6

np.array([1, 2, 3]).cumsum()
#array([1, 3, 6], dtype=int32)
```
The same thing goes for aggregations of a DataFrameGroupBy object. All the groupby itself does is create a mapping from the DataFrame to the groups. Since this doesn't really do anything on its own, there's no reason why the same groupby with a different operation needs to return the same type of output (see above).

```python
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
```
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
```python
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
```
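For instance, selecting a single column from a DataFrameGroupBy yields a SeriesGroupBy. A minimal sketch, rebuilding the question's DataFrame:

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])   # grouping a DataFrame -> DataFrameGroupBy
print(type(gp).__name__)            # DataFrameGroupBy

sgp = gp["ID"]                      # selecting one column -> SeriesGroupBy
print(type(sgp).__name__)           # SeriesGroupBy
```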
So what happens when you aggregate?
With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have aggregated that too, necessitating the DataFrame output.

```python
gp.sum()
#              ID
#Name  Type
#Book1 ebook    2
#Book2 paper    4
#Book3 paper    3
```
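To illustrate that last point, here's a sketch with a hypothetical extra numeric column `Price` (not in the question's data): both columns get aggregated, so a Series could not hold the result.

```python
import pandas as pd

# The question's data plus a made-up Price column for illustration.
df2 = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
    "Price": [10.0, 20.0, 30.0, 10.0, 20.0],
})

out = df2.groupby(["Name", "Type"]).sum()
print(out)
#              ID  Price
#Name  Type
#Book1 ebook    2   20.0
#Book2 paper    4   40.0
#Book3 paper    3   30.0
```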
On the other hand, if you use a SeriesGroupBy object (select a single column with `[]`), then you'll get a Series back, again indexed by the unique group keys.

```python
df.groupby(["Name", "Type"])['ID'].sum()
#|------- SeriesGroupBy ---------|

#Name   Type
#Book1  ebook    2
#Book2  paper    4
#Book3  paper    3
#Name: ID, dtype: int64
```
For aggregations that return arrays (like pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. That would make little sense here: typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you provided for aggregation. This makes creating these columns very simple, as pandas handles all of the alignment.

```python
df['ID_pct_change'] = gp.pct_change()

#    Name   Type  ID  ID_pct_change
#0  Book1  ebook   1            NaN
#1  Book2  paper   2            NaN
#2  Book3  paper   3            NaN
#3  Book1  ebook   1            0.0   # Calculated from row 0 and aligned.
#4  Book2  paper   2            0.0
```
But what about size? That one is a bit *weird*. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result, pandas will always return a Series. And, being a group-level aggregation that returns a scalar, it makes sense for the return to be indexed by the unique group keys.

```python
gp.size()
#Name   Type
#Book1  ebook    2
#Book2  paper    2
#Book3  paper    1
#dtype: int64
```
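A sketch contrasting size with count (a NaN is added to made-up data here purely for illustration): size counts rows regardless of missing values, which is why it can always be a Series, while count tallies non-missing values per column and so returns a DataFrame.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "Name": ["Book1", "Book1", "Book2"],
    "ID": [1.0, np.nan, 2.0],   # one missing value for illustration
})
g = df3.groupby("Name")

print(g.size())    # rows per group, NaNs included -> Series
#Name
#Book1    2
#Book2    1
#dtype: int64

print(g.count())   # non-NaN values per column -> DataFrame
#       ID
#Name
#Book1   1
#Book2   1
```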
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides transform for these aggregations. Since the intent here is to bring the values back to the original DataFrame, the Series/DataFrame returned by transform is indexed like the original input.

```python
gp.transform('sum')
#   ID
#0   2   # Row 0 is Book1 ebook which has a group sum of 2
#1   4
#2   3
#3   2   # Row 3 is also Book1 ebook which has a group sum of 2
#4   4
```
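For comparison, the merge-based alternative mentioned above might look like this sketch (the `ID_sum` column name is made up for illustration); transform does the same alignment in one step.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

# Aggregate to the unique group keys, then merge back on those keys.
sums = df.groupby(["Name", "Type"])["ID"].sum().rename("ID_sum")
merged = df.merge(sums, left_on=["Name", "Type"], right_index=True)
# merged now carries an ID_sum value aligned to every original row.

# transform gives the same values without the merge bookkeeping.
same = df.groupby(["Name", "Type"])["ID"].transform("sum")
```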
---

From the documentation for size:

> Returns: Series
> Number of rows in each group.
For sum, since you did not pass a single column, it will aggregate the remaining columns and return a DataFrame with the groupby keys as the index. Select a column first and you get a Series instead:

```python
df.groupby(["Name", "Type"])['ID'].sum()  # returns a Series
```

pct_change is not an aggregation, so it returns values with the same index as the original dataframe; sum is an aggregation, so it returns the aggregated values with the groupby keys as the index.
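A quick sketch checking all three of these behaviours on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})
gp = df.groupby(["Name", "Type"])

print(type(gp.size()).__name__)                # Series, indexed by group keys
print(type(gp.sum()).__name__)                 # DataFrame, indexed by group keys
print(gp.pct_change().index.equals(df.index))  # True: keeps the original index
```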