
Question:
I have a Pandas dataframe as below. What I am trying to do is check if a station has variable yyy and any other variable on the same day (as in the case of station1). If this is true I need to delete the whole row containing yyy.
Currently I am doing this using iterrows(): I loop to find the days on which this variable appears, change the variable to something like "delete me", build a new dataframe from this (because <a href="https://stackoverflow.com/questions/15972264/why-doesnt-this-function-take-after-i-iterrows-over-a-pandas-dataframe" rel="nofollow">pandas doesn't support replacing in place</a>), and filter the new dataframe to get rid of the unwanted rows. This works now because my dataframes are small, but it is not likely to scale.
<strong>Question:</strong> This seems like a very "non-Pandas" way to do this. Is there some other method of deleting the unwanted variables?
               dateuse   station variable1
0  2012-08-12 00:00:00  station1       xxx
1  2012-08-12 00:00:00  station1       yyy
2  2012-08-23 00:00:00  station2       aaa
3  2012-08-23 00:00:00  station3       bbb
4  2012-08-25 00:00:00  station4       ccc
5  2012-08-25 00:00:00  station4       ccc
6  2012-08-25 00:00:00  station4       ccc
Answer1: I might index using a boolean array. We want to delete rows (if I understand what you're after, anyway!) which contain yyy and belong to a dateuse/station combination that has more than one row.
We can use transform to broadcast the size of each dateuse/station combination up to the length of the dataframe, and then flag the rows in groups which have length > 1. Then we can & this with a mask of where the yyys are, and negate the result with ~ to keep every other row.
>>> multiple = df.groupby(["dateuse", "station"])["variable1"].transform(len) > 1
>>> must_be_isolated = df["variable1"] == "yyy"
>>> df[~(multiple & must_be_isolated)]
               dateuse   station variable1
0  2012-08-12 00:00:00  station1       xxx
2  2012-08-23 00:00:00  station2       aaa
3  2012-08-23 00:00:00  station3       bbb
4  2012-08-25 00:00:00  station4       ccc
5  2012-08-25 00:00:00  station4       ccc
6  2012-08-25 00:00:00  station4       ccc
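For anyone who wants to try this end to end, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the sample in the question, and transform("size") is swapped in for transform(len) as an equivalent way to get the group size. The second mask, using transform("nunique"), is my own variation and not part of the answer above: it assumes you only want to drop yyy when a different variable shares the same day and station, so a group made up solely of repeated yyy rows would be kept.

```python
import pandas as pd

# Sample data rebuilt by hand from the question.
df = pd.DataFrame(
    {
        "dateuse": pd.to_datetime(
            ["2012-08-12", "2012-08-12", "2012-08-23", "2012-08-23",
             "2012-08-25", "2012-08-25", "2012-08-25"]
        ),
        "station": ["station1", "station1", "station2", "station3",
                    "station4", "station4", "station4"],
        "variable1": ["xxx", "yyy", "aaa", "bbb", "ccc", "ccc", "ccc"],
    }
)

groups = df.groupby(["dateuse", "station"])["variable1"]

# True for every row whose dateuse/station group has more than one row.
multiple = groups.transform("size") > 1

# True for every row holding the variable that must stand alone.
must_be_isolated = df["variable1"] == "yyy"

# Keep everything except yyy rows that share a group with other rows.
result = df[~(multiple & must_be_isolated)]
print(result)

# Variation (an assumption about the intent, not from the answer above):
# only drop yyy when the group contains more than one distinct variable.
mixed = groups.transform("nunique") > 1
result_distinct = df[~(mixed & must_be_isolated)]
```

On the sample data both masks drop the same single row (index 1); they would differ only if some day/station group contained yyy more than once and nothing else.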