I have a Pandas dataframe as below. What I am trying to do is check whether a station has variable yyy and any other variable on the same day (as in the case of station1). If this is true, I need to delete the whole row containing yyy.
Currently I am doing this using iterrows(): I loop to find the days on which this variable appears, change the variable to something like "delete me", build a new dataframe from the result (because <a href="https://stackoverflow.com/questions/15972264/why-doesnt-this-function-take-after-i-iterrows-over-a-pandas-dataframe" rel="nofollow">pandas doesn't support replacing in place</a>), and filter the new dataframe to get rid of the unwanted rows. This works for now because my dataframes are small, but it is not likely to scale.
<strong>Question:</strong> This seems like a very "non-Pandas" way to do this. Is there some other method of deleting the unwanted rows?
               dateuse   station  variable1
0  2012-08-12 00:00:00  station1        xxx
1  2012-08-12 00:00:00  station1        yyy
2  2012-08-23 00:00:00  station2        aaa
3  2012-08-23 00:00:00  station3        bbb
4  2012-08-25 00:00:00  station4        ccc
5  2012-08-25 00:00:00  station4        ccc
6  2012-08-25 00:00:00  station4        ccc

Answer1:
I might index using a boolean array. We want to delete rows (if I understand what you're after, anyway!) which have yyy and belong to a dateuse/station combination with more than one row. We can use transform to broadcast the size of each dateuse/station combination up to the length of the dataframe, and then select the rows in groups which have length > 1. Then we can & this with a mask of where variable1 is yyy, and keep everything that does not match both conditions.
>>> multiple = df.groupby(["dateuse", "station"])["variable1"].transform(len) > 1
>>> must_be_isolated = df["variable1"] == "yyy"
>>> df[~(multiple & must_be_isolated)]
               dateuse   station  variable1
0  2012-08-12 00:00:00  station1        xxx
2  2012-08-23 00:00:00  station2        aaa
3  2012-08-23 00:00:00  station3        bbb
4  2012-08-25 00:00:00  station4        ccc
5  2012-08-25 00:00:00  station4        ccc
6  2012-08-25 00:00:00  station4        ccc
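For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same idea. The dataframe is rebuilt by hand from the table in the question, and the helper name drop_shared_rows and its parameters are made up for illustration. It also uses transform("size") in place of transform(len), which should give the same per-row group sizes.

```python
import pandas as pd

# Sample data rebuilt by hand from the table in the question.
df = pd.DataFrame(
    {
        "dateuse": pd.to_datetime(
            ["2012-08-12", "2012-08-12", "2012-08-23", "2012-08-23",
             "2012-08-25", "2012-08-25", "2012-08-25"]
        ),
        "station": ["station1", "station1", "station2", "station3",
                    "station4", "station4", "station4"],
        "variable1": ["xxx", "yyy", "aaa", "bbb", "ccc", "ccc", "ccc"],
    }
)


def drop_shared_rows(frame, value="yyy", keys=("dateuse", "station"),
                     column="variable1"):
    """Drop rows holding `value` when their group contains other rows too.

    Hypothetical helper: the name and parameters are not from the original answer.
    """
    # Number of rows in each (dateuse, station) group, repeated for every row.
    group_size = frame.groupby(list(keys))[column].transform("size")
    # True where the row holds the value that must stand alone.
    is_value = frame[column] == value
    # Keep every row that is not both "in a shared group" and "equal to value".
    return frame[~((group_size > 1) & is_value)]


result = drop_shared_rows(df)
print(result)
```

With the sample data, only row 1 (yyy at station1 on 2012-08-12) is dropped. One caveat: this counts rows, not distinct variables, so a group made up only of repeated yyy rows would be dropped entirely. If that matters, counting distinct values per group with transform("nunique") might be closer to what the question asks for.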