
Pandas DataFrame: find intervals and count occurrences

I have a list of events in which the same event can occur in several separate runs. For instance, event1 might occur three times in a row, then another event occurs, and later on event1 occurs again.

What I need is the intervals for each event and the number of occurrences of that event in those intervals.

    import pandas as pd

    values = {
        '2017-11-28 11:00': 'event1',
        '2017-11-28 11:01': 'event1',
        '2017-11-28 11:02': 'event1',
        '2017-11-28 11:03': 'event2',
        '2017-11-28 11:04': 'event2',
        '2017-11-28 11:05': 'event1',
        '2017-11-28 11:06': 'event1',
        '2017-11-28 11:07': 'event1',
        '2017-11-28 11:08': 'event3',
        '2017-11-28 11:09': 'event3',
        '2017-11-28 11:10': 'event2',
    }

    df = pd.DataFrame.from_dict(values, orient='index').reset_index()
    df.columns = ['time', 'event']
    df['time'] = df['time'].apply(pd.to_datetime)
    df.set_index('time', inplace=True)
    df.sort_index(inplace=True)
    df.head()

The expected result is:

    occurrences = [
        {'start': '2017-11-28 11:00', 'end': '2017-11-28 11:02', 'event': 'event1', 'count': 3},
        {'start': '2017-11-28 11:03', 'end': '2017-11-28 11:04', 'event': 'event2', 'count': 2},
        {'start': '2017-11-28 11:05', 'end': '2017-11-28 11:07', 'event': 'event1', 'count': 3},
        {'start': '2017-11-28 11:08', 'end': '2017-11-28 11:09', 'event': 'event3', 'count': 2},
        {'start': '2017-11-28 11:10', 'end': '2017-11-28 11:10', 'event': 'event2', 'count': 1},
    ]

I was thinking of using pd.merge_asof to find the start/end times of the intervals and then pd.cut for the groupby and count, but I'm stuck. Any help is appreciated.

Answer1:

Try the following approach:

    In [68]: x = df.reset_index()

    In [69]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
        ...:   .apply(lambda g:
        ...:          pd.DataFrame({
        ...:              'start': [g['time'].min()],
        ...:              'end': [g['time'].max()],
        ...:              'event': [g['event'].iloc[0]],
        ...:              'count': [len(g)]})
        ...:   )
        ...:   .reset_index(drop=True)
        ...:   .to_dict('records')
        ...: )
    Out[69]:
    [{'count': 3, 'end': Timestamp('2017-11-28 11:02:00'), 'event': 'event1', 'start': Timestamp('2017-11-28 11:00:00')},
     {'count': 2, 'end': Timestamp('2017-11-28 11:04:00'), 'event': 'event2', 'start': Timestamp('2017-11-28 11:03:00')},
     {'count': 3, 'end': Timestamp('2017-11-28 11:07:00'), 'event': 'event1', 'start': Timestamp('2017-11-28 11:05:00')},
     {'count': 2, 'end': Timestamp('2017-11-28 11:09:00'), 'event': 'event3', 'start': Timestamp('2017-11-28 11:08:00')},
     {'count': 1, 'end': Timestamp('2017-11-28 11:10:00'), 'event': 'event2', 'start': Timestamp('2017-11-28 11:10:00')}]

or the following if you want the time columns as strings:

    In [75]: (x.groupby(x.event.ne(x.event.shift()).cumsum())
        ...:   .apply(lambda g:
        ...:          pd.DataFrame({
        ...:              'start': [g['time'].min().strftime('%Y-%m-%d %H:%M:%S')],
        ...:              'end': [g['time'].max().strftime('%Y-%m-%d %H:%M:%S')],
        ...:              'event': [g['event'].iloc[0]],
        ...:              'count': [len(g)]})
        ...:   )
        ...:   .reset_index(drop=True)
        ...:   .to_dict('records')
        ...: )
    Out[75]:
    [{'count': 3, 'end': '2017-11-28 11:02:00', 'event': 'event1', 'start': '2017-11-28 11:00:00'},
     {'count': 2, 'end': '2017-11-28 11:04:00', 'event': 'event2', 'start': '2017-11-28 11:03:00'},
     {'count': 3, 'end': '2017-11-28 11:07:00', 'event': 'event1', 'start': '2017-11-28 11:05:00'},
     {'count': 2, 'end': '2017-11-28 11:09:00', 'event': 'event3', 'start': '2017-11-28 11:08:00'},
     {'count': 1, 'end': '2017-11-28 11:10:00', 'event': 'event2', 'start': '2017-11-28 11:10:00'}]

Answer2:

Here are two solutions. The first one is based on the link provided by vivek-harikrishnan. It assigns consecutive numbers to the intervals and cumulatively counts the occurrences within each interval.

    #%% first solution
    # create intervals and count occurrences per interval
    df['interval'] = (df['event'] != df['event'].shift(1)).astype(int).cumsum()
    df['count'] = df.groupby(['event', 'interval']).cumcount() + 1
    # now group by intervals
    df.groupby('interval').last()
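To see why the interval numbering above works, it can help to look at the intermediate values on a tiny example. The following is only an illustrative sketch; the series `events` and the names `is_new_run` and `run_id` are made up for this demonstration and are not part of the original answers.

```python
import pandas as pd

# Hypothetical toy data: event1 appears in two separate runs
events = pd.Series(
    ["event1", "event1", "event2", "event1", "event3", "event3"],
    name="event",
)

# True wherever a row differs from the previous row; the first row is
# compared against NaN and is therefore always flagged as a new run
is_new_run = events.ne(events.shift())

# The running total of those flags gives every consecutive run its own id
run_id = is_new_run.cumsum()

print(pd.DataFrame({"event": events, "is_new_run": is_new_run, "run_id": run_id}))
```

Because the id increases at every change, the two separate runs of event1 end up with different ids, which is what keeps them apart in the later groupby.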

The second solution is based on the answer above given by maxu. Like the first, it numbers the intervals, but it also finds the start and end timestamps of each interval.

    #%% second solution
    df = df.reset_index()
    # create intervals: one group per run of consecutive identical events
    groups = df.groupby(df['event'].ne(df['event'].shift()).cumsum())
    # calc start/end times and count occurrences at the same time
    groups.apply(lambda x: pd.DataFrame({
        'start': [x['time'].min()],
        'end': [x['time'].max()],
        'event': [x['event'].iloc[0]],
        'count': [len(x)]})).reset_index(drop=True)
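As a possible variation on the same idea, the per-group apply could be swapped for named aggregation, which avoids building a small DataFrame for every group. This is a sketch rather than part of either answer: it assumes pandas 0.25 or newer (where named aggregation is available), and the names `run_id`, `occurrences` and `records` are illustrative.

```python
import pandas as pd

values = {
    '2017-11-28 11:00': 'event1', '2017-11-28 11:01': 'event1',
    '2017-11-28 11:02': 'event1', '2017-11-28 11:03': 'event2',
    '2017-11-28 11:04': 'event2', '2017-11-28 11:05': 'event1',
    '2017-11-28 11:06': 'event1', '2017-11-28 11:07': 'event1',
    '2017-11-28 11:08': 'event3', '2017-11-28 11:09': 'event3',
    '2017-11-28 11:10': 'event2',
}

# Same data as in the question, with 'time' kept as a regular column
df = pd.DataFrame({
    'time': pd.to_datetime(list(values)),
    'event': list(values.values()),
}).sort_values('time').reset_index(drop=True)

# One id per run of consecutive identical events (illustrative name)
run_id = df['event'].ne(df['event'].shift()).cumsum().rename('run_id')

# Named aggregation: each keyword becomes an output column
occurrences = (
    df.groupby(run_id)
      .agg(
          start=('time', 'min'),
          end=('time', 'max'),
          event=('event', 'first'),
          count=('event', 'size'),
      )
      .reset_index(drop=True)
)

# List of dicts, in the shape the question asks for (times as Timestamps)
records = occurrences.to_dict('records')
```

If the times are needed as strings, `occurrences['start'].dt.strftime('%Y-%m-%d %H:%M')` (and the same for `end`) can be applied before calling `to_dict`.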
