
pandas efficiently compress columns into column with lists of tuples

<h3>Question</h3>

I have a DataFrame representing groups of exchanges between account holders. The data shows the interacting accounts and the items exchanged. Sometimes there is a clear match, but sometimes the totals of items exchanged match and you can't tell exactly what amount was exchanged between individuals.

The input and the desired output are as follows:

<pre><code>  id group   rx   tx
0  A     x   50    0
1  B     x    0   50
2  A     y  210    0
3  B     y    0   50
4  C     y    0  350
5  D     y  190    0
</code></pre>

<pre><code>  group                                                     exchanges
0     x                                                  [(B, A, 50)]
1     y  [(unk, A, 210), (B, unk, 50), (C, unk, 350), (unk, D, 190)]
</code></pre>

Currently I'm using 'groupby' and 'apply' like this:

<pre><code>def sort_out(x):
    # create the row to be returned
    y = pd.Series(index=['group','exchanges'])
    y['group'] = x.group.iloc[0]
    y['exchanges'] = []

    # Find all rx and make tuples list
    # determine source and destinations
    sink = [tuple(i) for i in x.loc[x['rx'] != 0][['id', 'rx']].to_records(index=True)]
    source = [tuple(i) for i in x.loc[x['tx'] != 0][['id', 'tx']].to_records(index=True)]

    # find match
    match = []
    for item in source:
        match = [o for o in sink if o[2] == item[2]]
        if len(match):
            y['exchanges'].append((item[1], match[0][1], match[0][2]))
            sink.remove(match[0])
            continue

    # handle the unmatched elements
    tx_el = x.loc[~x['tx'].isin(x['rx'])][['id', 'tx']].to_records(index=True)
    rx_el = x.loc[~x['rx'].isin(x['tx'])][['id', 'rx']].to_records(index=True)
    [y['exchanges'].append((item[1], 'unk', item[2])) for item in tx_el]
    [y['exchanges'].append(('unk', item[1], item[2])) for item in rx_el]

    return y

b = a.groupby('group').apply(lambda x: sort_out(x))
</code></pre>

This approach takes at best 7 hours on ~20 million rows. I think the big hurdle is 'groupby'-'apply'. I was recently introduced to 'explode'. From there I looked at 'melt', but it doesn't seem to be what I'm looking for. Any suggestions for improvements?

[ANOTHER ATTEMPT]

Based on YOBEN_S's suggestions I tried the following. Part of the challenge is matching; part is keeping track of which account is transmitting (tx) and which is receiving (rx). So I cheat by adding an explicit direction tag, i.e. a ['dir'] column. I also use a nested ternary, but I'm not sure whether that is very performant:

<pre><code>a['dir'] = a.apply(lambda x: 't' if x['tx'] != 0 else 'r', axis=1)
a[['rx','tx']] = np.sort(a[['rx','tx']].values, axis=1)
# axis is passed by keyword: recent pandas versions reject drop(labels, 1)
out = a.drop(['group','rx'], axis=1).apply(tuple, axis=1).groupby([a['group'], a.tx]).agg('sum') \
    .apply(lambda x: (x[3], x[0], x[1]) if len(x) == 6
           else ((x[0], 'unk', x[1]) if x[2] == 't' else ('unk', x[0], x[1]))) \
    .groupby(level=0).agg(list)
</code></pre>
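On the performance doubt about the row-wise `apply` for the direction tag: that step can probably be expressed as a single vectorized call. A minimal sketch, assuming the frame is named `a` as above and that every row has exactly one of `rx`/`tx` non-zero (as in the sample data); it has not been timed on the 20-million-row data set:

```python
import numpy as np
import pandas as pd

# Sample frame copied from the question
a = pd.DataFrame({
    "id": ["A", "B", "A", "B", "C", "D"],
    "group": ["x", "x", "y", "y", "y", "y"],
    "rx": [50, 0, 210, 0, 0, 190],
    "tx": [0, 50, 0, 50, 350, 0],
})

# 't' where the row transmits (tx != 0), otherwise 'r';
# evaluated on the whole column at once instead of row by row
a["dir"] = np.where(a["tx"] != 0, "t", "r")
print(a)
```

`np.where` evaluates the condition on the whole column, so it avoids calling a Python lambda once per row.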
<h3>Answer1:</h3>

We can try

<pre><code>out = df.drop('group', axis=1).apply(tuple, axis=1).groupby(df['group']).agg(list).to_frame('exchange').reset_index()

  group                                           exchange
0     x                           [(A, 50, 0), (B, 0, 50)]
1     y  [(A, 210, 0), (B, 0, 50), (C, 0, 350), (D, 190...
</code></pre>

Update

<pre><code>df[['rx','tx']] = np.sort(df[['rx','tx']].values, axis=1)
out = df.drop(['group','rx'], axis=1).apply(list, axis=1).groupby([df['group'], df.tx]).agg('sum').apply(set).groupby(level=0).agg(list)
out

group
x                               [{50, A, B}]
y    [{50, B}, {D, 190}, {210, A}, {C, 350}]
dtype: object
</code></pre>
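The sets in this output do not record who sent and who received, and they carry no `unk` placeholder. One possible way to get the `(source, destination, amount)` tuples from the question without `groupby`-`apply` is to split the frame into senders and receivers and merge them. This is only a sketch: the pairing rule (equal amount within the same group, paired one-to-one in row order) is an assumption inferred from the loop in the question, the names `senders`, `receivers`, `pairs` and `n` are illustrative, and it has not been benchmarked on 20 million rows.

```python
import pandas as pd

# Sample frame copied from the question
df = pd.DataFrame({
    "id": ["A", "B", "A", "B", "C", "D"],
    "group": ["x", "x", "y", "y", "y", "y"],
    "rx": [50, 0, 210, 0, 0, 190],
    "tx": [0, 50, 0, 50, 350, 0],
})

# Rows that transmit and rows that receive, with a shared 'amount' column
senders = (df.loc[df["tx"] != 0, ["group", "id", "tx"]]
             .rename(columns={"id": "src", "tx": "amount"}))
receivers = (df.loc[df["rx"] != 0, ["group", "id", "rx"]]
               .rename(columns={"id": "dst", "rx": "amount"}))

# Number repeated amounts inside a group so each sender pairs with
# at most one receiver (assumed rule: pair in row order)
senders["n"] = senders.groupby(["group", "amount"]).cumcount()
receivers["n"] = receivers.groupby(["group", "amount"]).cumcount()

# Outer merge keeps unmatched rows; the missing side becomes 'unk'
pairs = senders.merge(receivers, on=["group", "amount", "n"], how="outer")
pairs[["src", "dst"]] = pairs[["src", "dst"]].fillna("unk")

# Build the tuples column-wise, then collect one list per group
pairs["exchange"] = list(zip(pairs["src"], pairs["dst"], pairs["amount"]))
exchanges = (pairs.groupby("group")["exchange"]
                  .agg(list)
                  .reset_index(name="exchanges"))
print(exchanges)
```

The order of the tuples inside each list follows the merge result, so it may differ from the order shown in the question.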

Source: https://stackoverflow.com/questions/62247756/pandas-efficiently-compress-columns-into-column-with-lists-of-tuples
