Assignment of Pandas DataFrame with float32 and float64 slow

Question:

Assigning to a Pandas DataFrame is rather slow for some combinations of float32 and float64 dtypes, at least the way I do it.

The code below sets up a DataFrame, runs a NumPy/SciPy computation on part of the data, creates a new DataFrame by copying the old one, and assigns the result of the computation to the new DataFrame:

    import pandas as pd
    import numpy as np
    from scipy.signal import lfilter

    N = 1000
    M = 1000

    def f(dtype1, dtype2):
        coi = [str(m) for m in range(M)]
        df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                          columns=coi + ['A', 'B'], dtype=dtype1)
        Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
        Y = Y.astype(dtype2)
        new = pd.DataFrame(df, copy=True)
        print(new.iloc[0, 0].dtype)
        print(Y.dtype)
        new.ix[:, coi] = Y  # This statement is considerably slow
        print(new.iloc[0, 0].dtype)

    from time import time

    dtypes = [np.float32, np.float64]
    for dtype1 in dtypes:
        for dtype2 in dtypes:
            print('-' * 10)
            start_time = time()
            f(dtype1, dtype2)
            print(time() - start_time)

The timing result is:

    ----------
    float32
    float32
    float64
    10.1998147964
    ----------
    float32
    float64
    float64
    10.2371120453
    ----------
    float64
    float32
    float64
    0.864870071411
    ----------
    float64
    float64
    float64
    0.866265058517

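Here the critical line is new.ix[:, coi] = Y: the run is more than ten times slower (about 10.2 s versus 0.87 s) when the DataFrame starts out as float32 than when it starts out as float64, regardless of the dtype of Y.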

I can understand that there needs to be some overhead for reallocation when a float32 DataFrame is assigned float64 values. But why is the overhead so dramatic?

Furthermore, assigning float32 values to a float32 DataFrame is also slow, and the result is float64, which also bothers me.
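The timings above cover the whole function, including building the DataFrame and the filtering step. A minimal sketch of how the assignment alone might be timed is given here. It rests on several assumptions that are not in the original code: `.loc` replaces `.ix` (which was removed in pandas 1.0), a simple doubling stands in for the `lfilter` call, and the helper names `build_frame` and `time_assignment` as well as the smaller default sizes are hypothetical. Results will depend on the pandas version, so the numbers are not expected to match the ones reported above.

```python
import time

import numpy as np
import pandas as pd


def build_frame(n_rows, n_cols, dtype):
    """Return a frame with n_cols numeric columns of `dtype` plus two string columns."""
    coi = [str(m) for m in range(n_cols)]
    values = np.tile(np.arange(n_cols, dtype=dtype), (n_rows, 1))
    df = pd.DataFrame(values, columns=coi)
    df["A"] = "Hello"
    df["B"] = "World"
    return df, coi


def time_assignment(dtype1, dtype2, n_rows=200, n_cols=200):
    """Time only the multi-column assignment, excluding setup and computation."""
    df, coi = build_frame(n_rows, n_cols, dtype1)
    # Stand-in for the lfilter call: any computation that returns a 2-D array.
    Y = (df[coi].to_numpy() * 2).astype(dtype2)
    new = df.copy()
    start = time.perf_counter()
    new.loc[:, coi] = Y  # assumption: .loc used in place of the removed .ix
    elapsed = time.perf_counter() - start
    return elapsed, new


if __name__ == "__main__":
    for dtype1 in (np.float32, np.float64):
        for dtype2 in (np.float32, np.float64):
            elapsed, new = time_assignment(dtype1, dtype2)
            # Report source dtype, assigned dtype, resulting dtype, and elapsed seconds.
            print(dtype1.__name__, dtype2.__name__, new["0"].dtype, elapsed)
```

Timing with `time.perf_counter()` around the single statement separates the cost of the assignment from the cost of constructing the frame, which the original measurement lumps together.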

Answer1:

Single-column assignment does not change the dtype unexpectedly: the column ends up with the dtype of the assigned values. Iterating over the columns with a for-loop is reasonably fast when no type casting is involved, for both float32 and float64. When the assignment involves a type cast, however, it takes about twice as long as the slowest multi-column assignment (roughly 21 s versus 10 s).

    import pandas as pd
    import numpy as np
    from scipy.signal import lfilter

    N = 1000
    M = 1000

    def f(dtype1, dtype2):
        coi = [str(m) for m in range(M)]
        df = pd.DataFrame([[m for m in range(M)] + ['Hello', 'World'] for n in range(N)],
                          columns=coi + ['A', 'B'], dtype=dtype1)
        Y = lfilter([1], [0.5, 0.5], df.ix[:, coi])
        Y = Y.astype(dtype2)
        new = df.copy()
        print(new.iloc[0, 0].dtype)
        print(Y.dtype)
        for n, column in enumerate(coi):  # For-loop over columns new!
            new.ix[:, column] = Y[:, n]
        print(new.iloc[0, 0].dtype)

    from time import time

    dtypes = [np.float32, np.float64]
    for dtype1 in dtypes:
        for dtype2 in dtypes:
            print('-' * 10)
            start_time = time()
            f(dtype1, dtype2)
            print(time() - start_time)

The result is:

    ----------
    float32
    float32
    float32
    0.809890985489
    ----------
    float32
    float64
    float64
    21.4767119884
    ----------
    float64
    float32
    float32
    20.5611870289
    ----------
    float64
    float64
    float64
    0.765362977982
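For readers on a pandas version without `.ix`, the column-by-column loop can be sketched with plain column assignment instead. This is an illustration under stated assumptions, not the code that produced the timings above: `new[column] = ...` is swapped in for `new.ix[:, column] = ...`, the helper name `assign_columnwise` is hypothetical, and the small arrays are made up for the demonstration.

```python
import numpy as np
import pandas as pd


def assign_columnwise(df, columns, Y):
    """Return a copy of df whose `columns` are replaced, one at a time, by the columns of Y."""
    new = df.copy()
    for n, column in enumerate(columns):
        # Plain column assignment replaces the column, so it takes Y's dtype.
        new[column] = Y[:, n]
    return new


if __name__ == "__main__":
    coi = [str(m) for m in range(5)]
    df = pd.DataFrame(np.ones((3, 5), dtype=np.float64), columns=coi)
    df["A"] = "Hello"
    Y = np.zeros((3, 5), dtype=np.float32)
    new = assign_columnwise(df, coi, Y)
    # Inspect every column's dtype rather than a single element.
    print(new.dtypes)
```

Printing `new.dtypes` shows the dtype of every column, which is a more complete check than reading `new.iloc[0, 0].dtype` for one cell.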
