Pandas with rpy2 and multiprocessing


I'm trying to speed up a process that uses pandas and R.

Suppose that I have the following dataframe:

    import pandas as pd
    from random import randint

    # Build a small random sample; range() works on both Python 2 and 3
    df = pd.DataFrame({'mpg': [randint(1, 9) for _ in range(10)],
                       'wt': [randint(1, 9)*10 for _ in range(10)],
                       'cyl': [randint(1, 9)*100 for _ in range(10)]})
    df

       mpg  wt  cyl
    0    3  40  100
    1    6  30  200
    2    7  70  800
    3    3  50  200
    4    7  50  400
    5    4  10  400
    6    3  70  500
    7    8  30  200
    8    3  40  800
    9    6  60  200

Then I use rpy2 to fit a linear model to the data:

    import rpy2.robjects.packages as rpackages
    import rpy2.robjects as robjects
    from rpy2.robjects import pandas2ri

    pandas2ri.activate()

    base = rpackages.importr('base')
    stats = rpackages.importr('stats')

    formula = 'mpg ~ wt + cyl'
    fit_full = stats.lm(formula, data=df)

After this, I make some predictions:

    rfits = stats.predict(fit_full, newdata=df)

This code runs without problems on a small dataframe, but my real dataframe has millions of rows. I've been trying to speed up the prediction step, including with other rpy2 models, but it still takes a long time to process.

I tried the multiprocessing library for the first time for this task, without success:

    import multiprocessing as mp

    pool = mp.Pool(processes=4)
    rfits = pool.map(predict(fit_full, newdata=df))

I'm probably doing something wrong, since I don't see any speed improvement.

I think the main problem is that I'm trying to apply pool.map to an rpy2 function rather than a plain Python function. There is probably a workaround that doesn't use the multiprocessing library, but I can't see one.

Any help would be greatly appreciated. Thanks in advance.


Have you tried using StatsModels?


<blockquote>
<strong><a href="http://statsmodels.sourceforge.net/devel/example_formulas.html" rel="nofollow">Fitting models using R-style formulas</a></strong>

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs.
</blockquote>

    import statsmodels.formula.api as smf
    formula = 'mpg ~ wt + cyl'
    model = smf.ols(formula=formula, data=df)
    params = model.fit().params

    >>> params
    Intercept    5.752803
    wt           0.037770
    cyl         -0.004112

    >>> model.predict(params, exog=df)
    array([ 1725.83759267,  2876.50148582,   575.25352613,  1150.6605447 ,
            1150.51281171,  3451.54178359,   575.53800931,   575.4146529 ,
            2876.58372342,  5177.46831077])
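One caveat about the `model.predict(params, exog=df)` call: the values it returns are in the thousands, while `mpg` only ranges from 1 to 9. That suggests the raw DataFrame columns were multiplied by the parameters directly, without going through the formula (no intercept column, and the columns not matched to the parameters by name). A minimal sketch of an alternative, assuming a reasonably recent statsmodels and pandas, is to keep the fitted results object and call its `predict`, which re-applies the formula to the new data. The data below is just the ten sample rows printed in the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# The ten sample rows shown in the question, hard-coded for reproducibility.
df = pd.DataFrame({
    'mpg': [3, 6, 7, 3, 7, 4, 3, 8, 3, 6],
    'wt':  [40, 30, 70, 50, 50, 10, 70, 30, 40, 60],
    'cyl': [100, 200, 800, 200, 400, 400, 500, 200, 800, 200],
})

# Fit once and keep the results object, not only its params.
results = smf.ols('mpg ~ wt + cyl', data=df).fit()

# predict() on the results object builds the design matrix from the
# formula, so the intercept and the column matching are handled for us.
preds = results.predict(df)

# Cross-check against the linear formula written out by hand.
p = results.params
manual = p['Intercept'] + p['wt'] * df['wt'] + p['cyl'] * df['cyl']
assert np.allclose(preds, manual)
```

Since this is one vectorized call, it may already be fast enough for millions of rows without multiprocessing, though that would need to be measured on the real data.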


