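I have some Python code that passes a data frame to R via rpy2. R processes it, and I then pull the resulting data.frame back into Python as a pandas data frame via com.load_data.
The problem is that the call to com.load_data works fine in a single Python process, but it crashes when the same code is run concurrently in several multiprocessing.Process processes. I get the following error message from Python: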
File "C:\\Python27\\lib\\site-packages\\pandas\\rpy\\common.py", line 29, in load_data
    r.data(name)
TypeError: 'DataFrame' object is not callable
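So my question is: is rpy2 actually designed to be run in parallel, or is this just a bug in the load_data function? I had simply assumed that each Python process would get its own independent R session. As far as I can tell, the only workaround is to have R write its output to a text file, which the corresponding Python process can then open and continue processing. But that is pretty clunky.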
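A minimal sketch of that file-based workaround, assuming Python 3 and a data frame named preds already present in R. The helper names result_path, r_write_command and read_predictions are hypothetical, and the only rpy2 call involved (ro.r, as used elsewhere in this post) is left in a comment so the block runs without R. Each process writes to a file named after its own PID, so concurrent workers never share a file.

```python
import csv
import os


def result_path(directory, pid=None):
    """Return a per-process CSV path so concurrent workers never share a file."""
    pid = os.getpid() if pid is None else pid
    return os.path.join(directory, "preds_{}.csv".format(pid))


def r_write_command(path):
    """Build the R statement that dumps the data frame `preds` to `path`."""
    # R accepts forward slashes on Windows too, which avoids escaping backslashes.
    # Assumption: the path contains no double quotes.
    r_path = path.replace("\\", "/")
    return 'write.csv(preds, file="{}", row.names=FALSE)'.format(r_path)


def read_predictions(path):
    """Read the CSV written by R back as a list of dicts, one per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Inside each worker, with rpy2 available, the round trip would look like:
#   path = result_path(some_directory)
#   ro.r(r_write_command(path))
#   rows = read_predictions(path)
```

All values come back as strings, so any numeric columns would still need converting on the Python side.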
Update with some code:
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
import pandas as pd
import pandas.rpy.common as com

# Load C50 library into R environment
C50 = importr('C50')
...
# PANDAS data frame containing test dataset
testing = pd.DataFrame(testing)

# Pass testing dataset to R
rtesting = com.convert_to_r_dataframe(testing)
ro.globalenv['test'] = rtesting

# Strip "AsIs" from each column in the R data frame
# so that predict.C5.0 will work
for c in range(len(testing.columns)):
    ro.r('''class(test[,{0}])=class(test[,{0}])[-match("AsIs", class(test[,{0}]))]'''.format(c+1))

# Make predictions on test dataset (res is pre-existing C5.0 tree)
ro.r('''preds=predict.C5.0(res, newdata=test)''')
ro.r('''preds=as.data.frame(preds)''')

# Get the predictions from R
preds = com.load_data('preds')  ### Crashes here when code is run on several processes concurrently

# Further processing as necessary
...
Answer 0 (score: 5)
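rpy works by running a Python process and an R process in parallel and exchanging information between them. It does not take into account R calls being made in parallel through multiprocessing. So in practice, every Python process connects to the same R process. This is probably what causes the problems you are seeing.
One way to work around this is to implement the parallel processing in R rather than in Python. You then send everything to R at once, R processes it in parallel, and the result is sent back to Python.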
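As a rough illustration of that suggestion, the sketch below keeps a single Python process and lets R's parallel package spread the prediction over a small cluster. It assumes that the objects res and test from the question already exist in R's global environment and that the C50 package is installed. build_parallel_predict_code and run_in_r are hypothetical helper names, and splitting test into contiguous row chunks is just one possible choice. It has not been run against the asker's data.

```python
def build_parallel_predict_code(n_workers):
    """Return R code that predicts on `test` in up to `n_workers` row chunks."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Assumption: `res` (the fitted C5.0 tree) and `test` live in R's global env.
    return """
library(parallel)
cl <- makeCluster({n})
clusterEvalQ(cl, library(C50))
clusterExport(cl, "res")
chunk_id <- ceiling(seq_len(nrow(test)) * {n} / nrow(test))
pred_list <- parLapply(cl, split(test, chunk_id),
                       function(chunk) predict(res, newdata = chunk))
stopCluster(cl)
preds <- data.frame(preds = unlist(pred_list, use.names = FALSE))
""".format(n=n_workers)


def run_in_r(code):
    """Evaluate `code` in the embedded R session (requires rpy2 and R)."""
    import rpy2.robjects as ro  # imported lazily so the builder works without rpy2
    return ro.r(code)
```

With rpy2 installed, run_in_r(build_parallel_predict_code(3)) would leave a data frame named preds in R, which can then be fetched once from the single Python process.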
Answer 1 (score: 4)
The following (Python 3) code suggests that, at least when using multiprocessing.Pool, each worker process spawns its own separate R process (@lgutier, is that right?):
import os
import multiprocessing
import time

import rpy2.robjects as robjects

num_processes = 3


def test_r_process(pause):
    # Bump the counter held by this worker's R session and read back R's PID
    n_called = robjects.r("times.called <- times.called + 1")[0]
    r_pid = robjects.r("Sys.getpid()")[0]
    print("R process for worker {} is {}. Pausing for {} seconds.".format(
        os.getpid(), r_pid, pause))
    time.sleep(pause)
    return r_pid, n_called


pause_secs = [2, 4, 3, 6, 7, 2, 3, 5, 1, 2, 3, 3]
results = {}
# Initialise the counter in R before the pool is created
robjects.r("times.called <- 0")
with multiprocessing.Pool(processes=num_processes) as pool:
    for proc, n_called in pool.imap_unordered(test_r_process, pause_secs):
        # Keep the highest counter value seen for each R PID
        results[proc] = max(n_called, results.get(proc) or 0)
print("The test function should have been called {} times".format(len(pause_secs)))
for pid, called in results.items():
    print("R process {} was called {} times".format(pid, called))
On my OS X laptop this produces something like:
R process for worker 22535 is 22535. Pausing for 3 seconds.
R process for worker 22533 is 22533. Pausing for 2 seconds.
R process for worker 22533 is 22533. Pausing for 6 seconds.
R process for worker 22535 is 22535. Pausing for 7 seconds.
R process for worker 22534 is 22534. Pausing for 2 seconds.
R process for worker 22534 is 22534. Pausing for 3 seconds.
R process for worker 22533 is 22533. Pausing for 5 seconds.
R process for worker 22534 is 22534. Pausing for 1 seconds.
R process for worker 22535 is 22535. Pausing for 2 seconds.
R process for worker 22534 is 22534. Pausing for 3 seconds.
R process for worker 22535 is 22535. Pausing for 3 seconds.
The test function should have been called 12 times
R process 22533 was called 3.0 times
R process 22534 was called 5.0 times
R process 22535 was called 4.0 times
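In that output the per-process counts (3, 5 and 4) add up to the 12 calls, and the PID reported by R matches the worker's PID on every line, which is what one would expect if each worker keeps its own private copy of times.called. The same bookkeeping can be reproduced without R. The sketch below is only an analogy: it assumes the "fork" start method is available (Linux or macOS), uses a plain module-level integer in place of the R variable, and the names bump_and_report and run_workers are hypothetical.

```python
import multiprocessing
import os

counter = 0  # module-level state, standing in for R's times.called


def bump_and_report(conn, n_calls):
    """Increment this process's private copy of counter and send it back."""
    global counter
    for _ in range(n_calls):
        counter += 1
    conn.send((os.getpid(), counter))
    conn.close()


def run_workers(calls_per_worker):
    """Fork one worker per entry and collect (pid, final counter) from each."""
    # Assumption: "fork" is supported on this platform (not true on Windows).
    ctx = multiprocessing.get_context("fork")
    started = []
    for n_calls in calls_per_worker:
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=bump_and_report, args=(send_end, n_calls))
        proc.start()
        send_end.close()  # the parent only reads from this pipe
        started.append((proc, recv_end))
    reports = []
    for proc, recv_end in started:
        reports.append(recv_end.recv())
        proc.join()
    return reports


if __name__ == "__main__":
    for pid, count in run_workers([3, 5, 4]):
        print("worker {} counted {} calls".format(pid, count))
    print("parent counter is still {}".format(counter))
```

Each worker reports only its own increments and the parent's counter is left untouched, mirroring the way the counts above are split across the three R PIDs.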