Question

I want to use RDDs in PySpark. I am starting from the simple example that estimates pi:

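from pyspark import SparkContext, SparkConf
import random

appName = "tutApp"
master = "local"
NUM_SAMPLE = 100
NUM_PARALLIZE = xrange(0, NUM_SAMPLE)  # Python 3: range(0, NUM_SAMPLE)

# Create the SparkContext from the settings above.
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

def sample(p):
    # p is the sample index and is unused; draw a random point in the unit square.
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(NUM_PARALLIZE).map(sample).reduce(lambda a, b: a+b)
t = (4.0 * count/NUM_SAMPLE)
print("----- Pi is roughly %f" % t)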
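The map and reduce steps above can be checked locally without a cluster. The following is a minimal sketch in plain Python, assuming Python 3; `estimate_pi` and its `seed` parameter are hypothetical names introduced here for illustration and are not part of the original code.

```python
import random
from functools import reduce


def sample(_):
    # Draw a random point in the unit square; return 1 if it lies inside the quarter circle.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0


def estimate_pi(num_samples, seed=None):
    # Hypothetical helper: same map/reduce shape as the Spark job, run in one process.
    if seed is not None:
        random.seed(seed)
    count = reduce(lambda a, b: a + b, map(sample, range(num_samples)))
    return 4.0 * count / num_samples


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(100000))
```

If the local estimate looks reasonable, the sampling logic itself is fine and any remaining problem lies in how the work is handed to Spark.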

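I want to use the same approach. In my case, the map function (APR) takes a data frame as input, rather than the x, y values used in the sample function. I am using pandas, and Spark seems to have trouble distributing the tasks through the map function.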

sc.parallelize(PrepAPR(0,1)).count()

This gives me the number of nodes, so this part works.

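PrepAPR is a class that lazily returns tuples which can be used as input to the map function. As with the sample function, I want to apply the transformation

.map(lambda t: APR(t)), where t is a tuple of (df, a, b, c).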
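For illustration, the pattern described here, a lazy source of tuples fed to a map function, can be sketched in plain Python. This is only a sketch under assumptions: `prep_tuples` and `apr` are hypothetical stand-ins for `PrepAPR` and `APR`, and a plain list of numbers stands in for the pandas data frame.

```python
def prep_tuples(low, high, frames):
    # Hypothetical stand-in for PrepAPR: lazily yield one (df, a, b, c) tuple
    # per index from low to high inclusive. A generator behaves the same on
    # Python 2 and Python 3, unlike a hand-written next()/__next__() method.
    for i in range(low, high + 1):
        yield (frames[i], i, low, high)


def apr(t):
    # Hypothetical stand-in for APR: unpack the tuple and return a small summary.
    df, a, b, c = t
    return (a, sum(df) / float(len(df)))


# Lists of numbers stand in for data frames (an assumption of this sketch).
frames = [[1.0, 2.0, 3.0], [10.0, 20.0]]

# Locally, map() consumes the generator one tuple at a time.
results = list(map(apr, prep_tuples(0, 1, frames)))
print(results)
```

With Spark, the corresponding call would be `sc.parallelize(list(prep_tuples(0, 1, frames))).map(apr)`. In that case every tuple, including the data frame inside it, has to be serialized and shipped to the workers.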

I get this error:

SystemError: error return without exception set.
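The cause of this error is not established here. One possible way to narrow it down, assuming it arises while the tuples are being serialized (PySpark pickles RDD elements before sending them to workers), is to pickle a single element locally and look at its size. `pickled_size` is a hypothetical helper name, and the small tuple below is a stand-in for a real (df, a, b, c) element.

```python
import pickle


def pickled_size(obj, protocol=2):
    # Hypothetical helper: serialize obj with the standard pickle module and
    # return the number of bytes produced. Raises if obj cannot be pickled.
    return len(pickle.dumps(obj, protocol))


# Stand-in for one (df, a, b, c) element; replace it with a real tuple to test it.
element = ([1.0, 2.0, 3.0], 0, 0, 1)
size = pickled_size(element)
print("pickled size in bytes: %d" % size)

# Round trip: the element should come back unchanged.
assert pickle.loads(pickle.dumps(element, 2)) == element
```

If a real element fails to pickle, or its size is very large, that would point toward serialization rather than the map function itself.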

I would be grateful if anyone could answer my question.

"""APR Spark"""

from pyspark import SparkContext, SparkConf

class PrepAPR:
    def __init__(self, low, high):
        self.current = low
        self.high = high

    def __iter__(self):
        return self

    def prep_data_for_turbine(self):
        turbines=all_turbines[self.current:self.current+1]
        return turbines

    def next(self): # Python 3: def __next__(self)
        if self.current > self.high:
            raise StopIteration
        else:
            # tpl = self.current
            tpl = self.prep_data_for_turbine()
            self.current += 1
            return tpl


def sami(t):
    # MODEL, dFrames, FEATURES, all_turbines and actual_pred_resid are defined elsewhere (not shown).
    model = MODEL
    df = dFrames["Data"]
    features = FEATURES
    turbines = all_turbines[t:t+1]
    return actual_pred_resid(df, model, features, turbines)


# print sc.parallelize(PrepAPR(0,1)).count()
APR_df = sc.parallelize([1,2]).map(lambda t: sami(t)).count()
#print APR_df.collect()

PySpark: How to send an RDD to a map function, and in Spark

0 answers: