How do I run <mydataframe>.groupBy("Fields").apply(Fn)?

Asked: 2018-07-24 21:31:40

Tags: python pandas apache-spark pyspark amazon-emr

I need to write a custom GroupBy.Apply() function for PySpark, so I followed the post "Introducing Pandas UDF for PySpark".

As described there, I wrote:

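from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(<mydf>.schema, PandasUDFType.GROUPED_MAP)
def tstFn(x):
    # x is a pandas DataFrame holding the rows of one group.
    # The returned frame must match the declared schema, so 'Test'
    # has to be a column of <mydf>.schema.
    x['Test'] = x['TotalErrors'].sum()
    return x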
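
For readers unfamiliar with grouped-map semantics, here is a minimal plain-Python sketch of what the function above is meant to do per group: split the rows by key, apply the function to each group, and concatenate the results. It does not use Spark or pandas. The helper `grouped_map`, the list-of-dicts representation, and the sample rows are hypothetical and exist only for illustration; `Fields`, `TotalErrors`, and `Test` are the names used in the question.

```python
from collections import defaultdict


def tst_fn(rows):
    # rows: list of dicts for ONE group (a stand-in for the pandas DataFrame x).
    total = sum(r["TotalErrors"] for r in rows)
    # Attach the group total to every row of the group.
    return [dict(r, Test=total) for r in rows]


def grouped_map(rows, key, fn):
    # Hypothetical helper: split rows by `key`, apply `fn` per group, concatenate.
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    out = []
    for group_rows in groups.values():
        out.extend(fn(group_rows))
    return out


# Made-up sample data, for illustration only.
sample = [
    {"Fields": "a", "TotalErrors": 1},
    {"Fields": "a", "TotalErrors": 2},
    {"Fields": "b", "TotalErrors": 5},
]
result = grouped_map(sample, "Fields", tst_fn)
```

Each row of group "a" gets `Test` set to 3 and the single row of group "b" gets 5, which is the behaviour the grouped-map UDF is expected to reproduce on the real DataFrame.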

I then found that pyarrow was not installed, so I installed it with sudo pip install pyarrow. The job still fails with ImportError: No module named pyarrow. It looks like pyarrow has to be installed on all the DataNodes.
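
One thing that may be worth checking is whether the Python interpreter that PySpark actually runs is the same one that `sudo pip` installed into. The sketch below uses only the standard library to report whether a module is importable and which interpreter is doing the importing. The helper name `module_report` is hypothetical, and the idea that an interpreter mismatch is the cause here is an assumption, not a confirmed diagnosis.

```python
import importlib
import os
import sys


def module_report(name):
    """Report whether `name` can be imported by the current interpreter."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return {"module": name, "available": False,
                "location": None, "python": sys.executable}
    return {"module": name, "available": True,
            "location": getattr(module, "__file__", None),
            "python": sys.executable}


if __name__ == "__main__":
    # Run this inside the pyspark shell (or with the same interpreter)
    # and compare the interpreter path with the one pip installed into.
    print(module_report("pyarrow"))
    print("PYSPARK_PYTHON =", os.environ.get("PYSPARK_PYTHON"))
```

If the driver reports pyarrow as available, the same function could in principle be mapped over a small RDD, for example `sc.parallelize(range(4)).map(lambda _: module_report("pyarrow")).collect()`, to see what the worker-side interpreters report. That usage is a suggestion and was not part of the original question.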

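Question: now that pyarrow is installed on the client, why do I still get the error, even when I start PySpark locally with pyspark --deploy-mode client?

The deploy mode appears to be client by default (Post 1, Post 2). If so, the question still stands!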

>>> conf.get("spark.submit.deployMode")
u'client'
>>>

0 answers:

No answers yet