How to pass a PySpark DataFrame as a parameter to a custom map function (multiple map arguments)

Date: 2020-09-06 05:12:30

Tags: apache-spark pyspark apache-spark-sql pyspark-dataframes

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

sal_df = spark.createDataFrame([100,200,300], "integer").toDF("salary")
sal_rdd = spark.sparkContext.parallelize([1000,2000,3000])

def processDataLine(arg1, arg2, df):
    def _processDataLine(row):
        return df.count() + arg1 + arg2 + row
    return _processDataLine


arg1, arg2 = 0, 0
sal_rdd.map(processDataLine(arg1, arg2, sal_df))  # fails: sal_df cannot be pickled into the task closure

Error:

Py4JError: An error occurred while calling o5792.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o5792.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I gather that we cannot pass a PySpark DataFrame directly to a custom map function, but I need to access sal_df inside the _processDataLine function for further processing.
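(For context: a DataFrame is only a thin Python wrapper around a JVM-side object reached through the py4j gateway. When Spark pickles the map closure, py4j ends up forwarding the __getstate__ lookup to the JVM as a method call, which is exactly the "Method __getstate__([]) does not exist" failure in the trace above.)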

I tried converting the DataFrame to JSON:

def processDataLine(arg1, arg2, df_json):
    def _processDataLine(dataline):
        return len(df_json) + arg1 + arg2 + dataline
    return _processDataLine

sal_rdd.map(processDataLine(arg1, arg2, sal_df.toJSON())).take(10)

Error:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
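Which makes sense: sal_df.toJSON() returns an RDD of JSON strings rather than a local collection, so referencing it inside map runs into exactly this restriction.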

This works, though:

def processDataLine(arg1, arg2, df_list):
    def _processDataLine(dataline):
        return len(df_list) + arg1 + arg2 + dataline
    return _processDataLine

sal_rdd.map(processDataLine(arg1, arg2, sal_df.toJSON().collect())).take(10)

OUTPUT - [1003, 2003, 3003]

How can I pass and access a PySpark DataFrame object directly inside _processDataLine?
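One pattern that does make the data available inside _processDataLine (a sketch, assuming the collected rows fit in driver and executor memory) is to collect once on the driver and broadcast the result, so each executor reads a local copy instead of a pickled DataFrame:

rows = sal_df.collect()                        # plain list of Row objects, built on the driver
bc_rows = spark.sparkContext.broadcast(rows)   # shipped once to each executor

def processDataLine(arg1, arg2):
    def _processDataLine(dataline):
        local_rows = bc_rows.value             # plain Python list, safe to use inside a task
        return len(local_rows) + arg1 + arg2 + dataline
    return _processDataLine

sal_rdd.map(processDataLine(arg1, arg2)).take(10)
# [1003, 2003, 3003]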

1 answer:

Answer 0 (score: 0):

Is this what you want?

from pyspark.sql.functions import col, lit

df1 = spark.createDataFrame([100,200,300], 'integer').toDF('salary')
df2 = spark.createDataFrame([1000,2000,3000], 'integer').toDF('value')

count = df1.count()   # computed once on the driver
arg1 = 0
arg2 = 0

def func(count, arg1, arg2, df):
    return df.withColumn('result', col('value') + lit(arg1) + lit(arg2) + lit(count))

df3 = func(count, arg1, arg2, df2)
df3.show()

+-----+------+
|value|result|
+-----+------+
| 1000|  1003|
| 2000|  2003|
| 3000|  3003|
+-----+------+
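Note that this sidesteps the serialization problem entirely: the only thing taken from df1 is its count, computed once on the driver, and it enters df2 as a literal column, so nothing ever has to be pickled into a Python closure.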