from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sal_df = spark.createDataFrame([100,200,300], "integer").toDF("salary")
sal_rdd = spark.sparkContext.parallelize([1000,2000,3000])
def processDataLine(arg1, arg2, df):
    def _processDataLine(row):
        return df.count() + arg1 + arg2 + row
    return _processDataLine
arg1, arg2 = 0, 0
sal_rdd.map(processDataLine(arg1, arg2, sal_df))
Error:
Py4JError: An error occurred while calling o5792.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
PicklingError: Could not serialize object: Py4JError: An error occurred while calling o5792.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I think we can't pass a PySpark DataFrame directly into a custom map function: the DataFrame is a driver-side handle to a JVM object, so it can't be pickled into the task closure. But I need to access sal_df inside the _processDataLine function for further processing.
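The closure itself serializes fine once it captures only plain Python values. A minimal sketch that precomputes the count on the driver, reusing sal_df and sal_rdd from above:

def processDataLine(arg1, arg2, n):
    def _processDataLine(row):
        # n is a plain int captured in the closure, so pickling succeeds
        return n + arg1 + arg2 + row
    return _processDataLine

n = sal_df.count()  # computed on the driver, before the transformation
print(sal_rdd.map(processDataLine(arg1, arg2, n)).take(10))
# [1003, 2003, 3003]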
I tried:
def processDataLine(arg1, arg2, df_json):
    def _processDataLine(dataline):
        return len(df_json) + arg1 + arg2 + dataline
    return _processDataLine
sal_rdd.map(processDataLine(arg1, arg2, sal_df.toJSON())).take(10)
Error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
That fails because sal_df.toJSON() returns an RDD of JSON strings, so the closure still references an RDD. Collecting it to a local list first does work:
def processDataLine(arg1, arg2, df_list):
    def _processDataLine(dataline):
        return len(df_list) + arg1 + arg2 + dataline
    return _processDataLine
sal_rdd.map(processDataLine(arg1, arg2, sal_df.toJSON().collect())).take(10)
Output: [1003, 2003, 3003]
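Collecting works, but the collected list gets pickled into every task's closure. For anything larger, a broadcast variable ships the data to each executor once. A sketch of the same example using spark.sparkContext.broadcast (same names as above):

# Collect once on the driver, then broadcast the plain Python list so
# each executor receives a single read-only copy.
bc = spark.sparkContext.broadcast(sal_df.toJSON().collect())

def processDataLine(arg1, arg2, bc):
    def _processDataLine(dataline):
        # bc.value exposes the broadcast list inside the task
        return len(bc.value) + arg1 + arg2 + dataline
    return _processDataLine

print(sal_rdd.map(processDataLine(arg1, arg2, bc)).take(10))
# [1003, 2003, 3003]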
How can I pass and access a PySpark DataFrame object directly inside _processDataLine?
Answer 0 (score: 0)
Is this what you want?
from pyspark.sql.functions import col, lit

df1 = spark.createDataFrame([100, 200, 300], 'integer').toDF('salary')
df2 = spark.createDataFrame([1000, 2000, 3000], 'integer').toDF('value')
count = df1.count()
arg1 = 0
arg2 = 0

def func(count, arg1, arg2, df):
    # count, arg1 and arg2 are plain driver-side values, so they can be
    # folded in as literal columns instead of being captured in a closure
    return df.withColumn('result', col('value') + lit(arg1) + lit(arg2) + lit(count))

df3 = func(count, arg1, arg2, df2)
df3.show()
+-----+------+
|value|result|
+-----+------+
| 1000| 1003|
| 2000| 2003|
| 3000| 3003|
+-----+------+
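If you'd rather not pull even the count back to the driver, the same result can stay entirely in the DataFrame API by cross-joining df2 against a one-row aggregate of df1. A sketch of that variant (the column alias salary_count is illustrative):

from pyspark.sql import functions as F

# Aggregate df1 to a single row, then cross-join so every row of df2
# sees the aggregate without a driver-side collect.
agg = df1.agg(F.count('salary').alias('salary_count'))
df4 = (df2.crossJoin(agg)
          .withColumn('result',
                      F.col('value') + F.lit(arg1) + F.lit(arg2) + F.col('salary_count'))
          .drop('salary_count'))
df4.show()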