I have a PySpark dataframe like the following:
Stock | open_price | list_price
A | 100 | 1
B | 200 | 2
C | 300 | 3
I am trying to use map and RDD to produce the following output, where for each individual row it prints the stock, open_price * list_price, and the sum of the entire open_price column:
(A, 100 , 600)
(B, 400, 600)
(C, 900, 600)
So using the table above, the first row for example would be: A, 100 * 1, 100 + 200 + 300.
I can get the first two columns using the code below:
stockNames = sqlDF.rdd.map(lambda p: (p.stock, p.open_price * p.list_price)).collect()
for name in stockNames:
    print(name)
However, when I try to take sum(p.open_price) as follows:
stockNames = sqlDF.rdd.map(lambda p: (p.stock, p.open_price * p.list_price, sum(p.open_price))).collect()
for name in stockNames:
    print(name)
it gives me the error below:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 75.0 failed 1 times, most recent failure: Lost task 0.0 in stage 75.0 (TID 518, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 229, in main
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 224, in process
File "C:\Spark\spark-2.3.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 372, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-48-f08584cc31c6>", line 19, in <lambda>
TypeError: 'int' object is not iterable
How can I add the sum of open_price inside the map over the RDD?
Thanks in advance, as I am still new to RDDs and map.
Answer 0 (score: 1)
Compute the sum separately:
df = spark.createDataFrame(
    [("A", 100, 1), ("B", 200, 2), ("C", 300, 3)],
    ("stock", "price", "list_price")
)
total = df.selectExpr("sum(price) AS total")
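Note that total here is a one-row DataFrame rather than a plain number; if you just want the value itself, you can pull it out with first() (the same trick is used below):

total_value = total.first()[0]
print(total_value)
# 600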
and add it as a column:
from pyspark.sql.functions import lit
df.withColumn("total", lit(total.first()[0])).show()
# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
or use crossJoin:
df.crossJoin(total).show()
# +-----+-----+----------+-----+
# |stock|price|list_price|total|
# +-----+-----+----------+-----+
# |    A|  100|         1|  600|
# |    B|  200|         2|  600|
# |    C|  300|         3|  600|
# +-----+-----+----------+-----+
RDD.map is not really applicable here (you could use it in place of withColumn, but it would be inefficient and I wouldn't recommend it).
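If you still want to stay on the RDD side as in the question, a minimal sketch (assuming the original sqlDF with columns stock, open_price and list_price) is to compute the total once with a separate RDD action and then reference that plain Python number inside map:

# Compute the column total first with an RDD action (100 + 200 + 300 = 600).
total_open = sqlDF.rdd.map(lambda p: p.open_price).sum()

# Then use the already-computed number inside the per-row map.
stockNames = sqlDF.rdd.map(
    lambda p: (p.stock, p.open_price * p.list_price, total_open)
).collect()

for name in stockNames:
    print(name)
# ('A', 100, 600)
# ('B', 400, 600)
# ('C', 900, 600)

The original sum(p.open_price) fails because p.open_price is a single int for that row, and Python's sum() expects an iterable; the column total has to be computed across the whole dataset first, as above or with the DataFrame approaches shown earlier.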