Suppose I have an RDD. On this RDD I perform some operations that produce an output. Now I need both that output and the original RDD to perform further operations.
What is the way to do this?
Here is my code:
rdd = sc.parallelize(input)
rdd1 = rdd.map(...)
...
output1 = rdd1.collect() # output I need
output2 = rdd.map(some operations using output1)
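One way to make this pattern concrete is to collect the first result on the driver and broadcast it back to the executors. A minimal sketch, where the input data and the map operations are hypothetical stand-ins:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1, 2, 3, 4])   # hypothetical input
rdd1 = rdd.map(lambda x: x * 10)     # stand-in for the first transformation

output1 = rdd1.collect()             # output I need, now on the driver
bc = sc.broadcast(output1)           # ship it back to the executors

# use both the broadcast output and the original RDD
output2 = rdd.map(lambda x: x + sum(bc.value)).collect()  # [101, 102, 103, 104]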
Answer 0 (score: 2)
With window functions:
Before we start, let's convert the rdd into a DataFrame:
# assumes an active SparkSession `spark` and SparkContext `sc`
df = spark.createDataFrame(
    sc.parallelize(
        [['a', 1, [1, 2]], ['a', 2, [1, 1]], ['a', 3, [2, 2]], ['b', 4, [2, 2]]]
    ), ['c1', 'c2', 'c3']
)
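For reference, this is the DataFrame we start from (expected output of df.show(); row order may vary):

df.show()
+---+---+------+
| c1| c2|    c3|
+---+---+------+
|  a|  1|[1, 2]|
|  a|  2|[1, 1]|
|  a|  3|[2, 2]|
|  b|  4|[2, 2]|
+---+---+------+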
First, we count the occurrences of each element of c3 within each group c1:
from pyspark.sql import Window
import pyspark.sql.functions as psf

# one window per array position: count rows sharing the same c1 and the same element
w1 = Window.partitionBy("c1", df.c3[0])
w2 = Window.partitionBy("c1", df.c3[1])
df1 = df.select(
    "c1", "c2", "c3",
    psf.count("*").over(w1).alias("count1"),
    psf.count("*").over(w2).alias("count2")
)
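As a sanity check, df1 now carries the per-element counts; expected output (row order may vary):

df1.show()
+---+---+------+------+------+
| c1| c2|    c3|count1|count2|
+---+---+------+------+------+
|  a|  1|[1, 2]|     2|     2|
|  a|  2|[1, 1]|     2|     1|
|  a|  3|[2, 2]|     1|     2|
|  b|  4|[2, 2]|     1|     1|
+---+---+------+------+------+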
Next, we find the most frequent element for each position:
# order each group by descending count; `first` then picks the most frequent element
w1 = Window.partitionBy("c1").orderBy(psf.desc("count1"))
w2 = Window.partitionBy("c1").orderBy(psf.desc("count2"))
df2 = df1.select(
    "c1", "c2", "c3",
    psf.first(df1.c3[0]).over(w1).alias("most_freq1"),
    psf.first(df1.c3[1]).over(w2).alias("most_freq2")
)
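At this point every row in a group carries that group's most frequent elements; expected output (row order may vary):

df2.show()
+---+---+------+----------+----------+
| c1| c2|    c3|most_freq1|most_freq2|
+---+---+------+----------+----------+
|  a|  1|[1, 2]|         1|         2|
|  a|  2|[1, 1]|         1|         2|
|  a|  3|[2, 2]|         1|         2|
|  b|  4|[2, 2]|         2|         2|
+---+---+------+----------+----------+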
Then, we compute the Euclidean distance between c3 and the most frequent element:
# Euclidean distance between c3 and (most_freq1, most_freq2)
df3 = df2.withColumn(
    "dist",
    psf.sqrt((df2.most_freq1 - df2.c3[0])**2 + (df2.most_freq2 - df2.c3[1])**2)
)
df3.show()
+---+---+------+----------+----------+----+
| c1| c2|    c3|most_freq1|most_freq2|dist|
+---+---+------+----------+----------+----+
|  b|  4|[2, 2]|         2|         2| 0.0|
|  a|  1|[1, 2]|         1|         2| 0.0|
|  a|  3|[2, 2]|         1|         2| 1.0|
|  a|  2|[1, 1]|         1|         2| 1.0|
+---+---+------+----------+----------+----+