I have the following two DataFrames:

DF1:
["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]

DF2:
["big","world"]
["sample","overflow","alice","text","bob"]
["hello","sample"]

For every row in df1, I want to count how many times the words in its array appear in df2. For example, the first row of df1 is ["hello","world"]. I want to take its intersection with every row of df2:

| ARRAY                                      | INTERSECTION | LEN(INTERSECTION) |
| ["big","world"]                            | ["world"]    | 1                 |
| ["sample","overflow","alice","text","bob"] | []           | 0                 |
| ["hello","sample"]                         | ["hello"]    | 1                 |

and then compute sum(len(intersection)), which is 2 here. In the end, I want df1 to look like this:

df1 result:

| ARRAY                | INTERSECTION_TOTAL |
| ["hello","world"]    | 2                  |
| ["stack","overflow"] | 1                  |
| ["hello","alice"]    | 2                  |
| ["sample","text"]    | 3                  |
How can I solve this?
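For reference, the expected totals can be checked against a plain-Python sketch of the definition above (set intersection per df2 row, then summed), using the sample data from the question:

```python
df1 = [["hello", "world"], ["stack", "overflow"], ["hello", "alice"], ["sample", "text"]]
df2 = [["big", "world"], ["sample", "overflow", "alice", "text", "bob"], ["hello", "sample"]]

def intersection_total(row, other):
    # sum of len(intersection) against every array in `other`
    return sum(len(set(row) & set(arr)) for arr in other)

result = [(row, intersection_total(row, df2)) for row in df1]
# result pairs each df1 array with its INTERSECTION_TOTAL
```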
Answer (score: 1)
I would first focus on avoiding a Cartesian product. You can explode and join:
from pyspark.sql.functions import explode, monotonically_increasing_id

# Tag each df1 row with an id, then explode its array into one word per row
df1_ = (df1.toDF("words")
    .withColumn("id_1", monotonically_increasing_id())
    .select("*", explode("words").alias("word")))

# Same for df2, keeping only the id and the exploded word
df2_ = (df2.toDF("words")
    .withColumn("id_2", monotonically_increasing_id())
    .select("id_2", explode("words").alias("word")))

# Join on word: each matched pair is one element of a pairwise intersection.
# The first groupBy gives len(intersection) per (df1 row, df2 row) pair,
# the second sums those lengths over each df1 row.
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
    .groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+
| words|sum(count)|
+-----------------+----------+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+----------+
If the intermediate values are not needed, this can be simplified to:
df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+
| words|count|
+-----------------+-----+
| [hello, alice]| 2|
| [sample, text]| 3|
|[stack, overflow]| 1|
| [hello, world]| 2|
+-----------------+-----+
and you can omit adding the ids (note that identical arrays in df1 would then be collapsed into a single row).
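A quick way to see why the exploded join gives these numbers: joining on word and counting is the same as, for each word in a df1 array, counting how many df2 rows contain it, then summing. A plain-Python mirror of that idea (sample data assumed from the question):

```python
from collections import Counter

df1 = [["hello", "world"], ["stack", "overflow"], ["hello", "alice"], ["sample", "text"]]
df2 = [["big", "world"], ["sample", "overflow", "alice", "text", "bob"], ["hello", "sample"]]

# "Explode" df2: for every word, count how many df2 rows contain it
word_rows = Counter(w for arr in df2 for w in set(arr))

# The join-and-count then sums those per-word counts over each df1 array
totals = [sum(word_rows[w] for w in set(row)) for row in df1]
```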