PySpark: comparing array values in one DataFrame with array values in another DataFrame to get the intersection

Date: 2017-08-01 18:51:19

Tags: apache-spark pyspark apache-spark-sql

I have the following two DataFrames:

DF1:

["hello","world"]
["stack","overflow"]
["hello","alice"]
["sample","text"]

DF2:

["big","world"]
["sample","overflow","alice","text","bob"]
["hello", "sample"]

For each row in df1, I want to count the total number of times the words in its array appear in df2.

For example, the first row in df1 is ["hello","world"]. I want to check its intersection with every row in df2:

| ARRAY                                      | INTERSECTION | LEN(INTERSECTION) |
| ["big","world"]                            | ["world"]    | 1                 |
| ["sample","overflow","alice","text","bob"] | []           | 0                 |
| ["hello","sample"]                         | ["hello"]    | 1                 |
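
In plain Python terms, the per-row logic is something like this (illustration only, not the Spark solution):

row = ["hello", "world"]
df2_rows = [
    ["big", "world"],
    ["sample", "overflow", "alice", "text", "bob"],
    ["hello", "sample"],
]

# Intersection of the df1 row with each df2 row, then the summed sizes
intersections = [set(row) & set(other) for other in df2_rows]
print(intersections)                          # [{'world'}, set(), {'hello'}]
print(sum(len(i) for i in intersections))     # 1 + 0 + 1 = 2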

Now, I want to return sum(len(intersection)) over all rows of df2, which for ["hello","world"] is 1 + 0 + 1 = 2. I want to do this for every row of df1, so that in the end df1 looks like this:

df1 result:

| ARRAY                | INTERSECTION_TOTAL |
| ["hello","world"]    | 2                  |
| ["stack","overflow"] | 1                  |
| ["hello","alice"]    | 2                  |
| ["sample","text"]    | 3                  |


How can I solve this?

1 Answer:

Answer 0: (score: 1)

I would focus on avoiding a Cartesian product first. You can explode and join:

from pyspark.sql.functions import explode, monotonically_increasing_id

# Tag every row of df1 with a unique id, then explode the array
# so there is one row per (id, word) pair
df1_ = (df1.toDF("words")
    .withColumn("id_1", monotonically_increasing_id())
    .select("*", explode("words").alias("word")))

# Same for df2, keeping only the id and the exploded word
df2_ = (df2.toDF("words")
    .withColumn("id_2", monotonically_increasing_id())
    .select("id_2", explode("words").alias("word")))

# Join on the shared word, count matches per (df1 row, df2 row) pair,
# then sum those per-pair intersection sizes for each df1 row
(df1_.join(df2_, "word").groupBy("id_1", "id_2", "words").count()
    .groupBy("id_1", "words").sum("count").drop("id_1").show())
+-----------------+----------+                                                  
|            words|sum(count)|
+-----------------+----------+
|   [hello, alice]|         2|
|   [sample, text]|         3|
|[stack, overflow]|         1|
|   [hello, world]|         2|
+-----------------+----------+

If you don't need the intermediate values, this can be simplified to:

df1_.join(df2_, "word").groupBy("words").count().show()
+-----------------+-----+                                                       
|            words|count|
+-----------------+-----+
|   [hello, alice]|    2|
|   [sample, text]|    3|
|[stack, overflow]|    1|
|   [hello, world]|    2|
+-----------------+-----+

In that case you can also omit adding the IDs.
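
One caveat the example data does not exercise (my addition, not part of the original answer): since the join is inner, a df1 row whose words never appear in df2 drops out of the output entirely. A hedged sketch for keeping such rows with a count of 0, reusing df1_ and df2_ from above:

# Hypothetical extension: keep df1 rows that have no matches at all
ids = df1_.select("id_1", "words").distinct()             # one row per original df1 row
counts = df1_.join(df2_, "word").groupBy("id_1").count()  # only rows with matches
(ids.join(counts, "id_1", "left")                         # left join keeps zero-match rows
    .na.fill(0, ["count"])                                # missing count -> 0
    .drop("id_1")
    .show())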