Intersection and union of two columns in a DataFrame

Date: 2017-02-22 23:16:35

Tags: apache-spark apache-spark-sql spark-dataframe

I have a DataFrame in the following format:

movieId1 | genreList1          | movieId2 | genreList2
---------------------------------------------------------------
1        |[Adventure,Comedy]   | 2        |[Adventure,Comedy]
1        |[Animation,Drama]    | 3        |[War,Drama]

The DataFrame schema is:

 StructType(
     StructField(movieId1,IntegerType,false),
     StructField(genreList1,ArrayType(StringType,true),true),
     StructField(movieId2,IntegerType,false),
     StructField(genreList2,ArrayType(StringType,true),true)
 )

I would like to know whether there is a way to create a new DataFrame with an additional column holding the Jaccard coefficient of the two genre lists in each row:

jaccardCoefficient(Set1, Set2) = (Set1 intersect Set2).size / (Set1 union Set2).size
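
For instance, in the second row the only shared genre is Drama, so:

jaccardCoefficient([Animation,Drama], [War,Drama])
    = |{Drama}| / |{Animation,Drama,War}|
    = 1/3 ≈ 0.33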

movieId1 | movieId2 | jaccardcoeff
---------------------------------------------------------------
1        | 2        | 1
1        | 3        | 0.33

Any help is much appreciated. Thanks.

1 Answer:

Answer 0 (score: 3):

Given this input DataFrame:

+--------+-------------------+--------+-------------------+
|movieId1|         genreList1|movieId2|         genreList2|
+--------+-------------------+--------+-------------------+
|       1|[Adventure, Comedy]|       2|[Adventure, Comedy]|
|       1| [Animation, Drama]|       3|       [War, Drama]|
+--------+-------------------+--------+-------------------+

with schema:

StructType(
   StructField(movieId1,IntegerType,false),    
   StructField(genreList1,ArrayType(StringType,true),true),    
   StructField(movieId2,IntegerType,false),     
   StructField(genreList2,ArrayType(StringType,true),true))
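
(For reference, a minimal sketch of how such a test DataFrame could be built, assuming an active SparkSession named spark; note that toDF infers nullability slightly differently from the schema above:)

 import spark.implicits._

 // hypothetical test data matching the layout above
 val input = Seq(
   (1, Seq("Adventure", "Comedy"), 2, Seq("Adventure", "Comedy")),
   (1, Seq("Animation", "Drama"), 3, Seq("War", "Drama"))
 ).toDF("movieId1", "genreList1", "movieId2", "genreList2")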

you can simply use a UDF to compute the Jaccard coefficient:

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

// Jaccard coefficient: |intersection| / |union| of the two genre sets
val jaccardCoefficient = udf { (set1: WrappedArray[String], set2: WrappedArray[String]) =>
  (set1.toSet intersect set2.toSet).size.toDouble / (set1.toSet union set2.toSet).size.toDouble }

Apply this UDF as follows (selecting just the id columns and the new coefficient, to match the desired output):

 input
   .withColumn("jaccardcoeff", jaccardCoefficient($"genreList1", $"genreList2"))
   .select("movieId1", "movieId2", "jaccardcoeff")

to get your desired output:

+--------+--------+------------------+
|movieId1|movieId2|      jaccardcoeff|
+--------+--------+------------------+
|       1|       2|               1.0|
|       1|       3|0.3333333333333333|
+--------+--------+------------------+
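
As a side note, on Spark 2.4+ the same result can be computed without a UDF, using the built-in array_intersect and array_union functions (a sketch, assuming the same input DataFrame as above):

 import org.apache.spark.sql.functions.{array_intersect, array_union, col, size}

 // built-in array functions (Spark 2.4+) avoid the UDF serialization overhead
 val result = input
   .withColumn("jaccardcoeff",
     size(array_intersect(col("genreList1"), col("genreList2"))).cast("double") /
       size(array_union(col("genreList1"), col("genreList2"))))
   .select("movieId1", "movieId2", "jaccardcoeff")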