I have a DataFrame in the following format.
movieId1 | genreList1 | movieId2 | genreList2
---------------------------------------------------------------
1 |[Adventure,Comedy] | 2 |[Adventure,Comedy]
1 |[Animation,Drama] | 3 |[War,Drama]
The DataFrame schema is:
StructType(
StructField(movieId1,IntegerType,false),
StructField(genreList1,ArrayType(StringType,true),true),
StructField(movieId2,IntegerType,false),
StructField(genreList2,ArrayType(StringType,true),true)
)
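I would like to know whether there is a way to create a new DataFrame with an added column holding the Jaccard coefficient of the two genre lists in each row.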
jaccardCoefficient(Set1, Set2) = (Set1 intersect Set2).size / (Set1 union Set2).size
movieId1 | movieId2 | jaccardcoeff
---------------------------------------------------------------
1 | 2 | 1
1 | 3 | 0.3333
Any help is greatly appreciated. Thanks.
Answer 0 (score: 3)
Given this input DataFrame:
+--------+-------------------+--------+-------------------+
|movieId1| genreList1|movieId2| genreList2|
+--------+-------------------+--------+-------------------+
| 1|[Adventure, Comedy]| 2|[Adventure, Comedy]|
| 1| [Animation, Drama]| 3| [War, Drama]|
+--------+-------------------+--------+-------------------+
with schema:
StructType(
StructField(movieId1,IntegerType,false),
StructField(genreList1,ArrayType(StringType,true),true),
StructField(movieId2,IntegerType,false),
StructField(genreList2,ArrayType(StringType,true),true))
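For anyone who wants to reproduce this locally, here is a minimal sketch of building such a DataFrame. It assumes a local SparkSession; the app name and the `local[*]` master are illustrative choices, not part of the original setup.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session, only for trying the example out.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("jaccard-example")
  .getOrCreate()
import spark.implicits._

// Int fields become non-nullable IntegerType columns and
// Seq[String] fields become nullable array<string> columns.
val input = Seq(
  (1, Seq("Adventure", "Comedy"), 2, Seq("Adventure", "Comedy")),
  (1, Seq("Animation", "Drama"), 3, Seq("War", "Drama"))
).toDF("movieId1", "genreList1", "movieId2", "genreList2")

input.printSchema()
```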
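You can simply use a UDF to compute the Jaccard coefficient:
import org.apache.spark.sql.functions.udf

// Jaccard coefficient |A ∩ B| / |A ∪ B|, computed on sets so duplicate genres are ignored
val jaccardCoefficient = udf { (genres1: Seq[String], genres2: Seq[String]) =>
  val (a, b) = (genres1.toSet, genres2.toSet)
  a.intersect(b).size.toDouble / a.union(b).size.toDouble
}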
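The same arithmetic can be checked without Spark. The sketch below is plain Scala applied to the two sample rows; the helper name `jaccard` is hypothetical, and returning 0.0 for two empty sets is an assumption made here to avoid a 0/0 division, not something the question specifies.

```scala
// Hypothetical helper mirroring the formula |A ∩ B| / |A ∪ B|.
def jaccard(a: Set[String], b: Set[String]): Double = {
  val unionSize = a.union(b).size
  // Assumption: two empty sets are given a coefficient of 0.0.
  if (unionSize == 0) 0.0 else a.intersect(b).size.toDouble / unionSize
}

val row1 = jaccard(Set("Adventure", "Comedy"), Set("Adventure", "Comedy"))
val row2 = jaccard(Set("Animation", "Drama"), Set("War", "Drama"))

println(row1) // 2 shared genres out of 2 distinct genres: 1.0
println(row2) // 1 shared genre (Drama) out of 3 distinct genres: 0.3333333333333333
```

Null genre arrays are not handled in this sketch; since the schema marks both list columns as nullable, a real job may need to filter or default those rows first.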
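Apply the jaccardCoefficient UDF as follows:
input.withColumn("jaccardcoeff", jaccardCoefficient($"genreList1", $"genreList2"))
  .select("movieId1", "movieId2", "jaccardcoeff").show()
to get your desired output: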
+--------+--------+------------+
|movieId1|movieId2|jaccardcoeff|
+--------+--------+------------+
| 1| 2| 1|
| 1| 3| 0.33333|
+--------+--------+------------+
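As a possible alternative to a UDF, Spark 2.4 and later provide the built-in functions `array_intersect`, `array_union` and `size`, both of the array functions returning results without duplicates. The sketch below assumes such a Spark version; the helper name `jaccardColumn` is hypothetical, and null or empty arrays are not handled.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array_intersect, array_union, size}

// Hypothetical helper: builds a Column expression for |A ∩ B| / |A ∪ B|.
// Column division in Spark SQL yields a double, so no explicit cast is used.
def jaccardColumn(left: Column, right: Column): Column =
  size(array_intersect(left, right)) / size(array_union(left, right))
```

With `spark.implicits._` in scope it could be used as `input.select($"movieId1", $"movieId2", jaccardColumn($"genreList1", $"genreList2").as("jaccardcoeff"))`. Keeping the computation in built-in expressions lets the optimizer see it, which a UDF does not allow.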