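I'm a beginner, and I'm trying to look up specific information across several lists of data that have been converted into two separate DataFrames.
The two DataFrames are: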
Users:
user_id | item_id
-----------------
1       | 123
2       | 223
3       | 423
2       | 1223
1       | 3213

item_Details:
item_id | item_name
-------------------
123     | phone
223     | game
423     | foo
1223    | bar
3213    | foobar
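I need to find all pairs of users that have more than 50 items in common, sorted by the number of common items. There must be no duplicates, meaning there can be only one pair for userId 1 and userId 2.
The result needs to look like this: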
user_id1 | user_id2 | count_of_items | list_of_items
-------------------------------------------------------------
1 | 2 | 51 | phone,foo,bar,foobar
Answer 0 (score: 1)
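Here is one approach: build the item pairs for each distinct user pair with a self-join, derive the common items from those item pairs with a UDF, and then filter the result by the required common item count, as shown below: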
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
// Outside spark-shell, also import spark.implicits._ (needed for toDF and the $"..." syntax)
val users = Seq(
(1, 123), (1, 223), (1, 423),
(2, 123), (2, 423), (2, 1223), (2, 3213),
(3, 223), (3, 423), (3, 1223), (3, 3213),
(4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")
val item_details = Seq(
(123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")
val commonItems = udf( (itemPairs: Seq[Row]) =>
itemPairs.collect{ case Row(a: Int, b: Int) if a == b => a }
)
val commonLimit = 2 // Replace this with any specific common item count
val user_common_items =
users.as("u1").join(users.as("u2"), $"u1.user_id" < $"u2.user_id").
groupBy($"u1.user_id", $"u2.user_id").agg(
collect_set(
struct($"u1.item_id".as("ui1"), $"u2.item_id".as("ui2"))
).as("item_pairs")).
withColumn("common_items", commonItems($"item_pairs")).
drop("item_pairs").
where(size($"common_items") > commonLimit)
user_common_items.show(false)
// +-------+-------+-----------------+
// |user_id|user_id|common_items |
// +-------+-------+-----------------+
// |2 |3 |[423, 3213, 1223]|
// |2 |4 |[3213, 123, 1223]|
// +-------+-------+-----------------+
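To make the item_pairs step easier to follow, here is a minimal plain-Scala sketch (no Spark involved) of what happens for a single user pair. The names `itemsOfUser2`, `itemsOfUser3`, `itemPairs` and `common` are only illustrative; the data is the sample data for users 2 and 3 from above.

```scala
// Plain-Scala model of the item_pairs / commonItems step for one user pair.
// Item ids of users 2 and 3, copied from the sample data.
val itemsOfUser2 = Seq(123, 423, 1223, 3213)
val itemsOfUser3 = Seq(223, 423, 1223, 3213)

// The self-join on u1.user_id < u2.user_id has no item condition, so every
// item of user 2 is paired with every item of user 3 (a cross product).
val itemPairs: Seq[(Int, Int)] =
  for (a <- itemsOfUser2; b <- itemsOfUser3) yield (a, b)

// Same filter as the UDF: keep the item id only when both sides match.
val common: Seq[Int] = itemPairs.collect { case (a, b) if a == b => a }

println(itemPairs.size)  // 16
println(common.sorted)   // List(423, 1223, 3213)
```

The number of pairs collected per user pair is the product of the two item counts, which is worth keeping in mind when users own many items, as they would with a threshold of 50.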
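If you need the common item names rather than the item IDs, you can join item_details in the step above and aggregate the item names; alternatively, you can explode the existing common item ids, join item_details, and aggregate per user pair with collect_list: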
user_common_items.
withColumn("item_id", explode($"common_items")).
join(item_details, Seq("item_id")).
groupBy($"u1.user_id", $"u2.user_id").agg(collect_list($"item_name").as("common_items")).
withColumn("item_count", size($"common_items")).
show
// +-------+-------+--------------------+----------+
// |user_id|user_id| common_items|item_count|
// +-------+-------+--------------------+----------+
// | 2| 3| [foo, foobar, bar]| 3|
// | 2| 4|[foobar, phone, bar]| 3|
// +-------+-------+--------------------+----------+
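The first option mentioned above (joining item_details before aggregating) is not shown in the answer, so the following is only a sketch of how it might look. It is written as a script for spark-shell or a worksheet. The UDF `commonItemNames`, the aliases `d`, `user_id1`, `user_id2` and the column `item_triples` are hypothetical names introduced here, and the local SparkSession is an assumption for experimentation.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("common-item-names").getOrCreate()
import spark.implicits._

val users = Seq(
  (1, 123), (1, 223), (1, 423),
  (2, 123), (2, 423), (2, 1223), (2, 3213),
  (3, 223), (3, 423), (3, 1223), (3, 3213),
  (4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")

val item_details = Seq(
  (123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")

// Hypothetical variant of commonItems: each struct also carries the name of
// the u1 item, and the name is kept when both item ids match.
val commonItemNames = udf((triples: Seq[Row]) =>
  triples.collect { case Row(a: Int, b: Int, name: String) if a == b => name }
)

val commonLimit = 2 // same illustrative threshold as above

val user_common_names =
  users.as("u1").join(users.as("u2"), $"u1.user_id" < $"u2.user_id").
    join(item_details.as("d"), $"u1.item_id" === $"d.item_id").
    groupBy($"u1.user_id".as("user_id1"), $"u2.user_id".as("user_id2")).
    agg(collect_set(
      struct($"u1.item_id".as("ui1"), $"u2.item_id".as("ui2"), $"d.item_name".as("name"))
    ).as("item_triples")).
    withColumn("common_items", commonItemNames($"item_triples")).
    drop("item_triples").
    where(size($"common_items") > commonLimit)

user_common_names.show(false)
```

This avoids the explode step, at the cost of carrying the item name through every item pair.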
Answer 1 (score: 1)
Another solution, without a UDF. Since we only need the common items, the matching can be done in the joinExprs itself. Check this out:
val users = Seq(
(1, 123), (1, 223), (1, 423),
(2, 123), (2, 423), (2, 1223), (2, 3213),
(3, 223), (3, 423), (3, 1223), (3, 3213),
(4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")
val items = Seq(
(123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")
val common_items =
  users.as("t1")
    .join(users.as("t2"), $"t1.user_id" < $"t2.user_id" and $"t1.item_id" === $"t2.item_id")
    .join(items.as("it"), $"t1.item_id" === $"it.item_id", "inner")
    .groupBy($"t1.user_id", $"t2.user_id")
    .agg(collect_set($"item_name").as("items"))
    .filter(size($"items") > 2) // change the threshold here
    .withColumn("size", size($"items"))
common_items.show(false)
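Neither answer produces exactly the layout asked for in the question (one row per pair, sorted by count, with a comma-separated item list). A minimal sketch of how the join above could be shaped that way, written as a script for spark-shell or a worksheet. The names `report`, `minCommon` and `names` are hypothetical, the threshold of 2 stands in for the 50 in the question, and the local SparkSession is an assumption for experimentation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimentation; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("common-items-report").getOrCreate()
import spark.implicits._

val users = Seq(
  (1, 123), (1, 223), (1, 423),
  (2, 123), (2, 423), (2, 1223), (2, 3213),
  (3, 223), (3, 423), (3, 1223), (3, 3213),
  (4, 123), (4, 1223), (4, 3213)
).toDF("user_id", "item_id")

val items = Seq(
  (123, "phone"), (223, "game"), (423, "foo"), (1223, "bar"), (3213, "foobar")
).toDF("item_id", "item_name")

val minCommon = 2 // illustrative threshold; the question asks for 50

val report =
  users.as("t1").
    join(users.as("t2"), $"t1.user_id" < $"t2.user_id" && $"t1.item_id" === $"t2.item_id").
    join(items.as("it"), $"t1.item_id" === $"it.item_id").
    groupBy($"t1.user_id".as("user_id1"), $"t2.user_id".as("user_id2")).
    agg(sort_array(collect_set($"it.item_name")).as("names")).
    where(size($"names") > minCommon).
    select(
      $"user_id1",
      $"user_id2",
      size($"names").as("count_of_items"),
      concat_ws(",", $"names").as("list_of_items")
    ).
    orderBy(desc("count_of_items"), $"user_id1", $"user_id2")

report.show(false)
```

Aliasing the two grouping columns avoids ending up with two columns that are both called user_id. Note that collect_set deduplicates by item name, so if two different item ids shared a name the count would be lower than the number of common item ids.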