I have an RDD in the following format:
[(1,
  (Rating(user=1, product=3, rating=0.99),
   Rating(user=1, product=4, rating=0.91),
   Rating(user=1, product=9, rating=0.68))),
 (2,
  (Rating(user=2, product=11, rating=1.01),
   Rating(user=2, product=12, rating=0.98),
   Rating(user=2, product=45, rating=0.97))),
 (3,
  (Rating(user=3, product=23, rating=1.01),
   Rating(user=3, product=34, rating=0.99),
   Rating(user=3, product=45, rating=0.98)))]
I haven't been able to find any examples of handling this kind of named data with map, lambda, etc. Ideally, I'd like the output to be a DataFrame in the following format:
User Ratings
1 3,0.99|4,0.91|9,0.68
2 11,1.01|12,0.98|45,0.97
3 23,1.01|34,0.99|45,0.98
Any pointers would be greatly appreciated. Note that the number of ratings is variable, not just 3.
Answer 0 (score: 1)
With the RDD defined as
from pyspark.mllib.recommendation import Rating

rdd = sc.parallelize([
    (1,
     (Rating(user=1, product=3, rating=0.99),
      Rating(user=1, product=4, rating=0.91),
      Rating(user=1, product=9, rating=0.68))),
    (2,
     (Rating(user=2, product=11, rating=1.01),
      Rating(user=2, product=12, rating=0.98),
      Rating(user=2, product=45, rating=0.97))),
    (3,
     (Rating(user=3, product=23, rating=1.01),
      Rating(user=3, product=34, rating=0.99),
      Rating(user=3, product=45, rating=0.98)))])
you can use mapValues with list:
df = rdd.mapValues(list).toDF(["User", "Ratings"])
df.printSchema()
# root
# |-- User: long (nullable = true)
# |-- Ratings: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- user: long (nullable = true)
# | | |-- product: long (nullable = true)
# | | |-- rating: double (nullable = true)
or provide a schema:
df = spark.createDataFrame(rdd, "struct<User:long,ratings:array<struct<user:long,product:long,rating:double>>>")
df.printSchema()
# root
# |-- User: long (nullable = true)
# |-- ratings: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- user: long (nullable = true)
# | | |-- product: long (nullable = true)
# | | |-- rating: double (nullable = true)
#
df.show()
# +----+--------------------+
# |User| ratings|
# +----+--------------------+
# | 1|[[1,3,0.99], [1,4...|
# | 2|[[2,11,1.01], [2,...|
# | 3|[[3,23,1.01], [3,...|
# +----+--------------------+
If you want to drop the user field:
df_without_user = spark.createDataFrame(
    rdd.mapValues(lambda xs: [x[1:] for x in xs]),
    "struct<User:long,ratings:array<struct<product:long,rating:double>>>"
)
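The x[1:] slice works because Rating is a namedtuple, so slicing it yields a plain tuple with the leading user field dropped. A minimal local sketch (using a stand-in namedtuple for Rating, so it runs without a Spark cluster):

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating, which is a namedtuple
Rating = namedtuple("Rating", ["user", "product", "rating"])

r = Rating(user=1, product=3, rating=0.99)

# Slicing a namedtuple drops the field names and the sliced-off positions
without_user = r[1:]
print(without_user)  # (3, 0.99)
```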
If you want to format the column as a single string, you'll have to use a udf:
from pyspark.sql.functions import udf

@udf
def format_ratings(ratings):
    return "|".join(",".join(str(_) for _ in r[1:]) for r in ratings)
df.withColumn("ratings", format_ratings("ratings")).show(3, False)
# +----+-----------------------+
# |User|ratings |
# +----+-----------------------+
# |1 |3,0.99|4,0.91|9,0.68 |
# |2 |11,1.01|12,0.98|45,0.97|
# |3 |23,1.01|34,0.99|45,0.98|
# +----+-----------------------+
How the "magic" works:
Iterate over the sequence of ratings:
(... for r in ratings)
For each rating, drop the first field and convert the remaining ones to str:
(str(_) for _ in r[1:])
Join the fields within a rating with a "," separator:
",".join(str(_) for _ in r[1:])
Join the ratings with a "|" separator:
"|".join(",".join(str(_) for _ in r[1:]) for r in ratings)
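The steps above can be exercised locally without Spark by running the udf's body as a plain expression (again with a namedtuple stand-in for Rating):

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating
Rating = namedtuple("Rating", ["user", "product", "rating"])

ratings = [
    Rating(user=1, product=3, rating=0.99),
    Rating(user=1, product=4, rating=0.91),
    Rating(user=1, product=9, rating=0.68),
]

# The body of format_ratings, outside the udf wrapper
formatted = "|".join(",".join(str(_) for _ in r[1:]) for r in ratings)
print(formatted)  # 3,0.99|4,0.91|9,0.68
```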
Alternative implementation:
@udf
def format_ratings(ratings):
    return "|".join("{},{}".format(r.product, r.rating) for r in ratings)
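Both versions produce the same string for these ratings; the attribute-based one is just more explicit about which fields it uses. A quick local check (namedtuple stand-in once more):

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating
Rating = namedtuple("Rating", ["user", "product", "rating"])

ratings = [Rating(2, 11, 1.01), Rating(2, 12, 0.98), Rating(2, 45, 0.97)]

# Slice-based formatting vs. attribute-based formatting
by_slice = "|".join(",".join(str(_) for _ in r[1:]) for r in ratings)
by_attr = "|".join("{},{}".format(r.product, r.rating) for r in ratings)
print(by_slice == by_attr)  # True
```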