I have two DataFrames, recommendations and movies. The columns rec1-rec3 in recommendations hold movie IDs from the movies DataFrame.
val recommendations: DataFrame = List(
  (0, 1, 2, 3),
  (1, 2, 3, 4),
  (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
  (1, "the Lord of the Rings"),
  (2, "Star Wars"),
  (3, "Star Trek"),
  (4, "Pulp Fiction")).toDF("id", "name")
What I want:
+---+---------------------+---------+------------+
| id|                 rec1|     rec2|        rec3|
+---+---------------------+---------+------------+
|  0|the Lord of the Rings|Star Wars|   Star Trek|
|  1|            Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the Rings|Star Trek|Pulp Fiction|
+---+---------------------+---------+------------+
Answer 0 (score: 5)
We can also use the functions stack() and pivot() to get the expected output while joining the two DataFrames only once.
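import org.apache.spark.sql.functions.{col, expr, first}

// Rename 'id' to 'ids' first to avoid duplicate column names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")

recommendations
  // Unpivot rec1..rec3 into one (rec, movie_id) row per recommendation
  .select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .where("rec is not null")
  // A single join resolves every movie_id to its name
  .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
  // Pivot back to one row per id with columns rec1..rec3
  .groupBy("id")
  .pivot("rec")
  .agg(first("name"))
  .show()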
+---+--------------------+---------+------------+
| id|                rec1|     rec2|        rec3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
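To see what stack() contributes before the pivot, the sketch below runs only the unpivot step on the question's data. The local SparkSession setup and the name `unpivoted` are assumptions made for illustration. It is written as a script (spark-shell or a .sc file), not as a compiled application.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumed local session for illustration; spark-shell already provides one
val spark = SparkSession.builder()
  .appName("stack-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val recommendations = List(
  (0, 1, 2, 3),
  (1, 2, 3, 4),
  (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

// stack(3, ...) emits three (rec, movie_id) rows per input row,
// turning the wide rec1..rec3 columns into a long format
val unpivoted = recommendations.select(
  $"id",
  expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))

unpivoted.orderBy("id", "rec").show()
```

Each id now appears three times, once per rec column, which is why a single join on movie_id is enough to look up every name before pivot() folds the rows back into one row per id.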
Answer 1 (score: 1)
I figured it out. You should create aliases for the columns, just as you would in SQL.
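The query itself is not included in this answer. A minimal sketch of what the aliasing could look like follows, assuming each rec column is resolved by its own join against a renamed copy of movies. The helper `lookup` and the intermediate column names id1-id3 are hypothetical; n1-n3 are chosen to match the column names in this answer's result table. It is written as a script (spark-shell or a .sc file).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumed local session for illustration; spark-shell already provides one
val spark = SparkSession.builder()
  .appName("alias-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val recommendations = List(
  (0, 1, 2, 3),
  (1, 2, 3, 4),
  (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
  (1, "the Lord of the Rings"),
  (2, "Star Wars"),
  (3, "Star Trek"),
  (4, "Pulp Fiction")).toDF("id", "name")

// Hypothetical helper: a copy of movies whose columns are aliased to
// id<i> and n<i>, so repeated joins never produce ambiguous names
def lookup(i: Int): DataFrame =
  movies.select(col("id").as(s"id$i"), col("name").as(s"n$i"))

// One inner join per rec column; a recommendation without a matching
// movie would drop its whole row (a left join would keep it)
val joined = recommendations
  .join(lookup(1), col("rec1") === col("id1"))
  .join(lookup(2), col("rec2") === col("id2"))
  .join(lookup(3), col("rec3") === col("id3"))
  .select("id", "n1", "n2", "n3")
  .orderBy("id")

joined.show()
```

Compared with the stack()/pivot() approach, this performs one join per recommendation column, which stays readable for three columns but scales less gracefully as the number of rec columns grows.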
The query results in:
+---+--------------------+---------+------------+
| id|                  n1|       n2|          n3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
related = {}
unrelated = {}
new_rel = {}
with open('filename.csv') as f:
    for line in f:  # iterate until EOF is reached
        l = line.rstrip().split(',')  # remove the newline character, then split on commas
        key, value = int(l[1]), int(l[2])  # use int keys consistently
        if int(l[0]) == 1:  # pair relation is 1
            if key not in related:  # key is not present yet
                related[key] = []  # create the key with an empty list
            related[key].append(value)  # append number to list
        else:  # pair relation is not 1
            if key not in unrelated:
                unrelated[key] = []
            unrelated[key].append(value)