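How to join multiple columns of one DataFrame with another DataFrame

Date: 2018-01-29 08:27:48

Tags: scala apache-spark apache-spark-sql

I have two DataFrames, recommendations and movies. The columns rec1-rec3 in recommendations hold movie IDs from the movies DataFrame.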

val recommendations: DataFrame = List(
        (0, 1, 2, 3),
        (1, 2, 3, 4),
        (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
        (1, "the Lord of the Rings"),
        (2, "Star Wars"),
        (3, "Star Trek"),
        (4, "Pulp Fiction")).toDF("id", "name")

What I want:

+---+------------------------+------------+------------+
| id|                    rec1|        rec2|        rec3|
+---+------------------------+------------+------------+
|  0|   the Lord of the Rings|   Star Wars|   Star Trek|
|  1|               Star Wars|   Star Trek|Pulp Fiction|
|  2|   the Lord of the Rings|   Star Trek|Pulp Fiction|
+---+------------------------+------------+------------+

2 answers:

Answer 0 (score: 5)

We can also use the functions stack() and pivot() to arrive at the expected output, joining the two DataFrames only once.

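import org.apache.spark.sql.functions.{col, expr, first}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named `spark` is in scope

// First rename the 'id' column to 'ids' to avoid duplicate column names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")

// Unpivot rec1..rec3 into (rec, movie_id) rows, join once, then pivot back to one row per id
recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .where("rec is not null")
  .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
  .groupBy("id")
  .pivot("rec")
  .agg(first("name"))
  .show()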
+---+--------------------+---------+------------+
| id|                rec1|     rec2|        rec3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
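To make the stack() step easier to follow, here is a minimal sketch of just the unpivot, run on the question's data before any join. The object name `StackSketch`, the helper `unpivot`, and the local SparkSession settings are illustrative assumptions, not part of the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

object StackSketch {
  // Hypothetical helper: turns the three columns rec1..rec3 into three rows per id,
  // each carrying the source column name (rec) and the movie ID it held (movie_id).
  def unpivot(recommendations: DataFrame): DataFrame =
    recommendations.select(
      col("id"),
      expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder().master("local[*]").appName("stack-sketch").getOrCreate()
    import spark.implicits._

    val recommendations = List(
      (0, 1, 2, 3),
      (1, 2, 3, 4),
      (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

    // The row with id = 0 becomes three rows: (0, rec1, 1), (0, rec2, 2), (0, rec3, 3)
    unpivot(recommendations).show()

    spark.stop()
  }
}
```

In this long format a single join on movie_id is enough, and pivot("rec") then turns the rec values back into columns.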

Answer 1 (score: 1)

I figured it out. You should create aliases for the columns, just as you would in SQL. The query produces:

+---+--------------------+---------+------------+
| id|                  n1|       n2|          n3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
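The query behind this answer is not shown above. As a rough sketch of the aliasing idea, one possibility is to join movies once per recommendation column, renaming its columns each time so the three copies do not collide. The output names n1-n3 come from the table above; the object name `AliasJoinSketch`, the helper `withTitles`, and the intermediate names id1-id3 are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasJoinSketch {
  // Hypothetical helper: one aliased copy of `movies` per rec column, then three joins.
  def withTitles(recommendations: DataFrame, movies: DataFrame): DataFrame = {
    val m1 = movies.select(col("id").as("id1"), col("name").as("n1"))
    val m2 = movies.select(col("id").as("id2"), col("name").as("n2"))
    val m3 = movies.select(col("id").as("id3"), col("name").as("n3"))

    recommendations
      .join(m1, col("rec1") === col("id1"))
      .join(m2, col("rec2") === col("id2"))
      .join(m3, col("rec3") === col("id3"))
      .select("id", "n1", "n2", "n3") // keep only the recommendation id and the titles
      .orderBy("id")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only
    val spark = SparkSession.builder().master("local[*]").appName("alias-join-sketch").getOrCreate()
    import spark.implicits._

    val recommendations = List(
      (0, 1, 2, 3),
      (1, 2, 3, 4),
      (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

    val movies = List(
      (1, "the Lord of the Rings"),
      (2, "Star Wars"),
      (3, "Star Trek"),
      (4, "Pulp Fiction")).toDF("id", "name")

    withTitles(recommendations, movies).show(false)

    spark.stop()
  }
}
```

These are inner joins, so a recommendation row whose movie ID is missing from movies would be dropped; passing "left" as the join type would keep such rows with a null title. Compared with the stack()/pivot() approach, this version needs one join per recommendation column.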

related = {}
unrelated = {}
new_rel = {}

with open('filename.csv') as f:
    for line in f:                          # iterate until EOF is reached
        l = line.rstrip('\n').split(',')    # remove the newline character, then split on commas
        key, value = int(l[1]), int(l[2])   # use int keys consistently for lookup and insertion
        if int(l[0]) == 1:                  # pair relation is 1
            related.setdefault(key, []).append(value)    # create the list on first use, then append
        else:                               # pair relation is not 1
            unrelated.setdefault(key, []).append(value)