Spark上的Scala [自连接后过滤掉重复的行]

时间:2019-01-25 10:47:57

标签: python scala apache-spark pyspark user-defined-functions

我尝试使用python过滤数据

|name_x | age_x | salary_x | name_y | age_y | salary_y | age_diff
| James | 23    | 200000   | Jack   | 24    | 210040   |  1
| Jack  | 24    | 210040   | James  | 23    | 200000   |  1
| Irene | 25    | 200012   | John   | 25    | 210000   |  0
| Johny | 26    | 21090    | Elon   | 29    | 210012   |  3
| Josh  | 24    | 21090    | David  | 23    | 213012   |  1
| John  | 25    | 210000   | Irene  | 25    | 200012   |  0

行1 行2 也是重复项 第3行第6行
重复,作为
name_x == name_y age_x == age_y salary_x == salary_y
并且不考虑 age_diff ,即< strong>输出。   我需要将它们过滤掉,[重复行之一]。

需要最终输出为:如下所示,过滤掉重复项

|name_x | age_x | salary_x | name_y | age_y | salary_y | age_diff
| James | 23    | 200000   | Jack   | 24    | 210040   |  1
| Irene | 25    | 200012   | John   | 25    | 210000   |  0
| Johny | 26    | 21090    | Elon   | 29    | 210012   |  3
| Josh  | 24    | 21090    | David  | 23    | 213012   |  1

在python上实现了以下功能,它返回重复项的索引,而且速度太慢。

def duplicate_index(df):
    length = len(df.columns) - 1 # -1 for removing the time difference
    length = length//2
    nrows = df.shape[0]
    duplicate_index = [] 
    for row in range(nrows-1):
        count  = 0
        for frow in range(row+1,nrows):
            if (list(df.iloc[row][:length]) == list(df.iloc[frow][length:-1])):
                if (list(df.iloc[row][length:-1]) == list(df.iloc[frow][:length])):
                    duplicate_index.append(frow)
                    #print(row, frow)
                    count = count + 1
            if count == 1:
                break
    return duplicate_index
del_index = duplicate_index(df)
final_df  = df.drop(index = del_index)

但是现在我不得不在Scala上使用spark进行这些操作,是否存在,采用任何更快的方法来处理这些过滤器或类似python中的 shift 。或Scala上的窗口

2 个答案:

答案 0 :(得分:2)

您可以将附加条件添加到仅保留两行之一的联接中,例如name_x

示例数据框:

  val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
    Seq(
      Row(1, "James",  1, 10),
      Row(1, "Jack",   2, 20),
      Row(2, "Tom",    3, 30),
      Row(2, "Eva",    4, 40)
    )
  )

  val schema: StructType = new StructType()
    .add(StructField("id",      IntegerType,  false))
    .add(StructField("name",    StringType,  false))
    .add(StructField("age",     IntegerType, false))
    .add(StructField("salary",  IntegerType, false))

  val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)

  df0.sort("id").show()

哪个给:

+---+-----+---+------+
| id| name|age|salary|
+---+-----+---+------+
|  1|James|  1|    10|
|  1| Jack|  2|    20|
|  2|  Eva|  4|    40|
|  2|  Tom|  3|    30|
+---+-----+---+------+

重命名数据框的列:

val df1 = df0.columns.foldLeft(df0)((acc, x) => acc.withColumnRenamed(x, x+"_x"))
val df2 = df0.columns.foldLeft(df0)((acc, x) => acc.withColumnRenamed(x, x+"_y"))

然后加入以下三个条件:

val df3 = df1.join(df2,
    col("id_x") === col("id_y") and
    col("name_x") =!= col("name_y") and
    col("name_x") < col("name_y"),
    "inner")
df3.show()

返回

+----+------+-----+--------+----+------+-----+--------+                         
|id_x|name_x|age_x|salary_x|id_y|name_y|age_y|salary_y|
+----+------+-----+--------+----+------+-----+--------+
|   1|  Jack|    2|      20|   1| James|    1|      10|
|   2|   Eva|    4|      40|   2|   Tom|    3|      30|
+----+------+-----+--------+----+------+-----+--------+

根据在数据中定义重复项的方式,区分两个重复项的条件会有所不同。

答案 1 :(得分:1)

我认为astro_asz的答案是更简洁的方法,但是出于完整性考虑,以下是使用窗口的方法:

编辑:我更改了数据集,以使两个人具有相同的名称,并根据每行的内容添加了唯一的ID

val people = Seq(
  ("1", "James", 23, 200000),
  ("1", "James", 24, 210040),  // two people with same name
  ("2", "Irene", 25, 200012),
  ("2", "John",  25, 210000),
  ("3", "Johny", 26,  21090),
  ("3", "Elon",  29, 200000),
  ("4", "Josh",  24, 200000),
  ("4", "David", 23, 200000))

val columns = Seq("ID", "name", "age", "salary")
val df = people.toDF(columns:_*)

// In general you want to use the primary key from the underlying data store
// as your unique keys.  If for some weird reason the primary key is not 
// available or does not exist, you can try to create your own.  This
// is fraught with danger.  If you are willing to make the (dangerous)
// assumption a unique row is enough to uniquely identify the entity in
// that row, you can use a md5 hash of the contents of the row to create
// your id
val withKey = df.withColumn("key", md5(concat(columns.map(c => col(c)):_*)))

val x = withKey.toDF(withKey.columns.map(c => if (c == "ID") c else "x_" + c):_*)
val y = withKey.toDF(withKey.columns.map(c => if (c == "ID") c else "y_" + c):_*)

val partition = Window.partitionBy("ID").orderBy("x_key")
val df2 = x.join(y, Seq("ID"))
  .where('x_key =!= 'y_key)
  .withColumn("rank", rank over partition)
  .where('rank === 1)
  .drop("rank", "x_key", "y_key")

df2.show
/*-+------+-----+--------+------+-----+--------+                         
|ID|x_name|x_age|x_salary|y_name|y_age|y_salary|
+--+------+-----+--------+------+-----+--------+
| 3|  Elon|   29|  200000| Johny|   26|   21090|
| 1| James|   24|  210040| James|   23|  200000|
| 4| David|   23|  200000|  Josh|   24|  200000|
| 2| Irene|   25|  200012|  John|   25|  210000|
+--+------+-----+--------+------+-----+-------*/