Pyspark dataframe: remove cumulative pairs from a pyspark dataframe

Posted: 2019-05-18 18:36:17

Tags: python pyspark

I want to remove pairs with the same id and keep only one row of each pair in the dataframe.

I also can't simply drop duplicates by 'id', because I may have multiple combinations for the same 'id' that are not cumulative pairs. Example: I tried the following in Python (pandas), but I'm not sure how to do it in PySpark. Any help is appreciated.

m_f_1['value'] = m_f_1.apply(lambda x: str(x['value_x']) + str(x['value_y']) if x['value_x'] > x['value_y'] else str(x['value_y']) + str(x['value_x']), axis=1)
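For reference, here is a minimal, runnable pandas sketch of that idea; the small sample dataframe and the final drop_duplicates step are assumptions added for illustration, not part of the original attempt:

import pandas as pd

# made-up sample containing reversed pairs for the same id (assumption for illustration)
m_f_1 = pd.DataFrame({'id':      [100057, 100057, 100011, 100011],
                      'value_x': [38953993985, 38993095846, 38989281716, 38996868028],
                      'value_y': [38993095846, 38953993985, 38996868028, 38989281716]})

# order-independent key: larger value first, then the smaller one
m_f_1['value'] = m_f_1.apply(lambda x: str(x['value_x']) + str(x['value_y'])
                             if x['value_x'] > x['value_y']
                             else str(x['value_y']) + str(x['value_x']), axis=1)

# keep one row per key and drop the helper column
res = m_f_1.drop_duplicates(subset=['value']).drop(columns='value')
print(res)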

The input dataframe m_f_1 is:

  id     value.x       value.y 
 100057    38953993985    38993095846 
 100057    38993095845    38953993985  
 100057    38993095845    38993095846
 100057    38993095846    38953993985
 100011    38989281716    38996868028   
 100011    38996868028    38989281716  
 100019    38916115350    38994231881  
 100019    38994231881    38916115350 

The output should be:

head(res)

  id      value.x      value.y 
 100011    38989281716 38996868028 
 100019    38916115350 38994231881  
 100031    38911588267 38993358322 
 100057    38953993985 38993095846 
 100057    38993095845 38953993985  
 100057    38993095845 38993095846

2 Answers:

Answer 0 (Score: 2)

You can do this with pyspark.sql.functions: pyspark.sql.functions.greatest and pyspark.sql.functions.least take the larger and smaller of the two values respectively, and pyspark.sql.functions.concat concatenates them as strings.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [(100057,38953993985,38993095846)
    , (100057,38993095845,38953993985)
    , (100057,38993095845,38993095846)
    , (100057,38993095846,38953993985)
    , (100011,38989281716,38996868028)
    , (100011,38996868028,38989281716)
    , (100019,38916115350,38994231881)
    , (100019,38994231881,38916115350)]
m_f_1 = sqlContext.createDataFrame(data, schema=['id','value_x','value_y'])

# build an order-independent key: the larger value concatenated with the smaller one
m_f_1 = m_f_1.withColumn('value', F.concat(F.greatest('value_x','value_y').cast('string')
                                           ,F.least('value_x','value_y').cast('string')))
# keep one row per key, then drop the helper column and sort by id
m_f_1 = m_f_1.dropDuplicates(subset=['value']).drop('value').sort('id')
m_f_1.show(truncate=False)

+------+-----------+-----------+
|id    |value_x    |value_y    |
+------+-----------+-----------+
|100011|38989281716|38996868028|
|100019|38916115350|38994231881|
|100057|38993095845|38953993985|
|100057|38953993985|38993095846|
|100057|38993095845|38993095846|
+------+-----------+-----------+
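As a design note, the string key works here because all values have the same number of digits. A variant that avoids any chance of concatenation collisions uses a sorted array as the deduplication key instead; this is a sketch, not part of the original answer, and it reuses the data and sqlContext defined above:

# variant sketch: use a sorted array instead of a concatenated string as the key
m_f_1 = sqlContext.createDataFrame(data, schema=['id','value_x','value_y'])
m_f_1 = m_f_1.withColumn('pair', F.sort_array(F.array('value_x', 'value_y')))
m_f_1 = m_f_1.dropDuplicates(subset=['id', 'pair']).drop('pair').sort('id')
m_f_1.show(truncate=False)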

Answer 1 (Score: 1)

This should work even if you need uniqueness across more than two columns; see the sketch after the output below.

df = spark.createDataFrame([(100057,38953993985,38993095846),
                            (100057,38993095845,38953993985),
                            (100057,38993095845,38993095846),
                            (100057,38993095846,38953993985),
                            (100011,38989281716,38996868028),
                            (100011,38996868028,38989281716),
                            (100019,38916115350,38994231881),
                            (100019,38994231881,38916115350)],
                           ['id','value_x','value_y'])


from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# return the pair in ascending order so reversed pairs map to the same key;
# LongType is used because the values exceed the 32-bit integer range
def list_sort(x, y):
  return sorted([x, y])

udf_list_sort = udf(list_sort, ArrayType(LongType()))

spark.udf.register("udf_list_sort", udf_list_sort)

df1 = df.selectExpr("id", "udf_list_sort(value_x, value_y) as value_x_y").distinct()

df1.selectExpr("id AS id",
               "value_x_y[0] AS value_x",
               "value_x_y[1] AS value_y").show()

#+------+-----------+-----------+
#|    id|    value_x|    value_y|
#+------+-----------+-----------+
#|100019|38916115350|38994231881|
#|100011|38989281716|38996868028|
#|100057|38953993985|38993095846|
#|100057|38953993985|38993095845|
#|100057|38993095845|38993095846|
#+------+-----------+-----------+
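To make the claim about more than two columns concrete, here is a hedged sketch; the three-column dataframe, its values, and the column names are invented for illustration:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, LongType

# hypothetical three-column example (made-up data)
df3 = spark.createDataFrame([(1, 30, 10, 20), (1, 20, 30, 10), (2, 5, 7, 6)],
                            ['id', 'value_x', 'value_y', 'value_z'])

# sort all three values so any permutation of the same triple maps to the same array
udf_sort3 = udf(lambda x, y, z: sorted([x, y, z]), ArrayType(LongType()))

res = (df3.select('id', udf_sort3('value_x', 'value_y', 'value_z').alias('values'))
          .distinct()
          .select('id',
                  col('values')[0].alias('value_x'),
                  col('values')[1].alias('value_y'),
                  col('values')[2].alias('value_z')))
res.show()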