I want to remove duplicate pairs that share the same id, keeping only one row of each pair in the dataframe.
I also cannot simply drop duplicates by 'id', because the same 'id' can have several combinations that are not mirrored pairs. Example: I tried the following in Python (pandas), but I am not sure how to do the same in PySpark. Any help is appreciated.
m_f_1['value'] = m_f_1.apply(
    lambda x: str(x['value_x']) + str(x['value_y'])
              if x['value_x'] > x['value_y']
              else str(x['value_y']) + str(x['value_x']),
    axis=1)
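In pandas the next step would then be to drop duplicates on that key, roughly like the sketch below (the exact subset to deduplicate on is an assumption):
# Keep one row per id + unordered pair, then discard the helper key column.
m_f_1 = m_f_1.drop_duplicates(subset=['id', 'value']).drop(columns=['value'])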
The input dataframe m_f_1 is:
id value.x value.y
100057 38953993985 38993095846
100057 38993095845 38953993985
100057 38993095845 38993095846
100057 38993095846 38953993985
100011 38989281716 38996868028
100011 38996868028 38989281716
100019 38916115350 38994231881
100019 38994231881 38916115350
The output should be:
head(res)
id value.x value.y
100011 38989281716 38996868028
100019 38916115350 38994231881
100031 38911588267 38993358322
100057 38953993985 38993095846
100057 38993095845 38953993985
100057 38993095845 38993095846
Answer 0 (score: 2)
You can do this with pyspark.sql.functions: pyspark.sql.functions.greatest and pyspark.sql.functions.least take the row-wise maximum and minimum of the two columns, and pyspark.sql.functions.concat joins the resulting strings into a single key.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()

data = [(100057, 38953993985, 38993095846),
        (100057, 38993095845, 38953993985),
        (100057, 38993095845, 38993095846),
        (100057, 38993095846, 38953993985),
        (100011, 38989281716, 38996868028),
        (100011, 38996868028, 38989281716),
        (100019, 38916115350, 38994231881),
        (100019, 38994231881, 38916115350)]
m_f_1 = sqlContext.createDataFrame(data, schema=['id', 'value_x', 'value_y'])

# Build an order-independent key: the greater value first, the smaller value second,
# both cast to string and concatenated.
m_f_1 = m_f_1.withColumn('value', F.concat(F.greatest('value_x', 'value_y').cast('string'),
                                           F.least('value_x', 'value_y').cast('string')))

# Keep the first row seen for each key, then discard the helper column.
m_f_1 = m_f_1.dropDuplicates(subset=['value']).drop('value').sort('id')
m_f_1.show(truncate=False)
+------+-----------+-----------+
|id |value_x |value_y |
+------+-----------+-----------+
|100011|38989281716|38996868028|
|100019|38916115350|38994231881|
|100057|38993095845|38953993985|
|100057|38953993985|38993095846|
|100057|38993095845|38993095846|
+------+-----------+-----------+
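One thing to note about this approach (not part of the original answer): the concatenated key contains no separator and does not include id, so in principle two different pairs, or the same pair under two different ids, could collapse into one row. A minimal sketch that instead deduplicates on the ordered columns directly, applied to m_f_1 as first created above (before the dropDuplicates step); the helper column names v_hi and v_lo are just illustrative:
import pyspark.sql.functions as F

# Order each pair with greatest/least and drop duplicates on the columns
# themselves (including 'id'), so no string key is needed.
res = (m_f_1
       .withColumn('v_hi', F.greatest('value_x', 'value_y'))
       .withColumn('v_lo', F.least('value_x', 'value_y'))
       .dropDuplicates(['id', 'v_hi', 'v_lo'])
       .drop('v_hi', 'v_lo')
       .sort('id'))
res.show(truncate=False)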
Answer 1 (score: 1)
This should work even if you want uniqueness across more than two columns.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

df = spark.createDataFrame(
    [(100057, 38953993985, 38993095846), (100057, 38993095845, 38953993985),
     (100057, 38993095845, 38993095846), (100057, 38993095846, 38953993985),
     (100011, 38989281716, 38996868028), (100011, 38996868028, 38989281716),
     (100019, 38916115350, 38994231881), (100019, 38994231881, 38916115350)],
    ['id', 'value_x', 'value_y'])

# Sort each (value_x, value_y) pair so that mirrored pairs become identical rows.
# LongType is required here: the 11-digit values overflow a 32-bit IntegerType.
def list_sort(x, y):
    return sorted([x, y])

udf_list_sort = udf(list_sort, ArrayType(LongType()))
spark.udf.register("udf_list_sort", udf_list_sort)

# Collapse mirrored pairs with distinct(), then split the sorted pair back into two columns.
df1 = df.selectExpr("id", "udf_list_sort(value_x, value_y) AS value_x_y").distinct()
df1.selectExpr("id AS id",
               "value_x_y[0] AS value_x",
               "value_x_y[1] AS value_y").show()
#+------+-----------+-----------+
#|    id|    value_x|    value_y|
#+------+-----------+-----------+
#|100019|38916115350|38994231881|
#|100011|38989281716|38996868028|
#|100057|38953993985|38993095846|
#|100057|38953993985|38993095845|
#|100057|38993095845|38993095846|
#+------+-----------+-----------+
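For completeness, a UDF-free sketch of the same idea using the built-in array and sort_array functions (the res variable name is just illustrative); Python UDFs add serialization overhead, so built-ins are usually preferable when they suffice:
import pyspark.sql.functions as F

# Sort the pair with built-in functions, drop mirrored duplicates,
# then split the sorted pair back into two columns.
res = (df
       .withColumn('pair', F.sort_array(F.array('value_x', 'value_y')))
       .dropDuplicates(['id', 'pair'])
       .select('id',
               F.col('pair')[0].alias('value_x'),
               F.col('pair')[1].alias('value_y')))
res.show(truncate=False)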