Merge duplicate records into a single record in a pyspark dataframe

Date: 2018-12-21 08:47:58

Tags: python-2.7 pyspark apache-spark-sql

I have a dataframe with duplicate rows, and I would like to merge them into a single record that carries all of the distinct column values.

A sample of my code:

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ", "", ""),
     ("81A01", "TERR NAME 01", "", "NY", ""),
     ("81A01", "TERR NAME 01", "", "", "LA"),
     ("81A02", "TERR NAME 01", "CA", "", ""),
     ("81A02", "TERR NAME 01", "", "", "NY")],
    ["zip_code", "territory_name", "state", "state1", "state2"])

The resulting dataframe looks like this:

df1.show()
+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|      |      |
|   81A01|  TERR NAME 01|     |    NY|      |
|   81A01|  TERR NAME 01|     |      |    LA|
|   81A02|  TERR NAME 01|   CA|      |      |
|   81A02|  TERR NAME 01|     |      |    NY|
+--------+--------------+-----+------+------+

I need to merge/combine the duplicate records based on zip_code and get all the distinct state values in a single row.

Expected result:

+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
|   81A02|  TERR NAME 01|   CA|      |    NY|
+--------+--------------+-----+------+------+

I am new to pyspark and not sure how to use grouping/joins here. Can someone please help with the code?

2 answers:

Answer 0 (score: 2)

If you are sure there is only ever one state, one state1, and one state2 per zip_code/territory_name combination, you can use the code below. Wherever a group contains a non-empty string, the max function will pick it, because a non-empty string compares higher (by character code) than the empty string "".

from pyspark.sql.types import *
from pyspark.sql.functions import *

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ", "", ""),
     ("81A01", "TERR NAME 01", "", "NY", ""),
     ("81A01", "TERR NAME 01", "", "", "LA"),
     ("81A02", "TERR NAME 01", "CA", "", ""),
     ("81A02", "TERR NAME 01", "", "", "NY")],
    ["zip_code", "territory_name", "state", "state1", "state2"])

df1.groupBy("zip_code", "territory_name") \
   .agg(max("state").alias("state"),
        max("state1").alias("state1"),
        max("state2").alias("state2")) \
   .show()

Result:

+--------+--------------+-----+------+------+
|zip_code|territory_name|state|state1|state2|
+--------+--------------+-----+------+------+
|   81A02|  TERR NAME 01|   CA|      |    NY|
|   81A01|  TERR NAME 01|   NJ|    NY|    LA|
+--------+--------------+-----+------+------+

Answer 1 (score: 1)

Note: for any unique record of zip_code and territory_name, if there are multiple entries under any of the state columns, they will be concatenated.

Some explanation: in this code I used RDDs. I first split each record into two tuples, where tuple1 is the key and tuple2 is the value. tuple1 holds (zip_code, territory_name) and tuple2 holds the 3 state columns. tuple1 is taken as the key because we want to group by the distinct values of zip_code and territory_name. So every distinct pair like (81A01, TERR NAME 01) and (81A02, TERR NAME 01) is a key, on which we reduce. Reduce means taking two values at a time, applying some operation to them, then repeating the same operation on that result and the next element, until the whole tuple is exhausted.

So reducing (1,2,3,4,5) with the + operation goes: 1+2=3, then 3+3=6, then 6+4=10, and finally 10+5=15. Since the tuple ends at 5, the result is 15. That is how reduce works with the + operation. Because here we have strings instead of numbers, concatenation takes place: A+B=AB.