Python: group by 2 columns, but fetch the records where a different column varies

Time: 2019-08-20 09:03:57

Tags: python python-2.7 apache-spark-sql pyspark-sql

I have a dataframe with 3 columns: ZIP_CODE, TERR_NAME, STATE. For a given ZIP_CODE and TERR_NAME there can only be one STATE code. Duplicate records are allowed, but there should be no records with the same ZIP_CODE/TERR_NAME and two different STATEs. How can I fetch those erroneous records?

I tried grouping by ZIP_CODE/TERR_NAME/STATE, but I can't figure out how to fetch these erroneous records.

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ"), ("81A01", "TERR NAME 01", "CA"),
     ("81A02", "TERR NAME 02", "NY"), ("81A03", "TERR NAME 03", "NY"),
     ("81A03", "TERR NAME 03", "CA"), ("81A04", "TERR NAME 04", "FL"),
     ("81A05", "TERR NAME 05", "NJ"), ("81A06", "TERR NAME 06", "CA"),
     ("81A06", "TERR NAME 06", "CA")],
    ["zip_code", "territory_name", "state"])

df1.createOrReplaceTempView("df1_temp")
+--------+--------------+-----+ 
|zip_code|territory_name|state| 
+--------+--------------+-----+ 
| 81A01| TERR NAME 01| NJ| 
| 81A01| TERR NAME 01| CA| 
| 81A02| TERR NAME 02| NY| 
| 81A03| TERR NAME 03| NY| 
| 81A03| TERR NAME 03| CA| 
| 81A04| TERR NAME 04| FL| 
| 81A05| TERR NAME 05| NJ| 
| 81A06| TERR NAME 06| CA| 
| 81A06| TERR NAME 06| CA|
+--------+--------------+-----+

Using spark.sql(), I need a dataframe without those codes, i.e. without 81A01 and 81A03, which have the same zip code and territory name but different state codes.

Expected new DF:

+--------+--------------+-----+ 
|zip_code|territory_name|state| 
+--------+--------------+-----+ 
| 81A02| TERR NAME 02| NY| 
| 81A04| TERR NAME 04| FL| 
| 81A05| TERR NAME 05| NJ| 
| 81A06| TERR NAME 06| CA| 
| 81A06| TERR NAME 06| CA|
+--------+--------------+-----+

Excluded zip codes:

+--------+--------------+-----+ 
|zip_code|territory_name|state| 
+--------+--------------+-----+ 
| 81A01| TERR NAME 01| NJ| 
| 81A01| TERR NAME 01| CA| 
| 81A03| TERR NAME 03| NY| 
| 81A03| TERR NAME 03| CA| 
+--------+--------------+-----+

Thanks.

4 Answers:

Answer 0 (score: 1)

import pandas as pd

data = {
    "zip_code": ["81A01", "81A01", "81A02", "81A03", "81A03", "81A04", "81A05",
                 "81A06", "81A06"],
    "territory_name": ["TERR NAME 01", "TERR NAME 01", "TERR NAME 02",
                       "TERR NAME 03", "TERR NAME 03", "TERR NAME 04",
                       "TERR NAME 05", "TERR NAME 06", "TERR NAME 06"],
    "state": ["NJ", "CA", "NY", "NY", "CA", "FL", "NJ", "CA", "CA"]
}
df = pd.DataFrame(data)

# For every row, collect the index tuple of all rows sharing its
# zip_code/territory_name; set() removes the repeated tuples.
duplicate = list(set(
    tuple(df[(df["zip_code"] == df["zip_code"][i]) &
             (df["territory_name"] == df["territory_name"][i])].index)
    for i in range(len(df))
))

# Drop both rows of any pair whose states disagree
# (this assumes at most two rows per zip_code/territory_name group).
for i in duplicate:
    if len(i) > 1:
        if not df["state"][i[0]] == df["state"][i[1]]:
            df = df.drop(i[0])
            df = df.drop(i[1])
print(df)
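
As a shorter alternative (just a sketch, using df as originally built from data above, before any rows are dropped), groupby(...).transform("nunique") splits the frame into clean and conflicting rows directly:

# Number of distinct states per zip_code/territory_name group,
# broadcast back to every row of that group.
n_states = df.groupby(["zip_code", "territory_name"])["state"].transform("nunique")

clean = df[n_states == 1]      # groups with a single, consistent state
conflicts = df[n_states > 1]   # same zip/territory but different states
print(clean)
print(conflicts)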

Answer 1 (score: 0)

# Print the key of every zip_code/territory_name group that has more than one row
# (note: this also flags exact duplicates such as 81A06).
for key, group_df in df.groupby(["zip_code", "territory_name"]):
    if len(group_df) > 1:
        print(key)
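
If you only want the keys whose states actually disagree (a small refinement sketch, assuming the pandas df from the previous answer), check the number of distinct states per group instead of the row count:

for key, group_df in df.groupby(["zip_code", "territory_name"]):
    # more than one distinct state for the same zip/territory means bad data
    if group_df["state"].nunique() > 1:
        print(key)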

Hope the above code solves your problem.

Answer 2 (score: 0)

I found the solution myself and am posting it here in case it is useful to others:

spark.sql("SELECT zip_code, territory_name, COUNT(distinct state) as COUNT FROM df1_temp GROUP BY zip_code, territory_name having COUNT>1").show()

+--------+--------------+-----+ 
|zip_code|territory_name|COUNT| 
+--------+--------------+-----+ 
| 81A03| TERR NAME 03| 2| 
| 81A01| TERR NAME 01| 2| 
+--------+--------------+-----+
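
To then get the dataframe without those offending zip codes (a follow-up sketch against the same df1_temp view), a LEFT ANTI JOIN on that aggregate should work:

spark.sql("""
    SELECT t.*
    FROM df1_temp t
    LEFT ANTI JOIN (
        SELECT zip_code, territory_name
        FROM df1_temp
        GROUP BY zip_code, territory_name
        HAVING COUNT(DISTINCT state) > 1
    ) bad
    ON t.zip_code = bad.zip_code AND t.territory_name = bad.territory_name
""").show()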

Thanks

Answer 3 (score: 0)


Using PySpark: here is the code snippet you need.

from pyspark.sql.functions import *
from pyspark.sql.window import Window

df1 = sqlContext.createDataFrame(
    [("81A01", "TERR NAME 01", "NJ"), ("81A01", "TERR NAME 01", "CA"),
     ("81A02", "TERR NAME 02", "NY"), ("81A03", "TERR NAME 03", "NY"),
     ("81A03", "TERR NAME 03", "CA"), ("81A04", "TERR NAME 04", "FL"),
     ("81A05", "TERR NAME 05", "NJ"), ("81A06", "TERR NAME 06", "CA"),
     ("81A06", "TERR NAME 06", "CA")],
    ["zip_code", "territory_name", "state"])

# collect_set gathers the distinct states per zip_code/territory_name group;
# keep only the rows whose group has exactly one distinct state.
df1_v1 = (df1
          .withColumn("avg", collect_set("state").over(
              Window.partitionBy("zip_code", "territory_name").orderBy("zip_code")))
          .filter(size(col("avg")) == 1)
          .orderBy(col("zip_code"))
          .drop(col("avg")))

df1_v1.show()
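
For reference, an equivalent result without a window function (a sketch against the same df1) counts the distinct states per group and joins the single-state keys back:

from pyspark.sql.functions import countDistinct, col

# zip/territory combinations that map to exactly one state
good_keys = (df1.groupBy("zip_code", "territory_name")
                .agg(countDistinct("state").alias("n_states"))
                .filter(col("n_states") == 1)
                .drop("n_states"))

# keep only the rows belonging to those combinations
df1.join(good_keys, ["zip_code", "territory_name"], "inner").orderBy("zip_code").show()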

Let me know if you run into any issues with this, and if it solves your purpose, please accept the answer.
