Find duplicates in one column and non-duplicates in another

Time: 2018-11-16 21:04:30

Tags: python pandas

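I'm struggling to take a dataset and output a result that finds duplicate values in one column and non-duplicate values in another. If column 0 and column 2 are both exact duplicates, I don't care about those rows; I only care when rows sharing a value in column 0 have more than one distinct value in column 2. When that is the case, I want all the rows that match on column 0.

I first use concat to narrow the dataset down to the rows that have duplicates. My problem now is getting only the rows where column 2 differs.

My sample dataset is: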

Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

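From this set, I want the result to be the last three rows, the ones with "33333", because they have more than one kind of value in column 2. "11111" only ever matches Prod_P, so I don't care about it.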

import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
    if row["Pattern_or_URI"] in ignorelist:
        continue
    ignorelist.append(row["Pattern_or_URI"])
    # testdup = pd.concat(h for _, h in (data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]).groupby("Partition") if len(h) > 1)
    # print(data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])])
    newrow = data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]

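If I uncomment the line that tries to find entries with more than one row per "Partition" within the same pattern, I get the error ValueError: No objects to concatenate. I know it gets through the first iteration, because I can see the output when the print statement is uncommented.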
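For reference, pd.concat raises this ValueError when it is handed an empty sequence, so the error would appear whenever no Partition group for the current pattern has more than one row. With the eight sample rows shown above, both 11111 and 33333 have a Prod_P group with more than one row, so the failure presumably comes from other rows in the full file. A minimal sketch of that situation, using a made-up two-row frame (the pattern 44444 is hypothetical and not part of the sample data):

```python
import pandas as pd

# Hypothetical frame: one pattern, two partitions, exactly one row in each.
data = pd.DataFrame({
    "Pattern_or_URI": ["44444", "44444"],
    "Partition": ["Prod_P", "Dummy_P"],
})

# Same filter as the commented-out line: keep Partition groups with more than one row.
groups = [h for _, h in data.groupby("Partition") if len(h) > 1]

try:
    pd.concat(groups)
    error_message = None
except ValueError as exc:
    # No group passed the filter, so concat received nothing to combine.
    error_message = str(exc)

print(len(groups), error_message)
```

Note that this is exactly the case the question wants to keep (one pattern, several partitions), which suggests that filtering on group size is testing the wrong thing.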

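Is there an easier or better way to do this? I'm new to pandas and keep thinking there is probably a way to do this that I haven't figured out yet. Thank you.

Desired output: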

Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

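2 Answers:

Answer 0: (score: 2)

I think it is a bit misleading to say that you are looking for duplicates. This is really a grouping problem.

You want to find the groups of identical values in Pattern or URI that correspond to more than one unique value in the Partition series.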


transform + nunique

s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]

   Pattern or URI  Route Filter Clause Partition Pattern Usage  Owning Object Owning Object Partition  Cluster ID  Catalog Name  Route String      Device Name Device Description
5           33333                  NaN    Prod_P        Device          33333                  Prod_P         NaN           NaN           NaN  SEPFFFFCF87AAEA            device6
6           33333                  NaN   Dummy_P        Device          33333                 Dummy_P         NaN           NaN           NaN  SEPFFFF18FF65A0            device7
7           33333                  NaN    Prod_P        Device          33333                  Prod_P         NaN           NaN           NaN  SEPFFFFCFCCAABB            device8
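To make this reproducible, here is a self-contained sketch that loads the question's sample rows from an in-memory string (an assumption made for convenience; the question reads them from pandas-problem-data.txt) and applies the same transform + nunique mask. It also shows groupby(...).filter(...) as an alternative that should select the same rows, though it calls a Python function once per group.

```python
import io

import pandas as pd

# The sample rows from the question, inlined so the example runs on its own.
csv_text = """\
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
"""
df = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: True for every row whose pattern maps to more than one Partition.
multi = df.groupby("Pattern or URI")["Partition"].transform("nunique").gt(1)
result = df.loc[multi]

# Alternative: keep whole groups that pass a per-group test.
alt = df.groupby("Pattern or URI").filter(lambda g: g["Partition"].nunique() > 1)

print(result[["Pattern or URI", "Partition", "Device Name", "Device Description"]])
```

The transform version keeps the original index and column layout, so the mask can be combined with other conditions (for example on Pattern Usage) before indexing.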

Answer 1: (score: -1)

Use df.drop_duplicates() as follows:

df=pd.DataFrame({'a':[111,111,111,222,222,333,333,333], 
                 'b':['a','a','a','b','b','a','b','c'],
                 'c':[12,13,14,15,61,71,81,19]})
df

    a   b   c
0   111 a   12
1   111 a   13
2   111 a   14
3   222 b   15
4   222 b   61
5   333 a   71
6   333 b   81
7   333 c   19

df1=df.drop_duplicates(['a','b'],keep=False)

df1

    a   b   c
5   333 a   71
6   333 b   81
7   333 c   19

Note that instead of assigning the result to a new DataFrame, you can pass inplace=True to apply it to the original.
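One thing worth checking before relying on this: keep=False removes every row whose (a, b) pair occurs more than once. That matches the desired result above only because each pair inside the 333 group is distinct and every other pair repeats. The sketch below uses a made-up frame (its values are assumptions, not the question's data) in which one group repeats a pair, as 33333/Prod_P does in the question, and compares the result with a groupby mask.

```python
import pandas as pd

# Hypothetical data: group 333 has two distinct 'b' values, but the pair (333, 'p') repeats.
df = pd.DataFrame({'a': [111, 111, 333, 333, 333, 444],
                   'b': ['x', 'x', 'p', 'q', 'p', 'z'],
                   'c': [1, 2, 3, 4, 5, 6]})

# Drops all rows of any repeated (a, b) pair, including the two (333, 'p') rows.
dedup = df.drop_duplicates(['a', 'b'], keep=False)

# Keeps every row of a group whose 'b' column has more than one unique value.
grouped = df[df.groupby('a')['b'].transform('nunique').gt(1)]

print(dedup.index.tolist())    # rows kept by drop_duplicates
print(grouped.index.tolist())  # rows kept by the groupby mask
```

On this frame drop_duplicates keeps a single row of group 333 plus the lone 444 row, while the groupby mask keeps all three rows of group 333.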