I'm trying to take a dataset and produce an output that looks for duplicated values in one column and differing values in another. Say columns 0 and 2: if a value in column 0 is duplicated and every duplicate has the same value in column 2, I don't care about those rows. I only care when a duplicated column-0 value has entries with more than one distinct value in column 2, and when that happens I want all the rows matching that column-0 value.
I started by using concat to shrink the dataset down to just the rows with duplicates. My problem is now trying to get only the rows where column 2 differs.
My example dataset is:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
In this set I'd want the last three "33333" rows returned, because they have more than one distinct value in column 2. "11111" only ever matches Prod_P, so I don't care about it.
import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
    if row["Pattern_or_URI"] in ignorelist:
        continue
    ignorelist.append(row["Pattern_or_URI"])
    # testdup = pd.concat(h for _, h in (data.loc[data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]).groupby("Partition") if len(h) > 1)
    # print(data.loc[data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]])
    newrow = data.loc[data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
If I uncomment the line that tries to find entries with more than one Partition value for the same pattern, I get the error ValueError: No objects to concatenate. I know it gets through the first iteration, because the print statement works when it isn't commented out.
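For context on that error: pd.concat raises ValueError as soon as its iterable yields nothing, which happens here whenever a pattern's Partition groups all have a single row. A minimal sketch with invented toy data (not the asker's file):

```python
import pandas as pd

# A pattern whose Partition groups all have len <= 1: the generator yields nothing
df = pd.DataFrame({"Pattern_or_URI": ["22334"], "Partition": ["Prod_P"]})

try:
    pd.concat(g for _, g in df.groupby("Partition") if len(g) > 1)
except ValueError as err:
    print(err)  # No objects to concatenate
```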
Is there an easier or better way to do this? I'm new to pandas and keep thinking there must be an approach I haven't found yet. Thanks.
Desired output:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
Answer 0 (score: 2)
I think framing this as a search for duplicates is a bit misleading. It's really a grouping problem: you want to find groups of identical values in Pattern or URI that correspond to more than one unique value in the Partition series.

transform + nunique:
s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]
Pattern or URI Route Filter Clause Partition Pattern Usage Owning Object Owning Object Partition Cluster ID Catalog Name Route String Device Name Device Description
5 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCF87AAEA device6
6 33333 NaN Dummy_P Device 33333 Dummy_P NaN NaN NaN SEPFFFF18FF65A0 device7
7 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCFCCAABB device8
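To make the mask concrete, here is a minimal sketch on invented data with the same shape: transform('nunique') broadcasts each group's distinct-Partition count back onto every row of that group, and .gt(1) turns it into a row-level boolean filter.

```python
import pandas as pd

df = pd.DataFrame({
    "Pattern or URI": ["11111", "11111", "22334", "33333", "33333", "33333"],
    "Partition":      ["Prod_P", "Prod_P", "Prod_P", "Prod_P", "Dummy_P", "Prod_P"],
})

# Per-row count of distinct partitions within the row's pattern group, > 1
s = df.groupby("Pattern or URI")["Partition"].transform("nunique").gt(1)
print(s.tolist())                            # [False, False, False, True, True, True]
print(df.loc[s, "Pattern or URI"].unique())  # ['33333']
```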
Answer 1 (score: -1)
Using df.drop_duplicates() as follows:
df=pd.DataFrame({'a':[111,111,111,222,222,333,333,333],
'b':['a','a','a','b','b','a','b','c'],
'c':[12,13,14,15,61,71,81,19]})
df
a b c
0 111 a 12
1 111 a 13
2 111 a 14
3 222 b 15
4 222 b 61
5 333 a 71
6 333 b 81
7 333 c 19
df1=df.drop_duplicates(['a','b'],keep=False)
df1
a b c
5 333 a 71
6 333 b 81
7 333 c 19
Note: instead of assigning the result to a new DataFrame, you can pass inplace=True to apply it to the original.
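One caveat worth noting, sketched on the same invented frame: keep=False removes every member of a duplicated (a, b) pair, not just the extras, which is why all three 111/'a' rows vanish rather than one surviving.

```python
import pandas as pd

df = pd.DataFrame({'a': [111, 111, 111, 222, 222, 333, 333, 333],
                   'b': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'c'],
                   'c': [12, 13, 14, 15, 61, 71, 81, 19]})

# keep=False drops *all* rows sharing an (a, b) pair, not just the later copies
df.drop_duplicates(['a', 'b'], keep=False, inplace=True)
print(df['a'].tolist())  # [333, 333, 333]
```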