Need some help getting unique values from a pandas DataFrame
I have:
>>> df1
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
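The goal is to drop duplicate rows based on source + target or target + source, so that a pair counts as the same regardless of order. But I can't get this with drop_duplicates.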
>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
source target metric
0 acc1.yyy acx1.xxx 10000
1 acx1.xxx acc1.yyy 10000
[Update]
Perhaps "duplicate" is not the right word, so let me explain further.
id source target
0 bng1.xxx.00 bdr2.xxx.00
1 bng1.xxx.00 bdr1.xxx.00
2 bdr3.yyy.00 bdr3.xxx.00
3 bdr3.xxx.00 bdr3.yyy.00
4 bdr2.xxx.00 bng1.xxx.00
5 bdr1.xxx.00 bng1.xxx.00
In the table above, I want to remove entries where, for example, one row's source = another row's target and its target = that row's source.
0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair
The end goal is to keep rows 0, 1, 2 or rows 4, 5, 3.
Answer 0 (score: 1)
You need to sort the two columns within each row first:
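# result_type='broadcast' keeps the original shape, so each sorted pair maps back onto both columns
df1[['source','target']] = df1[['source','target']].apply(sorted, axis=1, result_type='broadcast')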
print (df1)
source target metric
0 acc1.yyy acx1.xxx 10000
1 acc1.yyy acx1.xxx 10000
df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
source target metric
0 acc1.yyy acx1.xxx 10000
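The sort-then-drop idea above can also be sketched without `apply`, as a self-contained script. This is only an illustration under the assumption that NumPy is available; `pair` is a name introduced here, and the frame is rebuilt from the question's two sample rows.

```python
import numpy as np
import pandas as pd

# Rebuild the two sample rows from the question.
df1 = pd.DataFrame({
    "source": ["acc1.yyy", "acx1.xxx"],
    "target": ["acx1.xxx", "acc1.yyy"],
    "metric": [10000, 10000],
})

# Sort the values within each row (axis=1) so that (a, b) and (b, a)
# end up as the same ordered pair.
pair = np.sort(df1[["source", "target"]].to_numpy(), axis=1)
df1[["source", "target"]] = pair

# Now both rows are identical on source/target, so one is dropped.
df2 = df1.drop_duplicates(subset=["source", "target"])
print(df2)
```

Note that this overwrites the original source and target values, just like the answer's version does.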
Edit:
It seems you need to change the source column by removing its last 3 characters:
df1['source1'] = df1.source.str[:-3]
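# result_type='broadcast' keeps the original shape, so each sorted pair maps back onto both columns
df1[['source1','target']] = df1[['source1','target']].apply(sorted, axis=1, result_type='broadcast')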
print (df1)
id source target source1
0 0 bng1.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00 bdr3.xxx.00
3 3 bdr3.xxx.00-00 bdr3.yyy.00 bdr3.xxx.00
4 4 bdr2.xxx.00-00 bng1.xxx.00 bdr2.xxx.00
5 5 bdr1.xxx.00-00 bng1.xxx.00 bdr1.xxx.00
df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
id source target
0 0 bng1.xxx.00-00 bng1.xxx.00
1 1 bng1.xxx.00-00 bng1.xxx.00
2 2 bdr3.yyy.00-00 bdr3.yyy.00
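One side effect of the approach above is that the target column gets overwritten by the sort. A possible alternative, sketched here under the assumption that source and target are directly comparable as in the table from the question's update, is to build an order-independent key separately and filter with a boolean mask. `pair_key`, `first` and `last` are names introduced for this sketch.

```python
import pandas as pd

# Data from the question's update (the id column is the default index here).
df = pd.DataFrame({
    "source": ["bng1.xxx.00", "bng1.xxx.00", "bdr3.yyy.00",
               "bdr3.xxx.00", "bdr2.xxx.00", "bdr1.xxx.00"],
    "target": ["bdr2.xxx.00", "bdr1.xxx.00", "bdr3.xxx.00",
               "bdr3.yyy.00", "bng1.xxx.00", "bng1.xxx.00"],
})

# A frozenset ignores order, so (a, b) and (b, a) produce equal keys.
pair_key = pd.Series(
    [frozenset(p) for p in zip(df["source"], df["target"])],
    index=df.index,
)

# keep="first" retains the first row of each pair, keep="last" the last one.
first = df[~pair_key.duplicated(keep="first")]
last = df[~pair_key.duplicated(keep="last")]
print(first)
print(last)
```

Because the key lives in its own Series, the original source and target values stay untouched, and switching `keep` chooses between the two acceptable results mentioned in the question.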
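Answer 1 (score: 0)
Your definition of a duplicate differs from the one pandas uses. In pandas, two rows are considered duplicates only if their corresponding entries are identical, column by column. In the example below, the first and second rows (index 0 and 1) are not duplicates because their corresponding columns hold different values, whereas the third and fourth rows (index 2 and 3) are duplicates.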
import pandas as pd

df = {'source':['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(df)
df
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
# 3 acc1.xxx acc1.yyy
df.drop_duplicates()
# source target
# 0 acc1.yyy acx1.xxx
# 1 acx1.xxx acc1.yyy
# 2 acc1.xxx acc1.yyy
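To see which rows pandas itself regards as duplicates, `DataFrame.duplicated` returns the boolean mask that `drop_duplicates` acts on. A minimal sketch using the same four rows (`mask` is a name introduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["acc1.yyy", "acx1.xxx", "acc1.xxx", "acc1.xxx"],
    "target": ["acx1.xxx", "acc1.yyy", "acc1.yyy", "acc1.yyy"],
})

# True marks a row that repeats an earlier row exactly, column by column.
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]
```

Only index 3 is flagged. Index 1 is not, because its source and target are swapped relative to index 0 rather than identical.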
For the case you describe, create a new column that holds the sorted tuple of the source and target values. Try the following:
df.loc[:, 'src_tgt'] = pd.Series([tuple(sorted(each)) for each in list(zip(df.source.values.tolist(), df.target.values.tolist()))])
df
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 1 acx1.xxx acc1.yyy (acc1.yyy, acx1.xxx)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
# 3 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
# source target src_tgt
# 0 acc1.yyy acx1.xxx (acc1.yyy, acx1.xxx)
# 2 acc1.xxx acc1.yyy (acc1.xxx, acc1.yyy)