Unique values in a pandas DataFrame

Time: 2017-03-16 13:01:05

Tags: python pandas

I need some help getting unique values from a pandas DataFrame.

I have:

>>> df1
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

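The goal is to remove duplicates based on source + target or target + source, so that a row and its reversed pair count as the same entry. But I can't get this with drop_duplicates.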

>>> df2 = df1.drop_duplicates(subset=['source','target'])
>>> df2
     source    target metric
0  acc1.yyy  acx1.xxx  10000
1  acx1.xxx  acc1.yyy  10000

[Update]

Maybe "duplicate" is not the right word, so let me explain further.

id  source  target
0   bng1.xxx.00 bdr2.xxx.00
1   bng1.xxx.00 bdr1.xxx.00
2   bdr3.yyy.00 bdr3.xxx.00
3   bdr3.xxx.00 bdr3.yyy.00
4   bdr2.xxx.00 bng1.xxx.00
5   bdr1.xxx.00 bng1.xxx.00

From the above, I want to remove entries where, for example, one row's source equals another row's target and its target equals that row's source.

0 and 4 = same pair
1 and 5 = same pair
2 and 3 = same pair

The end goal is to keep rows 0, 1, 2 or rows 4, 5, 3.

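2 answers:

Answer 0: (score: 1)

You need to sort the two columns row by row first, so that both orderings of a pair end up identical: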

df1[['source','target']] = df1[['source','target']].apply(sorted,axis=1)
print (df1)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000
1  acc1.yyy  acx1.xxx   10000

df2 = df1.drop_duplicates(subset=['source','target'])
print (df2)
     source    target  metric
0  acc1.yyy  acx1.xxx   10000
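
On recent pandas versions, `apply(sorted, axis=1)` may return a Series of lists rather than a DataFrame, in which case the assignment above can fail. The same idea can be sketched with `numpy.sort`, assuming both columns hold plain strings. The frame below only recreates the question's `df1`, and `pair` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Recreate the question's df1 (assumed from the printed output).
df1 = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy'],
    'metric': [10000, 10000],
})

# Sort each row's (source, target) pair so both orderings become identical.
pair = np.sort(df1[['source', 'target']].to_numpy(), axis=1)
df1[['source', 'target']] = pair

# With the pairs normalized, drop_duplicates sees the two rows as equal.
df2 = df1.drop_duplicates(subset=['source', 'target'])
print(df2)
```

Like the original answer, this overwrites `source` and `target` with the sorted values, so the direction of each pair is lost.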

Edit:

It seems the source column needs to be changed by removing its last 3 characters:

df1['source1'] = df1.source.str[:-3]
df1[['source1','target']] = df1[['source1','target']].apply(sorted,axis=1)
print (df1)
   id          source       target      source1
0   0  bng1.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00  bdr3.xxx.00
3   3  bdr3.xxx.00-00  bdr3.yyy.00  bdr3.xxx.00
4   4  bdr2.xxx.00-00  bng1.xxx.00  bdr2.xxx.00
5   5  bdr1.xxx.00-00  bng1.xxx.00  bdr1.xxx.00

df2 = df1.drop_duplicates(subset=['source1','target'])
df2 = df2.drop('source1', axis=1)
print (df2)
   id          source       target
0   0  bng1.xxx.00-00  bng1.xxx.00
1   1  bng1.xxx.00-00  bng1.xxx.00
2   2  bdr3.yyy.00-00  bdr3.yyy.00
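
The approach above overwrites `target` with the sorted values. If the original columns should stay untouched, one option is to build the sorted key separately and use `duplicated` as a row mask. This is a sketch that assumes the data from the question's update (without the extra `-00` suffix). The names `key`, `first` and `last` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Data as posted in the question's update.
df = pd.DataFrame({
    'source': ['bng1.xxx.00', 'bng1.xxx.00', 'bdr3.yyy.00',
               'bdr3.xxx.00', 'bdr2.xxx.00', 'bdr1.xxx.00'],
    'target': ['bdr2.xxx.00', 'bdr1.xxx.00', 'bdr3.xxx.00',
               'bdr3.yyy.00', 'bng1.xxx.00', 'bng1.xxx.00'],
})

# Order-independent key: each row's pair sorted alphabetically.
key = pd.DataFrame(np.sort(df[['source', 'target']].to_numpy(), axis=1),
                   index=df.index)

# keep='first' retains rows 0, 1, 2; keep='last' retains rows 3, 4, 5.
first = df[~key.duplicated(keep='first')]
last = df[~key.duplicated(keep='last')]
print(first)
print(last)
```

Both results keep the original `source` and `target` values, which matches the "keep 0 1 2 or 4 5 3" goal from the question.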

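Answer 1: (score: 0)

Your definition of a duplicate differs from the one pandas uses. In pandas, two rows are duplicates only if their corresponding entries are identical, column by column. In the example below, the first and second rows (index 0 and 1) are not duplicates because their corresponding columns hold different values, while the third and fourth rows (index 2 and 3) are duplicates.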

import pandas as pd

data = {'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'], 'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy']}
df = pd.DataFrame(data)
df
#      source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy
# 3  acc1.xxx  acc1.yyy
df.drop_duplicates()
#      source    target
# 0  acc1.yyy  acx1.xxx
# 1  acx1.xxx  acc1.yyy
# 2  acc1.xxx  acc1.yyy
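
To check which rows pandas itself treats as duplicates, `DataFrame.duplicated` returns a boolean mask. This small sketch rebuilds the same four-row frame; `mask` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy'],
})

# True marks a row whose values all match an earlier row, column by column.
mask = df.duplicated()
print(mask.tolist())
# [False, False, False, True]
```

Only index 3 is flagged. The reversed pair at index 1 is not, which is why plain `drop_duplicates` leaves it in place.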

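For the case you describe, create a new column holding the sorted tuple of the source and target values, then drop duplicates on that column. Try the following: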

df.loc[:, 'src_tgt'] = pd.Series([tuple(sorted(each)) for each in list(zip(df.source.values.tolist(), df.target.values.tolist()))])
df
#      source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 1  acx1.xxx  acc1.yyy  (acc1.yyy, acx1.xxx)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
# 3  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
df.drop_duplicates(subset=['src_tgt'])
#      source    target               src_tgt
# 0  acc1.yyy  acx1.xxx  (acc1.yyy, acx1.xxx)
# 2  acc1.xxx  acc1.yyy  (acc1.xxx, acc1.yyy)
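
The helper column can also be built with a plain list comprehension and removed once the duplicates are gone. This is a sketch under the same assumptions as above; `result` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'source': ['acc1.yyy', 'acx1.xxx', 'acc1.xxx', 'acc1.xxx'],
    'target': ['acx1.xxx', 'acc1.yyy', 'acc1.yyy', 'acc1.yyy'],
})

# Sorted tuple per row: (a, b) and (b, a) map to the same key.
df['src_tgt'] = [tuple(sorted(p)) for p in zip(df['source'], df['target'])]

# Deduplicate on the key, then drop the helper column again.
result = df.drop_duplicates(subset='src_tgt').drop(columns='src_tgt')
print(result)
```

Unlike sorting the columns in place, this keeps the original direction of each surviving pair.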