pandas - Aggregate/remove duplicate rows in a DataFrame based on swapped index levels - Thinbug

Aggregate/remove duplicate rows in a DataFrame based on swapped index levels

Time: 2016-02-08 16:30:48

Tags: pandas

Example input

import pandas as pd
df = pd.DataFrame([
        ['A', 'B', 1, 5],
        ['B', 'C', 2, 2],
        ['B', 'A', 1, 1],
        ['C', 'B', 1, 3]], 
        columns=['from', 'to', 'type', 'value']) 
df = df.set_index(['from', 'to', 'type'])

which looks like this:

                  value
from  to    type    
A     B     1     5
B     C     2     2
      A     1     1
C     B     1     3

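Goal

I now want to remove "duplicate" rows in the following sense: for each row with an arbitrary index (from, to, type), if a row (to, from, type) exists, the value of that second row should be added to the first row, and the second row should be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and then dropped, leading to the following desired result.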

Sample result

                  value
from  to  type
A     B   1       6
B     C   2       2
C     B   1       3

This is my best attempt so far. It feels unnecessarily verbose and clunky:

# aggregate val of rows with (from,to,type) == (to,from,type) 
df2 = df.reset_index()
df3 = df2.rename(columns={'from':'to', 'to':'from'})
df_both = df.join(df3.set_index(
                    ['from', 'to', 'type']), 
                    rsuffix='_b').sum(axis=1)

# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a,b,t in df_both.index:
    if (b,a,t) in df_both.index and not (b,a,t) in rows_to_keep:
        rows_to_keep.append((a,b,t))
        rows_to_remove.append((b,a,t))

df_final = df_both.drop(rows_to_remove)
df_final

The second "deduplication" step in particular feels very unpythonic. (How) can I improve these steps?

1 Answer:

Answer 0: (score: 1)

Not sure how much better this is, but it's certainly different

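  import pandas as pd
  from collections import Counter

  df = pd.DataFrame([
          ['A', 'B', 1, 5],
          ['B', 'C', 2, 2],
          ['B', 'A', 1, 1],
          ['C', 'B', 1, 3]],
          columns=['from', 'to', 'type', 'value'])
  df = df.set_index(['from', 'to', 'type'])

  # Repeat each (from, to, type) key `value` times; this relies on the
  # values being non-negative integers.
  ls2 = []
  for rec in df.to_records():
      frm, to, typ, value = list(rec)
      ls2.extend([(frm, to, typ)] * int(value))

  # Count the keys with from/to in sorted order, so (A, B) and (B, A) land
  # in the same bucket. Only the two labels are sorted: sorting the whole
  # tuple would compare str with int, which raises TypeError on Python 3.
  counted = Counter(tuple(sorted(key[:2])) + (key[2],) for key in ls2)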
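
If the values are not small non-negative integers, repeating keys no longer works. A possible alternative is to group on a direction-independent key and sum. The following is only a sketch under some assumptions: the helper columns `lo` and `hi` are hypothetical names introduced here, and keeping the orientation of the first row seen in each group is my reading of the desired result in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", "B", 1, 5],
     ["B", "C", 2, 2],
     ["B", "A", 1, 1],
     ["C", "B", 1, 3]],
    columns=["from", "to", "type", "value"],
).set_index(["from", "to", "type"])

flat = df.reset_index()

# Hypothetical helper columns: the two endpoints in sorted order, so that
# (A, B) and (B, A) share the same key regardless of direction.
flat["lo"] = [min(a, b) for a, b in zip(flat["from"], flat["to"])]
flat["hi"] = [max(a, b) for a, b in zip(flat["from"], flat["to"])]

result = (
    flat.groupby(["lo", "hi", "type"], sort=False, as_index=False)
        # Keep the from/to of the first row in each group, sum the values.
        .agg({"from": "first", "to": "first", "value": "sum"})
        .drop(columns=["lo", "hi"])
        .set_index(["from", "to", "type"])
)
print(result)
```

With `sort=False` the groups come out in order of first appearance, so (C, B, 1) is not flipped to (B, C, 1) even though its grouping key is sorted.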