import pandas as pd
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
which looks like this:
              value
from to type
A    B  1         5
B    C  2         2
     A  1         1
C    B  1         3
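I now want to remove "duplicate" rows in the following sense: for every row with an arbitrary index (from, to, type), if a row (to, from, type) also exists, the value of that second row should be added to the first row, and the second row should be dropped. In the example above, the row (B, A, 1) with value 1 should be added to the first row and removed, giving the following desired result.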
              value
from to type
A    B  1         6
B    C  2         2
C    B  1         3
This is my best attempt so far. It feels unnecessarily verbose and clunky:
# aggregate val of rows with (from,to,type) == (to,from,type)
df2 = df.reset_index()
df3 = df2.rename(columns={'from': 'to', 'to': 'from'})
df_both = df.join(df3.set_index(['from', 'to', 'type']),
                  rsuffix='_b').sum(axis=1)
# then remove the second, i.e. the (to,from,t) row
rows_to_keep = []
rows_to_remove = []
for a, b, t in df_both.index:
    if (b, a, t) in df_both.index and (b, a, t) not in rows_to_keep:
        rows_to_keep.append((a, b, t))
        rows_to_remove.append((b, a, t))
df_final = df_both.drop(rows_to_remove)
df_final
The second "deduplication" step in particular feels very clumsy. (How) can I improve these steps?
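One possible simplification, sketched here rather than offered as a definitive answer: give each row an order-insensitive key for its (from, to) pair and let a single group-by do both the summing and the dropping. The helper columns `lo` and `hi` are hypothetical names introduced only for this illustration, and the sketch assumes the first-seen orientation of each pair is the one to keep.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'B', 1, 5],
     ['B', 'C', 2, 2],
     ['B', 'A', 1, 1],
     ['C', 'B', 1, 3]],
    columns=['from', 'to', 'type', 'value'],
).set_index(['from', 'to', 'type'])

flat = df.reset_index()

# Hypothetical helper columns: the two endpoints in sorted order, so that
# (A, B) and (B, A) end up with the same (lo, hi) key.
pairs = [sorted(p) for p in zip(flat['from'], flat['to'])]
flat['lo'] = [p[0] for p in pairs]
flat['hi'] = [p[1] for p in pairs]

df_final = (
    # sort=False keeps groups in order of first appearance
    flat.groupby(['lo', 'hi', 'type'], sort=False, as_index=False)
        # keep the orientation of the first row in each group, sum the values
        .agg({'from': 'first', 'to': 'first', 'value': 'sum'})
        .set_index(['from', 'to', 'type'])[['value']]
)
print(df_final)
```

On the example frame this keeps (A, B, 1), (B, C, 2) and (C, B, 1) with values 6, 2 and 3. A row whose `from` equals its `to` is simply left as it is, since it only ever groups with itself.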
Answer 0 (score: 1)
Not sure how good this is, but it is certainly different:
import pandas as pd
from collections import Counter
df = pd.DataFrame([
['A', 'B', 1, 5],
['B', 'C', 2, 2],
['B', 'A', 1, 1],
['C', 'B', 1, 3]],
columns=['from', 'to', 'type', 'value'])
df = df.set_index(['from', 'to', 'type'])
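# each record is (from, to, type, value)
ls = list(df.to_records())
ls2 = []
for l in ls:
    i = 0
    # repeat the (from, to, type) key exactly `value` times
    # (`<` rather than `<=`, which would add one repetition too many)
    while i < l[3]:
        ls2.append(list(l)[:3])
        i += 1
# sort only (from, to) so that (B, A, t) and (A, B, t) share one key;
# sorting the whole entry would compare str with int and fail on Python 3
counted = Counter(tuple(sorted(entry[:2])) + (entry[2],) for entry in ls2)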
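If the aim is to end up with a DataFrame again, the Counter can also be filled with the values as weights instead of repeating each key, and then converted back. This is only a sketch under assumptions: it treats a sorted (from, to) order as acceptable in the result, so (C, B, 1) comes back as (B, C, 1), and `out` is an illustrative name.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    [['A', 'B', 1, 5],
     ['B', 'C', 2, 2],
     ['B', 'A', 1, 1],
     ['C', 'B', 1, 3]],
    columns=['from', 'to', 'type', 'value'],
).set_index(['from', 'to', 'type'])

counted = Counter()
for (src, dst, typ), value in df['value'].items():
    # Sort only the two endpoints so (B, A, 1) and (A, B, 1) share one key
    key = tuple(sorted((src, dst))) + (typ,)
    counted[key] += value

# Tuple keys become a three-level MultiIndex; name the levels like the original
out = (
    pd.Series(dict(counted), name='value')
      .rename_axis(['from', 'to', 'type'])
      .to_frame()
)
print(out)
```

Adding the values as weights avoids building a list whose length is the sum of all values, and it also works when the values are not integers.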