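I want to find matching pairs in my data based on multiple conditions. For each "Code B" row, I want to find a corresponding "Code A" row where the City, Start, and Days fields are equal. Once a pair is found, both rows should be marked as "Used", and neither row may be reused for another pair.
Starting DataFrame: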
City Code Start Days
ATL A 5/15/17 1
ATL A 5/15/17 1
ATL A 5/15/17 2
ATL A 5/16/17 1
ATL A 5/16/17 3
BOS A 5/15/17 1
ATL B 5/15/17 1
ATL B 5/15/17 2
ATL B 5/16/17 1
ATL B 5/16/17 1
Desired final DataFrame:
City Code Start Days Status
ATL A 5/15/17 1 Used
ATL A 5/15/17 1
ATL A 5/15/17 2 Used
ATL A 5/16/17 1 Used
ATL A 5/16/17 3
BOS A 5/15/17 1
ATL B 5/15/17 1 Used
ATL B 5/15/17 2 Used
ATL B 5/16/17 1 Used
ATL B 5/16/17 1
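I have been trying to use iterrows(), but I can't get it to work. I can't figure out how to assign the "Used" value to only one matching instance.
Answer 0 (score: 2)
This is clumsy and not optimized, but I have to go to lunch, so here you go: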
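# Number the duplicates within each (City, Code, Start, Days) group: 0, 1, 2, ...
d1 = df.assign(Count=df.groupby(df.columns.tolist()).cumcount())
d2 = d1.set_index(d1.columns.tolist()).assign(X=1)
# Keep only the (City, Start, Days, Count) slots that contain every Code.
# Series.compress was removed in pandas 1.0, so filter with .loc instead.
d3 = d2.X.unstack('Code', fill_value=0).all(axis=1).loc[lambda s: s].rename('Status')
d4 = d1.join(d3, on=['City', 'Start', 'Days', 'Count'])
# The positional axis argument of drop() is gone in pandas 2.0; use columns=.
d4.assign(Status=d4.Status.replace(True, 'Used').fillna('')).drop(columns='Count')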
City Code Start Days Status
0 ATL A 5/15/17 1 Used
1 ATL A 5/15/17 1
2 ATL A 5/15/17 2 Used
3 ATL A 5/16/17 1 Used
4 ATL A 5/16/17 3
5 BOS A 5/15/17 1
6 ATL B 5/15/17 1 Used
7 ATL B 5/15/17 2 Used
8 ATL B 5/16/17 1 Used
9 ATL B 5/16/17 1
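The idea behind the snippet above is that cumcount() numbers the duplicates of each row, so the first A lines up with the first B, the second A with the second B, and so on. A minimal sketch of the same idea, assuming the Code column only ever contains A and B (the names keys, rank, Rank, and n_codes are my own and purely illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 2],
        ['ATL', 'A', '5/16/17', 1],
        ['ATL', 'A', '5/16/17', 3],
        ['BOS', 'A', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 2],
        ['ATL', 'B', '5/16/17', 1],
        ['ATL', 'B', '5/16/17', 1],
    ],
    columns=['City', 'Code', 'Start', 'Days'],
)

keys = ['City', 'Start', 'Days']

# Rank duplicates per key and Code: the first A is 0, the second A is 1, ...
rank = df.groupby(keys + ['Code']).cumcount()

# Count the distinct codes sharing each (key, rank) slot.
# Assumption: only codes A and B exist, so 2 means the slot holds one of each.
n_codes = (
    df.assign(Rank=rank)
      .groupby(keys + ['Rank'])['Code']
      .transform('nunique')
)

df['Status'] = n_codes.eq(2).map({True: 'Used', False: ''})
print(df)
```

On the sample data this marks rows 0, 2, 3, 6, 7, and 8, which matches the desired output in the question.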
Answer 1 (score: 1)
I first use groupby to group by City, Start, and Days, then apply a function that marks the Used rows in each group (note that it's not yet optimized).
import pandas as pd
from itertools import chain

df = pd.DataFrame([
    ['ATL', 'A', '5/15/17', 1],
    ['ATL', 'A', '5/15/17', 1],
    ['ATL', 'A', '5/15/17', 2],
    ['ATL', 'A', '5/16/17', 1],
    ['ATL', 'A', '5/16/17', 3],
    ['BOS', 'A', '5/15/17', 1],
    ['ATL', 'B', '5/15/17', 1],
    ['ATL', 'B', '5/15/17', 2],
    ['ATL', 'B', '5/16/17', 1],
    ['ATL', 'B', '5/16/17', 1]],
    columns=['City', 'Code', 'Start', 'Days'])
df['Status'] = ''  # start with every row unmarked
Here is the function that marks the A and B rows:
def mark_used(gdf):
    # Number of complete A/B pairs available in this group. Using counts
    # instead of True/False flags handles groups with several pairs and
    # leaves groups that lack one of the two codes unmarked.
    n_pairs = min((gdf['Code'] == 'A').sum(), (gdf['Code'] == 'B').sum())
    used_a, used_b = 0, 0
    df_marked = []
    for _, row in gdf.iterrows():
        if row['Code'] == 'A' and used_a < n_pairs:
            row['Status'] = 'Used'
            used_a += 1
        elif row['Code'] == 'B' and used_b < n_pairs:
            row['Status'] = 'Used'
            used_b += 1
        df_marked.append(row)
    return df_marked
Then apply the function to each group of the DataFrame:
ls = [mark_used(gdf) for _, gdf in df.groupby(['City', 'Start', 'Days'])]
df_marked = pd.DataFrame(list(chain.from_iterable(ls)))
df_marked = df_marked.sort_index()  # restore the original row order
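If the row-by-row loop turns out to be slow on larger data, the per-group pairing can probably be expressed without iterrows(). The following is only a sketch under my own assumptions: the helper names (keys, counts, n_pairs, rank) are illustrative, and the small sample frame is made up to show a group that contains more than one pair.

```python
import pandas as pd

# Made-up sample: three A rows and two B rows share one key, plus a lone BOS row.
df = pd.DataFrame(
    [
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 1],
        ['BOS', 'A', '5/15/17', 1],
    ],
    columns=['City', 'Code', 'Start', 'Days'],
)

keys = ['City', 'Start', 'Days']

# Per-key counts of A rows and B rows, broadcast back onto every row.
counts = (
    df.assign(is_a=df['Code'].eq('A'), is_b=df['Code'].eq('B'))
      .groupby(keys)[['is_a', 'is_b']]
      .transform('sum')
)

# Each key can form as many pairs as the scarcer of the two codes allows.
n_pairs = counts.min(axis=1)

# Position of each row among rows with the same key and the same Code.
rank = df.groupby(keys + ['Code']).cumcount()

# Mark the first n_pairs A rows and the first n_pairs B rows of every key.
df['Status'] = ''
df.loc[df['Code'].isin(['A', 'B']) & (rank < n_pairs), 'Status'] = 'Used'
print(df)
```

Here the ATL key has three A rows but only two B rows, so two pairs are marked and the third A stays blank. The BOS row has no B partner and is left unmarked.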