Find the first matching pair in a DataFrame and mark both rows

Time: 2017-05-15 17:01:12

Tags: python python-3.x pandas

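I want to find matching pairs in my data under multiple conditions. For each Code 'B' row, I want to find one corresponding Code 'A' row where the City, Start, and Days fields are equal. Once a pair is found, both rows should be marked as 'Used', and neither row may be reused for another pair.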

Starting DataFrame:

City  Code  Start    Days
ATL   A     5/15/17  1
ATL   A     5/15/17  1
ATL   A     5/15/17  2
ATL   A     5/16/17  1
ATL   A     5/16/17  3
BOS   A     5/15/17  1
ATL   B     5/15/17  1
ATL   B     5/15/17  2
ATL   B     5/16/17  1
ATL   B     5/16/17  1

Final DataFrame:

City  Code  Start    Days Status
ATL   A     5/15/17  1    Used
ATL   A     5/15/17  1
ATL   A     5/15/17  2    Used
ATL   A     5/16/17  1    Used
ATL   A     5/16/17  3
BOS   A     5/15/17  1
ATL   B     5/15/17  1    Used
ATL   B     5/15/17  2    Used
ATL   B     5/16/17  1    Used
ATL   B     5/16/17  1

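I've been trying to use iterrows(), but I can't get it to work. I can't figure out how to assign the 'Used' value to only one of the matching instances.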
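For reference, here is a minimal sketch of how an explicit loop could mark exactly one match per 'B' row. The asker's own loop is not shown, so this is only an assumption about the intended logic: it writes through `df.at` using index labels (assigning to the `row` object yielded by `iterrows()` does not reliably change the DataFrame) and skips 'A' rows that are already marked.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 2],
        ['ATL', 'A', '5/16/17', 1],
        ['ATL', 'A', '5/16/17', 3],
        ['BOS', 'A', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 2],
        ['ATL', 'B', '5/16/17', 1],
        ['ATL', 'B', '5/16/17', 1],
    ],
    columns=['City', 'Code', 'Start', 'Days'],
)
df['Status'] = ''

# Iterate over a snapshot of the 'B' rows; look up candidates in the live df
for b_idx, b in df[df['Code'] == 'B'].iterrows():
    candidates = df[
        (df['Code'] == 'A')
        & (df['Status'] == '')          # not yet paired
        & (df['City'] == b['City'])
        & (df['Start'] == b['Start'])
        & (df['Days'] == b['Days'])
    ]
    if not candidates.empty:
        a_idx = candidates.index[0]     # first unused matching 'A' row
        df.at[a_idx, 'Status'] = 'Used'
        df.at[b_idx, 'Status'] = 'Used'

print(df)
```

This rescans the frame once per 'B' row, so it is easy to follow but slow on large data.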

2 answers:

Answer 0 (score: 2)

This is silly and not optimized... but I have to go to lunch, so here you go.

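# Number repeated rows so the n-th 'A' row pairs with the n-th 'B' row of the same key
d1 = df.assign(Count=df.groupby(df.columns.tolist()).cumcount())
d2 = d1.set_index(d1.columns.tolist()).assign(X=1)

f = lambda x: x.astype(bool)

# Keep only keys present for both codes
# (Series.compress was removed in pandas 1.0, so filter with .loc instead)
d3 = d2.X.unstack('Code', fill_value=0).all(axis=1).loc[f].rename('Status')

d4 = d1.join(d3, on=['City', 'Start', 'Days', 'Count'])

# pandas 2.0+ no longer accepts the axis positionally in drop()
d4.assign(Status=d4.Status.replace(True, 'Used').fillna('')).drop(columns='Count')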

  City Code    Start  Days Status
0  ATL    A  5/15/17     1   Used
1  ATL    A  5/15/17     1       
2  ATL    A  5/15/17     2   Used
3  ATL    A  5/16/17     1   Used
4  ATL    A  5/16/17     3       
5  BOS    A  5/15/17     1       
6  ATL    B  5/15/17     1   Used
7  ATL    B  5/15/17     2   Used
8  ATL    B  5/16/17     1   Used
9  ATL    B  5/16/17     1       
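The same pairing idea can be sketched a little more explicitly. This is not the answerer's code, just one possible variant under the assumption that only codes 'A' and 'B' take part in pairing: number each row within its (City, Start, Days, Code) combination, count how many pairs each (City, Start, Days) key can form, and mark the first rows of each code up to that count. The function name `mark_pairs` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 1],
        ['ATL', 'A', '5/15/17', 2],
        ['ATL', 'A', '5/16/17', 1],
        ['ATL', 'A', '5/16/17', 3],
        ['BOS', 'A', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 1],
        ['ATL', 'B', '5/15/17', 2],
        ['ATL', 'B', '5/16/17', 1],
        ['ATL', 'B', '5/16/17', 1],
    ],
    columns=['City', 'Code', 'Start', 'Days'],
)


def mark_pairs(frame, keys=('City', 'Start', 'Days')):
    """Hypothetical helper: mark A/B rows that can be paired one-to-one."""
    keys = list(keys)
    # 0, 1, 2, ... for repeated rows with the same key and the same code
    occurrence = frame.groupby(keys + ['Code']).cumcount()
    # Per-key counts of 'A' rows and 'B' rows, broadcast back to every row
    counts = (
        frame.assign(is_a=frame['Code'].eq('A'), is_b=frame['Code'].eq('B'))
        .groupby(keys)[['is_a', 'is_b']]
        .transform('sum')
    )
    # A key can form as many pairs as the scarcer of the two codes allows
    n_pairs = counts.min(axis=1)
    used = frame['Code'].isin(['A', 'B']) & (occurrence < n_pairs)
    return frame.assign(Status=used.map({True: 'Used', False: ''}))


result = mark_pairs(df)
print(result)
```

Because the pair count is taken per key, a key with two 'A' rows and two 'B' rows would have all four rows marked, while a key with only one code has none marked.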

Answer 1 (score: 1)

I first group by City, Start, and Days using groupby, then apply a function that marks the 'Used' rows within each group (note that it's not optimized yet).

import pandas as pd
from itertools import chain

df = pd.DataFrame([
    ['ATL', 'A', '5/15/17', 1],
    ['ATL', 'A', '5/15/17', 1],
    ['ATL', 'A', '5/15/17', 2],
    ['ATL', 'A', '5/16/17', 1],
    ['ATL', 'A', '5/16/17', 3],
    ['BOS', 'A', '5/15/17', 1],
    ['ATL', 'B', '5/15/17', 1],
    ['ATL', 'B', '5/15/17', 2],
    ['ATL', 'B', '5/16/17', 1],
    ['ATL', 'B', '5/16/17', 1]], 
    columns=['City', 'Code', 'Start', 'Days'])

df.loc[:, 'Status'] = ''

Here is the function that marks the 'A' and 'B' rows:

def mark_used(gdf):
    # Pairs available in this group: the smaller of the 'A' and 'B' row counts,
    # so a group containing only one code marks nothing
    n_pairs = min((gdf['Code'] == 'A').sum(), (gdf['Code'] == 'B').sum())
    seen = {'A': 0, 'B': 0}
    df_marked = []
    for _, row in gdf.iterrows():
        code = row['Code']
        # Mark the first n_pairs rows of each code; leave the rest untouched
        if code in seen and seen[code] < n_pairs:
            row['Status'] = 'Used'
            seen[code] += 1
        df_marked.append(row)
    return df_marked

Then apply the function to each group of the DataFrame:

ls = [mark_used(gdf) for gid, gdf in df.groupby(['City', 'Start', 'Days'])]
df_marked = pd.DataFrame(list(chain.from_iterable(ls)))
df_marked.sort_index()  # restore the original row order