Matching multiple column values within an ID

Time: 2019-02-26 14:54:42

Tags: python pandas

Sample DF:

ID     Match1        Match2        Match3     Match4       Match5
1      Yes           No            Yes        Yes          Yes
2      Yes           No            Yes        Yes          No
2      Yes           No            No         Yes          Yes
3      No            Yes           Yes        Yes          No
3      No            Yes           No         No           No
4      Yes           No            Yes        No           No
4      Yes           No            Yes        Yes          Yes
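
To make the sample reproducible, here is a minimal sketch that rebuilds the table above as a pandas DataFrame. The variable name `df` is an assumption chosen to match the name used in the answers; the values are copied from the rows shown.

```python
import pandas as pd

# Rebuild the sample table; each list holds one column, in row order.
df = pd.DataFrame({
    'ID':     [1, 2, 2, 3, 3, 4, 4],
    'Match1': ['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes'],
    'Match2': ['No', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
    'Match3': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes'],
    'Match4': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes'],
    'Match5': ['Yes', 'No', 'Yes', 'No', 'No', 'No', 'Yes'],
})
print(df)
```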

Expected DF:

ID     Match1        Match2        Match3     Match4       Match5       Final_Match
1      Yes           No            Yes        Yes          Yes          Clear
2      Yes           No            Yes        Yes          No           Unclear
2      Yes           No            No         Yes          Yes          Unclear
3      No            Yes           Yes        Yes          No           Clear
3      No            Yes           No         No           No           Unclear
4      Yes           No            Yes        No           No           Unclear
4      Yes           No            Yes        Yes          Yes          Clear

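Problem statement:

  1. If the ID is not duplicated, just put Clear in the Final_Match column (example: ID 1).
  2. If the ID is duplicated, count the Yes values across Match1 to Match5 for each row within the ID; the row with the larger count gets Clear and the other rows get Unclear (example: IDs 3 and 4).
  3. If the ID is duplicated and the rows within the ID have an equal number of Yes values across Match1 to Match5, put Unclear in all of those rows (example: ID 2).

I can't find a way to do this within each ID. Is there a solution?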

3 answers:

Answer 0: (score: 2)

Another approach:

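import numpy as np

# Count the 'Yes' values per row across Match1..Match5
df['sum_yes'] = df.iloc[:, 1:6].eq('Yes').sum(axis=1)
# 'Clear' only for a row that holds its group's maximum and is not tied with another row
df['final'] = df.groupby('ID')['sum_yes'].transform(
    lambda x: np.where((x == x.max()) & (~x.duplicated(keep=False)), 'Clear', 'Unclear'))
print(df)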

   ID Match1 Match2 Match3 Match4 Match5  sum_yes    final
0   1    Yes     No    Yes    Yes    Yes        4    Clear
1   2    Yes     No    Yes    Yes     No        3  Unclear
2   2    Yes     No     No    Yes    Yes        3  Unclear
3   3     No    Yes    Yes    Yes     No        3    Clear
4   3     No    Yes     No     No     No        1  Unclear
5   4    Yes     No    Yes     No     No        2  Unclear
6   4    Yes     No    Yes    Yes    Yes        4    Clear

P.S. You can drop the sum_yes column afterwards if you don't need it.

Answer 1: (score: 2)

You can also do this with GroupBy.rank:

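import numpy as np

# Helper Series: number of 'Yes' values per row (ID column excluded)
s = (df.replace({'Yes': 1, 'No': 0})
     .iloc[:, 1:]
     .sum(axis=1))

# Tied rows get an averaged rank (e.g. 1.5), so only a unique maximum has rank 1
df['final_match'] = np.where(s.groupby(df['ID']).rank(ascending=False).eq(1), 'Clear', 'Unclear')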
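
Why the rank approach handles ties: with the default `method='average'`, tied rows share the mean of the ranks they span, so two rows tied for first place both get 1.5 and neither equals 1. Here is a small sketch of that behaviour. The `ids` and `counts` Series are illustrative stand-ins holding the per-row Yes counts of the sample data, not part of the answer's code.

```python
import pandas as pd

# Illustrative inputs: the IDs and per-row 'Yes' counts of the sample rows.
ids = pd.Series([1, 2, 2, 3, 3, 4, 4], name='ID')
counts = pd.Series([4, 3, 3, 3, 1, 2, 4])

# Rank within each ID, highest count first; ties share an averaged rank.
ranks = counts.groupby(ids).rank(ascending=False)
print(ranks.tolist())   # [1.0, 1.5, 1.5, 1.0, 2.0, 2.0, 1.0]

# Only a rank of exactly 1 (a unique maximum) is labelled 'Clear'.
labels = ranks.eq(1).map({True: 'Clear', False: 'Unclear'})
print(labels.tolist())
```

The two ID 2 rows both get rank 1.5, which is why they end up Unclear without any special-casing.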

Answer 2: (score: 1)

Using pandas.DataFrame.groupby

final_match = []
for i, d in df.groupby('ID'):
    if len(d) == 1:
        final_match.append('Clear')
    else:
        counter = (d.filter(like='Match') == 'Yes').sum(1)
        if counter.nunique() == 1:
            final_match.extend(['Unclear'] * len(d))
        else:
            final_match.extend(counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist())
df['final_match'] = final_match

print(df)
   ID Match1 Match2 Match3 Match4 Match5 final_match
0   1    Yes     No    Yes    Yes    Yes       Clear
1   2    Yes     No    Yes    Yes     No     Unclear
2   2    Yes     No     No    Yes    Yes     Unclear
3   3     No    Yes    Yes    Yes     No       Clear
4   3     No    Yes     No     No     No     Unclear
5   4    Yes     No    Yes     No     No     Unclear
6   4    Yes     No    Yes    Yes    Yes       Clear

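Explanation:

  • len(d) == 1: if the ID is not duplicated, append Clear
  • counter = (d.filter(like='Match') == 'Yes').sum(1): count the number of Yes values in each row of the group
  • counter.nunique() == 1: if all rows have the same number of Yes values, mark every row as Unclear
  • counter.apply(lambda x: 'Clear' if x == max(counter) else 'Unclear').tolist(): if the counts differ, mark the row(s) with the highest count as Clear and the rest as Unclear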
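
One case the question does not cover is an ID with three or more rows where two rows tie for the highest count. The sketch below uses a hypothetical group with counts 4, 4, 1 to show that this loop labels both tied rows Clear, whereas the stricter `duplicated(keep=False)` check from answer 0 labels the whole group Unclear. Which result is wanted is an assumption to settle against your own data.

```python
import pandas as pd

# Hypothetical group: two rows tie for the highest count, a third row is lower.
counter = pd.Series([4, 4, 1])

# Rule used by the loop: every row equal to the maximum is 'Clear',
# unless all counts in the group are identical.
if counter.nunique() == 1:
    loop_labels = ['Unclear'] * len(counter)
else:
    loop_labels = ['Clear' if x == counter.max() else 'Unclear' for x in counter]

# Stricter rule: 'Clear' only if exactly one row holds the maximum.
is_unique_max = (counter == counter.max()) & ~counter.duplicated(keep=False)
strict_labels = is_unique_max.map({True: 'Clear', False: 'Unclear'}).tolist()

print(loop_labels)    # ['Clear', 'Clear', 'Unclear']
print(strict_labels)  # ['Unclear', 'Unclear', 'Unclear']
```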