Hello, I have a df such as:
Groups COL1
G1 Seq:1
G1 Seq:2
G1 Seq_1
G1 Seq:4
G2 Seq_2
G2 Seq_3
G2 Seq_4
G3 Seq:5
G3 Seq:6
G4 Seq:7
G4 Seq_5
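I would like to count how many groups contain only "Seq:" values, how many contain both "Seq:" and "Seq_" values, and how many contain only "Seq_" values.
Does anyone have an idea? I guess I should use re.sub
and then sum for each Groups in pandas?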
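For anyone who wants to reproduce the problem, here is a minimal sketch that rebuilds the example frame. It assumes both columns hold plain strings; the variable name df follows the question, everything else is taken from the table above.

```python
import pandas as pd

# Rebuild the example frame from the question (both columns assumed to be strings).
df = pd.DataFrame({
    "Groups": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G3", "G4", "G4"],
    "COL1": ["Seq:1", "Seq:2", "Seq_1", "Seq:4", "Seq_2", "Seq_3", "Seq_4",
             "Seq:5", "Seq:6", "Seq:7", "Seq_5"],
})

print(df)
```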
Answer 0: (score: 2)
Use Series.str.contains
to build a mask, then use numpy.setdiff1d
to compare the values filtered with DataFrame.loc,
using the mask or, with ~,
the inverted mask:
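import numpy as np

# True for rows whose COL1 value contains ':'
m = df['COL1'].str.contains(':')
# groups with only ':' rows: all groups minus groups that have a non-':' row
a = np.setdiff1d(df['Groups'], df.loc[~m, 'Groups']).tolist()
print (a)
['G3']
# groups with only '_' rows: all groups minus groups that have a ':' row
c = np.setdiff1d(df['Groups'], df.loc[m, 'Groups']).tolist()
print (c)
['G2']
# groups with both kinds: groups that have a non-':' row, minus the '_'-only groups
b = np.setdiff1d(df.loc[~m, 'Groups'], c).tolist()
print (b)
['G1', 'G4']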
And for the counts, get the lengths of the lists:
print (len(a))
print (len(b))
print (len(c))
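The steps of this answer can be put together as a self-contained sketch. It assumes the example frame from the question. The descriptive variable names are my own, and np.intersect1d is swapped in for the "both" list, whereas the answer derives that list with a second setdiff1d.

```python
import numpy as np
import pandas as pd

# Example frame from the question (assumed to hold plain strings).
df = pd.DataFrame({
    "Groups": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G3", "G4", "G4"],
    "COL1": ["Seq:1", "Seq:2", "Seq_1", "Seq:4", "Seq_2", "Seq_3", "Seq_4",
             "Seq:5", "Seq:6", "Seq:7", "Seq_5"],
})

# True for rows whose COL1 value contains ':'
m = df["COL1"].str.contains(":")

groups_with_colon = df.loc[m, "Groups"]
groups_without_colon = df.loc[~m, "Groups"]

# setdiff1d returns the sorted unique values of the first array
# that do not occur in the second one.
only_colon = np.setdiff1d(df["Groups"], groups_without_colon).tolist()
only_underscore = np.setdiff1d(df["Groups"], groups_with_colon).tolist()

# Variation on the answer: groups present on both sides of the mask.
both = np.intersect1d(groups_with_colon, groups_without_colon).tolist()

print(only_colon, both, only_underscore)
print(len(only_colon), len(both), len(only_underscore))
```

Both NumPy functions return sorted, de-duplicated values, so the group labels do not need a separate drop_duplicates step.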
Answer 1: (score: 2)
You can use pd.Series.str.contains
to build a mask, then count with GroupBy.all
and GroupBy.any:
om = df['COL1'].str.contains(':')
one = om.groupby(df['Groups']).all().sum() # 1
two = om.groupby(df['Groups']).any().sum() - one # 2
# subtract `one` because `any` is also True for groups that are all True,
# so those groups have to be removed from the count.
three = (~om).groupby(df['Groups']).all().sum() # 1
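The same idea can be sketched so that each group gets a label and the three counts come out of one value_counts call. This is only an illustration on the example frame from the question. The label strings and the names has_colon, labels and counts are assumptions of mine, not part of the answer.

```python
import pandas as pd

# Example frame from the question (assumed to hold plain strings).
df = pd.DataFrame({
    "Groups": ["G1", "G1", "G1", "G1", "G2", "G2", "G2", "G3", "G3", "G4", "G4"],
    "COL1": ["Seq:1", "Seq:2", "Seq_1", "Seq:4", "Seq_2", "Seq_3", "Seq_4",
             "Seq:5", "Seq:6", "Seq:7", "Seq_5"],
})

# True for rows whose COL1 value contains a literal ':'
has_colon = df["COL1"].str.contains(":", regex=False)

grouped = has_colon.groupby(df["Groups"])
all_colon = grouped.all()   # True where every row of the group has ':'
any_colon = grouped.any()   # True where at least one row of the group has ':'

# Start with "both" for every group, then overwrite the two pure cases.
labels = pd.Series("both", index=all_colon.index)
labels.loc[all_colon] = "only_colon"
labels.loc[~any_colon] = "only_underscore"

counts = labels.value_counts().to_dict()
print(labels)
print(counts)
```

Keeping the per-group labels around makes it easy to see which groups landed in which bucket, not only how many.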