Below is a subset of a pandas DataFrame that I have:
index name_matches dist_matches
38 PO1000000345 M-00346 M-00346
39 PO1000000352 M-00804
40 PO1000000354 M-00196 M-00196
41 PO1000000355 M-00514 M-00514
42 PO1000000382 M-00353,M-00354 M-00354
43 PO1000000411
44 PO1000000451
45 PO1000000512 M-00680
46 PO1000000530 M-00089
47 PO1000000531 M-00087 M-00087
48 PO1000000553 M-00917,M-00920,M-00922 M-00920
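I am trying to create a new column (comb_matches) that holds the values that match between the name_matches and dist_matches columns. A column sometimes contains one or more comma-separated values. An example of the output I want is shown below.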
index name_matches dist_matches comb_matches
38 PO1000000345 M-00346 M-00346 M-00346
39 PO1000000352 M-00804
40 PO1000000354 M-00196 M-00196 M-00196
41 PO1000000355 M-00514 M-00514 M-00514
42 PO1000000382 M-00353,M-00354 M-00354 M-00354
43 PO1000000411
44 PO1000000451
45 PO1000000512 M-00680
46 PO1000000530 M-00089
47 PO1000000531 M-00087 M-00087 M-00087
48 PO1000000553 M-00917,M-00920,M-00922 M-00920 M-00920
Is there a simple way to achieve this?
Answer 0 (score: 5)
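There is no simple way. Pandas is not designed for this kind of task, which is not vectorizable. Your best option may be a list comprehension: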
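import numpy as np

# Compare each row's dist_matches value with that row's split name_matches list.
s1 = df['dist_matches'].astype(str)
s2 = df['name_matches'].astype(str).str.split(',')
mask = [i in j for i, j in zip(s1, s2)]
df['comb_match'] = np.where(mask, df['dist_matches'], np.nan)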
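The list comprehension above can be tried end to end with a small frame. This is a minimal sketch under stated assumptions: the sample rows are hypothetical, empty strings stand in for missing matches, and the helper name `combine_matches` is made up for illustration. It also writes an empty string instead of NaN for non-matching rows, which differs from the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; empty strings stand in for missing matches.
df = pd.DataFrame({
    "name_matches": ["M-00346", "", "M-00353,M-00354",
                     "M-00917,M-00920,M-00922", "M-00680"],
    "dist_matches": ["M-00346", "M-00804", "M-00354", "M-00920", ""],
})

def combine_matches(frame):
    """Return dist_matches where it appears in that row's name_matches list."""
    dist = frame["dist_matches"].astype(str)
    names = frame["name_matches"].astype(str).str.split(",")
    # bool(d) keeps an empty dist_matches from matching an empty name_matches.
    mask = [bool(d) and d in parts for d, parts in zip(dist, names)]
    return np.where(mask, frame["dist_matches"], "")

df["comb_matches"] = combine_matches(df)
print(df)
```

Only rows whose dist_matches value occurs among the comma-separated name_matches values receive a value in comb_matches; every other row gets an empty string.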
To demonstrate that Pandas str methods are not truly vectorized:
# Python 3.6.5, Pandas 0.23.0
import numpy as np
import pandas as pd

def wen(df):
    # Split on commas, then compare each piece with the same row's dist_matches.
    Bool = df.name_matches.str.split(',', expand=True).isin(df.dist_matches).any(axis=1)
    df['comb_match'] = np.where(Bool, df.dist_matches, '')
    return df

def jpp(df):
    s1 = df['dist_matches'].astype(str)
    s2 = df['name_matches'].astype(str).str.split(',')
    mask = [i in j for i, j in zip(s1, s2)]
    df['comb_match'] = np.where(mask, df['dist_matches'], np.nan)
    return df

df = pd.concat([df]*1000, ignore_index=True)

# Both functions modify df in place and return it, so this compares df with itself.
assert jpp(df).equals(wen(df))

%timeit jpp(df)  # 12.2 ms
%timeit wen(df)  # 32.7 ms
Answer 1 (score: 4)
Use str.split before isin, then pass the resulting Boolean mask to np.where:
Bool = df.name_matches.str.split(',', expand=True).isin(df.dist_matches).any(axis=1)
df['comb_match'] = np.where(Bool, df.dist_matches, '')
df
Out[520]:
index name_matches dist_matches comb_match
38 PO1000000345 M-00346 M-00346 M-00346
39 PO1000000352 M-00804
40 PO1000000354 M-00196 M-00196 M-00196
41 PO1000000355 M-00514 M-00514 M-00514
42 PO1000000382 M-00353,M-00354 M-00354 M-00354
43 PO1000000411
44 PO1000000451
45 PO1000000512 M-00680
46 PO1000000530 M-00089
47 PO1000000531 M-00087 M-00087 M-00087
48 PO1000000553 M-00917,M-00920,M-00922 M-00920 M-00920
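The split-then-isin approach works row by row because DataFrame.isin, when given a Series, aligns it on the index rather than testing membership in the whole set of values. The sketch below illustrates this with hypothetical sample rows (empty strings stand in for missing matches). The last row's name_matches value equals the first row's dist_matches value, yet it is not flagged as a match.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the last row's name_matches equals the first
# row's dist_matches, which would match under set membership but not row-wise.
df = pd.DataFrame({
    "name_matches": ["M-00346", "M-00353,M-00354", "M-00680", "M-00346"],
    "dist_matches": ["M-00346", "M-00354", "", "M-00514"],
})

# One column per comma-separated piece; shorter rows are padded with missing values.
parts = df["name_matches"].str.split(",", expand=True)

# isin with a Series compares each cell with the Series value at the same index.
hit = parts.isin(df["dist_matches"]).any(axis=1)

df["comb_matches"] = np.where(hit, df["dist_matches"], "")
print(df)
```

Passing `axis=1` by keyword keeps the call valid on newer pandas releases, where the axis argument of `any` is no longer accepted positionally.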