python pandas: matching strings based on a prefix

Date: 2019-07-12 10:41:49

Tags: python-3.x pandas pandas-groupby

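Below is my code, in which I parse a text file of hostnames with pd.read_csv and arrange them into columns by prefix. This works fine. However, there is now a new requirement: for sj12, the character right after the prefix (the fifth character of the hostname) must be a letter. In other words, sj12 should match sj12[a-z], e.g. sj12a001, sj12u003, etc.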

I'm looking for a way to do this in pandas.

#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
import numpy as np

prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']

df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)

# Keep only the wanted prefixes, drop rows where every value is NaN, then blank out the remaining NaNs
df = df[prefixes].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)

Current output with the above code:

sj00        sj12        cr00        cr08        eu00        eu50
sj000001    sj124000    cr000011    crn00001    euk000011   eu5000011
sj000002    sj125000    cr000012    crn00002    eu0000012   eu5000013
sj000003    sj12at00    cr000013    crn00003    eu0000013   eu5000014
sj000004    sj12bt00    cr000014    crn00004    eu0000014   eu5000015

Expected output:

    sj00        sj12        cr00        cr08        eu00        eu50
    sj000001    sj12at00    cr000011    crn00001    euk000011   eu5000011
    sj000002    sj12bt00    cr000012    crn00002    eu0000012   eu5000013
    sj000003                cr000013    crn00003    eu0000013   eu5000014
    sj000004                cr000014    crn00004    eu0000014   eu5000015

In the expected output above, you can see that sj124000 and sj125000 have been removed.
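
The rule behind this expected output can be sketched as a boolean mask. This is only a minimal illustration, assuming the intended rule is "sj12 followed by a lowercase letter". The `hosts` Series is a hypothetical stand-in built from the sj12 values shown above, not the real `new_hosts` file.

```python
import pandas as pd

# Hypothetical sample: the four sj12 hostnames from the current output.
hosts = pd.Series(['sj124000', 'sj125000', 'sj12at00', 'sj12bt00'])

# str.match anchors the pattern at the start of each string, so this is
# True only where 'sj12' is immediately followed by a lowercase letter.
is_letter_variant = hosts.str.match(r'sj12[a-z]')

# Keep only the hostnames that satisfy the rule.
kept = hosts[is_letter_variant].tolist()
print(kept)
```

With this sample, only sj12at00 and sj12bt00 pass the mask, which matches the two values left in the sj12 column of the expected output.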

Any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

I solved it with the str.extract method.

df['sj12'] = df['sj12'].str.extract(r'(\w\w\d\d\w*)', expand=True)

OR

df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w*)', expand=True)
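
As a possible alternative, the filtering can be done before the pivot, so that the surviving sj12 hosts move up to the top rows as in the expected output (str.extract on an already pivoted column leaves NaN in place for rows that do not match). Because `\w` also matches digits, a pattern like `\w\w\d\d\w*` would still accept sj124000, so this sketch uses the character class `[a-z]` for the fifth character. It is a minimal sketch under assumptions: the in-memory DataFrame is a hypothetical stand-in for the `new_hosts` file, only two prefixes are used, and `bad_sj12` and `out` are illustrative names.

```python
import pandas as pd

prefixes = ['sj00', 'sj12']

# Hypothetical in-memory stand-in for the 'new_hosts' file (one hostname per row).
df = pd.DataFrame({0: ['sj000001', 'sj000002', 'sj000003', 'sj000004',
                       'sj124000', 'sj125000', 'sj12at00', 'sj12bt00']})
df['prefix'] = df[0].str[:4]

# Rows with prefix sj12 whose fifth character is NOT a lowercase letter.
bad_sj12 = (df['prefix'] == 'sj12') & ~df[0].str.match(r'sj12[a-z]')

# Drop them before numbering the rows, so the remaining sj12 hosts
# are counted from 0 and end up at the top of their column.
df = df[~bad_sj12].copy()

df['grp'] = df.groupby('prefix').cumcount()
out = df.pivot(index='grp', columns='prefix', values=0)

# Same clean-up as in the question: keep the wanted prefixes,
# drop all-NaN rows, and show the remaining NaNs as empty strings.
out = out[prefixes].dropna(axis=0, how='all').fillna('').rename_axis(None)
print(out)
```

With this sample data, the sj12 column holds sj12at00 and sj12bt00 in the first two rows and empty strings below them, while the sj00 column is unchanged.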