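I have the code below, in which I use pd.read_csv to parse a text file of hostnames and group them by prefix. This works well. However, there is now a new requirement: for sj12, the character that follows the prefix must be a letter. For example, sj12 should match sj12[a-z], i.e. sj12a001, sj12u003, and so on.
I am looking for a way to do this in pandas.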
#!/grid/common/pkgs/python/v3.6.1/bin/python3
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
# Drop rows where every value is NaN, then replace the remaining NaNs with empty strings
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Current output:

sj00      sj12      cr00      cr08      eu00       eu50
sj000001  sj124000  cr000011  crn00001  euk000011  eu5000011
sj000002  sj125000  cr000012  crn00002  eu0000012  eu5000013
sj000003  sj12at00  cr000013  crn00003  eu0000013  eu5000014
sj000004  sj12bt00  cr000014  crn00004  eu0000014  eu5000015

Expected output:

sj00      sj12      cr00      cr08      eu00       eu50
sj000001  sj12at00  cr000011  crn00001  euk000011  eu5000011
sj000002  sj12bt00  cr000012  crn00002  eu0000012  eu5000013
sj000003            cr000013  crn00003  eu0000013  eu5000014
sj000004            cr000014  crn00004  eu0000014  eu5000015
In the expected output above, you can see that sj124000 and sj125000 have been removed.
Any help would be much appreciated.
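To make the requirement concrete, here is a minimal sketch of the sj12[a-z] rule applied to a handful of hostnames. The names `hosts` and `is_sj12_letter` are hypothetical and the sample values are taken from the output shown in the question. It uses `Series.str.match`, which anchors the pattern at the start of each string.

```python
import pandas as pd

# Hypothetical sample; in the question these names come from the 'new_hosts' file.
hosts = pd.Series(['sj124000', 'sj125000', 'sj12at00', 'sj12bt00', 'sj000001'])

# True only where the name starts with 'sj12' followed by a lowercase letter.
is_sj12_letter = hosts.str.match(r'sj12[a-z]')

# Keep only the names that satisfy the rule.
filtered = hosts[is_sj12_letter]
print(filtered.tolist())
```

Names such as sj124000 fail the test because the character after the prefix is a digit.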
Answer 0 (score: 0)
I solved it with the str.extract method.
df['sj12'] = df['sj12'].str.extract(r'(\w\w\d\d\w*)', expand=True)
or
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w*)', expand=True)
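One caveat: in Python regular expressions `\w` matches digits as well as letters, so a pattern built only from `\w` and `\d` does not by itself exclude names such as sj124000. The following is a hedged sketch, not the answer's exact method, that requires an explicit `[a-z]` after the two digits and then moves the surviving names to the top of the column, as in the expected output. The small DataFrame is a hypothetical stand-in for the pivoted frame from the question.

```python
import pandas as pd

# Hypothetical stand-in for two columns of the pivoted frame.
df = pd.DataFrame({
    'sj00': ['sj000001', 'sj000002', 'sj000003', 'sj000004'],
    'sj12': ['sj124000', 'sj125000', 'sj12at00', 'sj12bt00'],
})

# Require a lowercase letter right after the two digits; non-matching rows become NaN.
kept = df['sj12'].str.extract(r'^(\w{2}\d{2}[a-z]\w*)$', expand=False)

# Shift the surviving names to the top and fill the remaining rows with ''.
kept = kept.dropna().reset_index(drop=True)
df['sj12'] = kept.reindex(df.index).fillna('')

print(df)
```

The `reindex` step assumes the frame has a default 0..n-1 index, which is the case after the pivot on `grp` in the question's code.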