I have created a DataFrame df that has a column with the following values:
category
20150115_Holiday_HK_Misc
20150115_Holiday_SG_Misc
20140116_DE_ProductFocus
20140116_UK_ProductFocus
I want to create 3 new columns:
category | A | B | C
20150115_Holiday_HK_Misc | 20150115_Holiday_Misc | HK | Holiday_Misc
20150115_Holiday_SG_Misc | 20150115_Holiday_Misc | SG | Holiday_Misc
20140116_DE_ProductFocus | 20140116_ProductFocus | DE | ProductFocus
20140116_UK_ProductFocus | 20140116_ProductFocus | UK | ProductFocus
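For column A, I want to strip out the "_HK" part. I assume I need to hard-code this, which is fine, since I have a list of all the country codes.
Column B is just that country code.
Column C is column A without the leading date.
I am trying something like this, but I am not getting very far.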
df['B'] = np.where([df['category'].str.contains("HK")==True], 'HK', 'Not Specified')
Thanks
Answer 0 (score: 5)
You can use the Series.str.replace() and Series.str.extract() methods:
# remove two characters (Country Code) surrounded by '_'
df['A'] = df.category.str.replace(r'_\w{2}_', '_', regex=True)
# extract two characters (Country Code) surrounded by '_'
df['B'] = df.category.str.extract(r'_(\w{2})_', expand=False)
df['C'] = df.A.str.extract(r'\d+_(.*)', expand=False)
Result:
In [148]: df
Out[148]:
category A B C
0 20150115_Holiday_HK_Misc 20150115_Holiday_Misc HK Holiday_Misc
1 20150115_Holiday_SG_Misc 20150115_Holiday_Misc SG Holiday_Misc
2 20140116_DE_ProductFocus 20140116_ProductFocus DE ProductFocus
3 20140116_UK_ProductFocus 20140116_ProductFocus UK ProductFocus
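Since the question mentions that a full list of country codes is available, the pattern can be built from that list instead of matching any two word characters, which would also catch a two-character token that is not a country code. The following is a minimal, self-contained sketch; the `country_codes` list and the "Not Specified" fallback (borrowed from the question's attempt) are assumptions for illustration, not part of the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20150115_Holiday_SG_Misc",
    "20140116_DE_ProductFocus",
    "20140116_UK_ProductFocus",
]})

# Hypothetical list; replace it with the full set of country codes.
country_codes = ["HK", "SG", "DE", "UK"]

# Match one of the known codes only when it sits between two underscores.
codes = "|".join(map(re.escape, country_codes))
pattern = rf"_({codes})_"

# A: category with the country code removed.
df["A"] = df["category"].str.replace(pattern, "_", regex=True)
# B: the country code itself, with a fallback for rows that have none.
df["B"] = df["category"].str.extract(pattern, expand=False).fillna("Not Specified")
# C: column A without the leading date.
df["C"] = df["A"].str.extract(r"^\d+_(.*)", expand=False)
```

Rows whose category contains none of the listed codes keep their original text in A and get "Not Specified" in B.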
Answer 1 (score: 1)
You can also use regular expressions together with apply:
import re
df['A'] = df.category.apply(lambda x:re.sub(r'(.*)_(\w\w)_(.*)', r'\1_\3', x))
df['B'] = df.category.apply(lambda x:re.sub(r'(.*)_(\w\w)_(.*)', r'\2', x))
df['C'] = df.A.apply(lambda x:re.sub(r'(\d+)_(.*)', r'\2', x))
Result:
category A B C
0 20150115_Holiday_HK_Misc 20150115_Holiday_Misc HK Holiday_Misc
1 20150115_Holiday_SG_Misc 20150115_Holiday_Misc SG Holiday_Misc
2 20140116_DE_ProductFocus 20140116_ProductFocus DE ProductFocus
3 20140116_UK_ProductFocus 20140116_ProductFocus UK ProductFocus
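One caveat with the re.sub approach: re.sub returns its input unchanged when the pattern does not match, so for a category without a country code, column B would end up holding the whole category string. A possible variant, sketched below, matches once per row and leaves B empty in that case. The helper name `split_category`, the extra sample row, and the restriction to two upper-case letters (`[A-Z]{2}`) are assumptions made for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"category": [
    "20150115_Holiday_HK_Misc",
    "20140116_DE_ProductFocus",
    "20140116_ProductFocus",  # hypothetical row with no country code
]})

# Text before the code, the two-letter code, and the text after it.
PATTERN = re.compile(r"(.*)_([A-Z]{2})_(.*)")

def split_category(value):
    """Return columns A, B and C for one category string (hypothetical helper)."""
    match = PATTERN.fullmatch(value)
    if match:
        a = f"{match.group(1)}_{match.group(3)}"
        b = match.group(2)
    else:
        a, b = value, None  # no country code found
    c = re.sub(r"^\d+_", "", a)  # drop the leading date
    return pd.Series({"A": a, "B": b, "C": c})

# apply returns a DataFrame here because the helper returns a Series per row.
df = df.join(df["category"].apply(split_category))
```

For large frames the vectorized str methods from the first answer are likely to be faster than a row-by-row apply.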