I have the following pandas Series:
cfia_recalls_merged['title'].head()
0 One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes
1 Pastene brand Green Olives Sliced recalled due to container integrity defects
2 Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage
3 Obiji brand Palm Oil recalled due to Sudan IV
4 One Degree Organic Foods brand Gluten Free Sprouted Rolled Oats recalled due to packaging integrity defects and rancidity
Name: title, dtype: object
I want to extract certain parts of each string and store them in new columns. Example:
test = {'brand': ['One Ocean', 'Pastene', 'Casa Italia'], 'product': ['Sliced Smoked Wild Sockeye Salmon', 'Green Olives Sliced', 'Soppressata Piccante Salami'], 'hazard': ['Listeria monocytogenes', 'container integrity defects', 'possible spoilage']}
example = pd.DataFrame(test)
example
         brand                            product                       hazard
0    One Ocean  Sliced Smoked Wild Sockeye Salmon       Listeria monocytogenes
1      Pastene                Green Olives Sliced  container integrity defects
2  Casa Italia        Soppressata Piccante Salami            possible spoilage
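Basically, my delimiters are "brand" and "recalled due to".
How can I do this using a regular expression with capture groups?
Any help is appreciated. Thanks in advance!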
Answer 0 (score: 1)
You can use str.extract here:
cfia_recalls_merged['brand'] = cfia_recalls_merged['title'].str.extract(r'^(.*?) brand\b')
cfia_recalls_merged['product'] = cfia_recalls_merged['title'].str.extract(r'^.*? brand (.*?) recalled due to\b')
cfia_recalls_merged['hazard'] = cfia_recalls_merged['title'].str.extract(r'\brecalled due to (.*)$')
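As an alternative to three separate calls, the same split can be sketched with a single pattern and named capture groups, since str.extract uses the group names as column names. This is only a sketch: the sample titles are copied from the question, the variable names titles, pattern and parts are made up for illustration, and it assumes every title follows the "<brand> brand <product> recalled due to <hazard>" layout.

```python
import pandas as pd

# Sample titles copied from the question; in practice this would be
# cfia_recalls_merged['title'].
titles = pd.Series(
    [
        "One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes",
        "Pastene brand Green Olives Sliced recalled due to container integrity defects",
        "Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage",
    ],
    name="title",
)

# One pattern with three named groups. The lazy quantifiers (.*?) stop at the
# first " brand " and the first " recalled due to "; the final group takes
# the rest of the string.
pattern = r"^(?P<brand>.*?) brand (?P<product>.*?) recalled due to (?P<hazard>.*)$"

# With several groups, str.extract returns a DataFrame with one column per group.
parts = titles.str.extract(pattern)
print(parts)
```

Titles that do not match the pattern come back as NaN in all three columns, so it may be worth checking parts.isna() before relying on the result. If it looks right, the columns can be attached to the original frame with something like cfia_recalls_merged.join(parts).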