Extracting strings from a pandas Series

Time: 2021-02-23 04:40:23

Tags: python regex

I have the following pandas Series:

cfia_recalls_merged['title'].head()
0                                     One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes
1                                                Pastene brand Green Olives Sliced recalled due to container integrity defects
2                                              Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage
3                                                                                Obiji brand Palm Oil recalled due to Sudan IV
4    One Degree Organic Foods brand Gluten Free Sprouted Rolled Oats recalled due to packaging integrity defects and rancidity
Name: title, dtype: object

I want to extract certain parts of each string and store them in new columns. Example of the desired result:

import pandas as pd

test = {'brand': ['One Ocean', 'Pastene', 'Casa Italia'], 'product': ['Sliced Smoked Wild Sockeye Salmon', 'Green Olives Sliced', 'Soppressata Piccante Salami'], 'hazard': ['Listeria monocytogenes', 'container integrity defects', 'possible spoilage']}
example = pd.DataFrame(test)
example

    brand         product                              hazard
0   One Ocean     Sliced Smoked Wild Sockeye Salmon    Listeria monocytogenes
1   Pastene       Green Olives Sliced                  container integrity defects
2   Casa Italia   Soppressata Piccante Salami          possible spoilage

Basically, my delimiters are the literal words "brand" and "recalled due to".

How can I do this with a regular expression and capture groups?

Any help is appreciated. Thanks in advance!

1 Answer:

Answer 0 (score: 1)

You can use str.extract here, with one capture group per new column:

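# brand: everything from the start of the title up to the word "brand"
cfia_recalls_merged['brand'] = cfia_recalls_merged['title'].str.extract(r'^(.*?) brand\b')
# product: the text between "brand" and "recalled due to"
cfia_recalls_merged['product'] = cfia_recalls_merged['title'].str.extract(r'^.*? brand (.*?) recalled due to\b')
# hazard: everything after "recalled due to"
cfia_recalls_merged['hazard'] = cfia_recalls_merged['title'].str.extract(r'\brecalled due to (.*)$')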
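
As an alternative to the three separate calls above, the same split can probably be done in a single pass with named capture groups, since str.extract uses group names as column names. This is only a minimal sketch: the sample titles are copied from the question (plus one made-up non-matching row), and the names `titles`, `pattern`, `parts`, and `merged` are illustrative, not from the original post. It assumes every title follows the "<brand> brand <product> recalled due to <hazard>" layout; rows that do not match come back as NaN.

```python
import pandas as pd

# Sample titles from the question, plus one hypothetical row that does not fit the layout.
titles = pd.Series(
    [
        "One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes",
        "Pastene brand Green Olives Sliced recalled due to container integrity defects",
        "Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage",
        "Some unrelated notice",  # made-up example: no " brand " delimiter
    ],
    name="title",
)

# One pattern with three named groups; each group name becomes a column name.
pattern = r"^(?P<brand>.*?) brand (?P<product>.*?) recalled due to (?P<hazard>.*)$"

# With several capture groups, str.extract returns a DataFrame (one column per group).
parts = titles.str.extract(pattern)

# Attach the extracted columns next to the original titles (aligned on the index).
merged = titles.to_frame().join(parts)
print(merged[["brand", "product", "hazard"]])
```

With the real data, the last step would be something like `cfia_recalls_merged.join(parts)` after extracting from `cfia_recalls_merged['title']`. It is worth checking for NaN rows afterwards to catch titles that deviate from the expected wording.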