I have the following DataFrame:
import pandas as pd
test = {'title': ['Undeclared milk in Burnbrae', 'Undeclared milk in certain Bumble', 'Certain cheese products may contain listeria', 'Ocean brand recalled due to Salmonella', 'IQF Raspberries due to Listeria']}
example = pd.DataFrame(test)
example
   title
0  Undeclared milk in Burnbrae
1  Undeclared milk in certain Bumble
2  Certain cheese products may contain listeria
3  Ocean brand recalled due to Salmonella
4  IQF Raspberries due to Listeria
I want to extract the following strings into a single column. I would like my result to look like this:
test = {'hazard': ['Undeclared milk', 'Undeclared milk', 'listeria', 'Salmonella', 'Listeria'], 'title': ['Undeclared milk in Burnbrae', 'Undeclared milk in certain Bumble', 'Certain cheese products may contain listeria', 'Ocean brand recalled due to Salmonella', 'IQF Raspberries due to Listeria']}
example2 = pd.DataFrame(test)
example2
   hazard           title
0  Undeclared milk  Undeclared milk in Burnbrae
1  Undeclared milk  Undeclared milk in certain Bumble
2  listeria         Certain cheese products may contain listeria
3  Salmonella       Ocean brand recalled due to Salmonella
4  Listeria         IQF Raspberries due to Listeria
Basically, my delimiters are "in", "may contain", and "due to".
example['hazard'] = example['title'].str.extract(r'^(.*?) in\b')
example['hazard'] = example['title'].str.extract(r'\b may contain (.*)$')
example['hazard'] = example['title'].str.extract(r'\b due to (.*)$')
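I wrote the code above to test each delimiter on its own, but I want the matches for all of the delimiters to end up in the same column.
How can I do this?
Any help is appreciated.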
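Answer 0 (score: 3)
You can put the delimiter patterns in a list and combine them with "|".join into one larger pattern. From there, Series.str.extract returns one column per capture group, and we reshape the result to match the original size.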
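separators = [r"^(.*?) in\b", r"\b may contain (.*)$", r"\b due to (.*)$"]
sep_pattern = r"|".join(separators)
# extract() yields one column per capture group; stack() and droplevel(1)
# collapse them back into a single Series indexed like example
example["hazard"] = (example["title"].str.extract(sep_pattern)
                                     .stack()
                                     .droplevel(1))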
print(example)
   title                                         hazard
0  Undeclared milk in Burnbrae                   Undeclared milk
1  Undeclared milk in certain Bumble             Undeclared milk
2  Certain cheese products may contain listeria  listeria
3  Ocean brand recalled due to Salmonella        Salmonella
4  IQF Raspberries due to Listeria               Listeria
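One caveat with the stack-based reshape: it only lines up with the original rows when stack() drops the empty capture groups, and that default may differ between pandas versions. Below is a rough sketch of an alternative that does not depend on it. It assumes at most one group matches per title, and the extra "No known delimiter here" row is a hypothetical addition, not part of the question's data, included only to show the unmatched case.

```python
import pandas as pd

example = pd.DataFrame({"title": [
    "Undeclared milk in Burnbrae",
    "Undeclared milk in certain Bumble",
    "Certain cheese products may contain listeria",
    "Ocean brand recalled due to Salmonella",
    "IQF Raspberries due to Listeria",
    "No known delimiter here",  # hypothetical row: matches none of the patterns
]})

separators = [r"^(.*?) in\b", r"\b may contain (.*)$", r"\b due to (.*)$"]

# One column per capture group; only the group that matched is non-null.
extracted = example["title"].str.extract("|".join(separators))

# Back-fill across the columns so the matched value lands in the first
# column, then keep that column. Rows with no match stay NaN.
example["hazard"] = extracted.bfill(axis=1).iloc[:, 0]

print(example)
```

Because the result is selected column-wise rather than reshaped, it always has one value per original row, so the assignment does not rely on index alignment.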
Answer 1 (score: 1)
A more first-principles way to get the same result:
import re
import numpy as np

def func(s: str):
    # Search the title with each delimiter pattern
    check1 = re.search(r'^(.*?) in\b', s)
    check2 = re.search(r'\b may contain (.*)$', s)
    check3 = re.search(r'\b due to (.*)$', s)
    # Return the capture from the first pattern that matched, NaN otherwise
    if check1:
        return check1.group(1)
    elif check2:
        return check2.group(1)
    elif check3:
        return check3.group(1)
    else:
        return np.nan

example["hazard"] = example["title"].apply(func)
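If the list of delimiters is likely to grow, the same idea can be written as a loop over precompiled patterns, which also stops searching as soon as one matches. This is only a sketch using the standard library; the names PATTERNS and extract_hazard are made up for illustration, and it returns None rather than np.nan for titles with no match.

```python
import re

# Patterns from the question, tried in order; the first match wins.
PATTERNS = [
    re.compile(r"^(.*?) in\b"),
    re.compile(r"\b may contain (.*)$"),
    re.compile(r"\b due to (.*)$"),
]

def extract_hazard(title):
    """Return the captured hazard text, or None when no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(title)
        if match:
            return match.group(1)
    return None

titles = [
    "Undeclared milk in Burnbrae",
    "Undeclared milk in certain Bumble",
    "Certain cheese products may contain listeria",
    "Ocean brand recalled due to Salmonella",
    "IQF Raspberries due to Listeria",
]
hazards = [extract_hazard(t) for t in titles]
print(hazards)
```

With a DataFrame, the function can be passed to apply the same way as func above, e.g. example["title"].apply(extract_hazard).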