How can I find all known ingredient strings in a block of text?

Posted: 2016-10-13 21:35:28

Tags: python postgresql parsing elasticsearch nlp

Given a string of ingredients:

text = """Ingredients: organic cane sugar, whole-wheat flour,
       mono & diglycerides. Manufactured in a facility that uses nuts."""

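How can I extract the ingredients from my Postgres database, or find them in my Elasticsearch index, without matching tokens such as "Ingredients:" or "nuts"?

The expected output is: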

ingredients = process(text)
# ['cane sugar', 'whole wheat flour', 'mono diglycerides']

1 answer:

Answer 0 (score: 0):

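This Python code gives me the output ['organic cane sugar', 'whole-wheat flour', 'mono & diglycerides']. It requires that the ingredient list comes right after "Ingredients: " and that every ingredient is listed before the first ".", as in your case. Note that it only splits the list; it does not drop qualifiers such as "organic" or normalize the punctuation.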

import re
text = """Ingredients: organic cane sugar, whole-wheat flour,
   mono & diglycerides. Manufactured in a facility that uses nuts."""

# Search everything that comes after 'Ingredients: ' and before '.'
m = re.search(r'(?<=Ingredients: ).+?(?=\.)', text, re.DOTALL)  # DOTALL: make . match newlines too
# Turn newlines into spaces, then split into a list of items separated by ','
items = m.group(0).replace('\n', ' ').split(',') if m else []
items = [i.strip() for i in items]  # Remove leading and trailing whitespace from each item
print(items)
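
To get from that output to the list the question asks for, one option is to normalize each item and look it up against the known ingredient names. The following is only a minimal sketch: KNOWN_INGREDIENTS is a hypothetical in-memory stand-in for whatever the Postgres table or Elasticsearch index actually holds, and normalize() and the matching rule are my own assumptions, not something given in the question.

```python
import re

# Hypothetical stand-in for the ingredient names stored in the database.
KNOWN_INGREDIENTS = ["cane sugar", "whole wheat flour", "mono diglycerides", "nuts"]

def normalize(s):
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def process(text, known=KNOWN_INGREDIENTS):
    # Keep only the span between 'Ingredients:' and the first period,
    # so later sentences ("... uses nuts.") are never searched.
    m = re.search(r"Ingredients:(.+?)\.", text, re.DOTALL)
    if m is None:
        return []
    found = []
    for item in m.group(1).split(","):
        padded = " " + normalize(item) + " "
        # Try the longest known name first; padding with spaces restricts
        # the substring test to whole words.
        for name in sorted(known, key=len, reverse=True):
            if " " + normalize(name) + " " in padded:
                found.append(name)
                break
    return found

text = """Ingredients: organic cane sugar, whole-wheat flour,
       mono & diglycerides. Manufactured in a facility that uses nuts."""
print(process(text))
# ['cane sugar', 'whole wheat flour', 'mono diglycerides']
```

"nuts" is in the known list but is not returned, because it appears after the first period. For a large table, the inner loop would probably be better replaced by a query against the database or index itself.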