Question

Given a list of ingredients:

text = """Ingredients: organic cane sugar, whole-wheat flour,
       mono & diglycerides. Manufactured in a facility that uses nuts."""

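How can I extract the ingredients that are in my Postgres database, or find them in my Elasticsearch index, without matching tokens such as Ingredients: or nuts?

The expected output is: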

ingredients = process(text)
# ['cane sugar', 'whole wheat flour', 'mono diglycerides']

Answer 1

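This Python code gives me the output ['organic cane sugar', 'whole-wheat flour', 'mono & diglycerides']. It requires that the ingredients come after 'Ingredients: ' and that all of them are listed before the first '.', as in your case.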

import re
text = """Ingredients: organic cane sugar, whole-wheat flour,
   mono & diglycerides. Manufactured in a facility that uses nuts."""

# Search everything that comes after 'Ingredients: ' and before '.'
m = re.search(r'(?<=Ingredients: ).+?(?=\.)', text, re.DOTALL)  # Raw string for the pattern; DOTALL makes . match newlines too
items = m.group(0).replace('\n', ' ').split(',')  # Turn newlines into spaces, make a list of items separated by ','
items = [i.strip() for i in items]  # Remove leading and trailing whitespace in each item
print(items)
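
The list produced this way still contains 'organic', the hyphen, and the '&', so it does not quite match the expected output in the question. One possible way to close that gap is sketched below. It assumes the known ingredient names have already been loaded from the database into a plain Python list; `known_ingredients`, `normalize`, and `find_known` are hypothetical names used only for illustration and are not part of the original answer.

```python
import re

def normalize(s):
    """Lowercase and collapse every run of non-alphanumeric characters into one space."""
    return re.sub(r'[^a-z0-9]+', ' ', s.lower()).strip()

def find_known(text, known):
    """Return the known ingredients found between 'Ingredients:' and the first period."""
    m = re.search(r'Ingredients:\s*(.+?)\.', text, re.DOTALL)
    if m is None:
        return []
    found = []
    for item in m.group(1).split(','):
        item_norm = normalize(item)
        # Known names that occur as whole words inside this list item
        hits = [k for k in known
                if re.search(r'\b' + re.escape(normalize(k)) + r'\b', item_norm)]
        if hits:
            found.append(max(hits, key=len))  # prefer the most specific (longest) name
    return found

# Assumption: in practice this list would come from the Postgres table
known_ingredients = ['cane sugar', 'whole wheat flour', 'mono diglycerides', 'nuts']

text = """Ingredients: organic cane sugar, whole-wheat flour,
       mono & diglycerides. Manufactured in a facility that uses nuts."""

print(find_known(text, known_ingredients))
# ['cane sugar', 'whole wheat flour', 'mono diglycerides']
```

Because the search is limited to the span before the first period, 'nuts' in the allergen sentence is not matched even though it is in the known list. For a large ingredient table, looping over every known name per item may be too slow, and a database-side or index-side lookup would likely be a better fit.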

How to find all known ingredient strings in a block of text?
