Question

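This is for a school programming project, and I am only allowed to use import re.

I am trying to find every sentence in a text file that contains a certain expression (passed in as a parameter) and extract those sentences into a list. From other posts I learned to use the dots at the start and end of a sentence as delimiters, which got me halfway there, but a number containing a dot breaks the result.

Suppose I have txt: This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working.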

search = re.findall(r"([^.]*?" + expression + r"[^.]*)\.", txt)

The result I get is ['576, I want to extract the phrase with this expression',]

The result I want is ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

I am still a beginner at this. Any help?

Answer 1

If I understand correctly, you want to split the text into sentences. A regex that works well for this is:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', txt)

If that does not work, you can first use this regex to replace the decimal point inside numbers with a comma:

txt = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)
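
Putting the two steps together, a minimal sketch might look like the following. The variable names `protected`, `sentences`, `matches` and the search term in `expression` are only illustrative. Note two side effects of this approach: the split consumes the sentence-ending punctuation, and the number comes back as 990,576 rather than 990.576.

```python
import re

txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")
expression = "expression"  # illustrative search term

# Step 1: turn the dot inside decimal numbers into a comma so it is
# no longer mistaken for the end of a sentence.
protected = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)

# Step 2: split on sentence-ending punctuation and drop empty pieces.
sentences = [s for s in re.split(r' *[\.\?!][\'"\)\]]* *', protected) if s]

# Step 3: keep only the sentences that contain the expression.
matches = [s for s in sentences if expression in s]
print(matches)
# => ['I dont want for the result to stop in the number 990,576, I want to extract the phrase with this expression']
```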

Answer 2

Tokenize the text into sentences with NLTK, then use a whole-word search or a plain substring check.

Whole-word search example:

import nltk, re
text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."
sentences = nltk.sent_tokenize(text)
word = "expression"
print([sent for sent in sentences if re.search(r'\b{}\b'.format(word), sent)])
# => ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

If you do not need a whole-word search, replace if re.search(r'\b{}\b'.format(word), sent) with if word in sent.
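
To see the difference between the two checks, here is a small sketch that works on an already-split list of sentences, so it does not need NLTK. The helper name `filter_sentences` and the sample sentences are made up for illustration. It also passes the word through `re.escape`, which is an addition on my part so that a search term containing regex metacharacters is matched literally.

```python
import re

def filter_sentences(sentences, word, whole_word=True):
    """Return the sentences that contain `word` (hypothetical helper)."""
    if whole_word:
        # \b anchors the match at word boundaries; re.escape keeps any
        # special characters in `word` from being read as regex syntax.
        pattern = re.compile(r'\b{}\b'.format(re.escape(word)))
        return [s for s in sentences if pattern.search(s)]
    # Plain substring check: also matches inside longer words.
    return [s for s in sentences if word in s]

sentences = ["Two expressions here.", "One expression here."]
print(filter_sentences(sentences, "expression"))
# => ['One expression here.']
print(filter_sentences(sentences, "expression", whole_word=False))
# => ['Two expressions here.', 'One expression here.']
```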

Answer 3

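This may not be the best solution, but you can split the text into sentences and then pick out the ones that contain the expression, like this: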

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

matching = [s for s in sentences if "I want to extract the phrase with this expression" in s]

print(matching)

#Result:
# ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']
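
For anyone wondering why the number survives here: the pattern only splits on whitespace that directly follows a . or ?, and the dot in 990.576 is followed by a digit, not whitespace. The two negative lookbehinds additionally skip splits after things like "i.e." and "Mr.". A self-contained sketch of just the splitting step, with a wrapper name `split_sentences` and a second sample string that I made up for illustration:

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.' or '?', except after
    patterns such as 'i.e.' or 'Mr.' (hypothetical wrapper)."""
    return re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

text = ("This is a text. I dont want for the result to stop in the number "
        "990.576, I want to extract the phrase with this expression. "
        "Its not working.")
for sentence in split_sentences(text):
    print(repr(sentence))
# => 'This is a text.'
# => 'I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.'
# => 'Its not working.'

print(split_sentences("Mr. Smith paid 990.576 euros. He left."))
# => ['Mr. Smith paid 990.576 euros.', 'He left.']
```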

Hope this helps!
