How do I include the sentences around a selected substring?

Time: 2018-10-08 18:21:58

Tags: python string

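I want to search for a target text inside a large string. My code finds the text in the string and displays the 40 characters before it and the 40 characters after the start of the match. Instead, I would like to display the 2 sentences before the target text and the 2 sentences after it. My code: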

import re

sentence = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."

sub = "biopsychosocial model"

def find_all_substrings(string, sub):
    starts = [match.start() for match in re.finditer(re.escape(sub), string.lower())]
    return starts 

substrings = find_all_substrings(sentence, sub)
for pos in substrings: print(sentence[pos-40:pos+40])

How can I display the 2 sentences before the target text and the 2 sentences after it?

1 answer:

Answer 0 (score: 4)

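You can first split the text into sentences, then find every sentence (and its index) that contains the substring you are looking for. After that, simply slice the list of sentences around each sentence you found.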

Here is an example (using nltk.tokenize.sent_tokenize):

from nltk.tokenize import sent_tokenize

text = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."
sentences = sent_tokenize(text)

sub = "biopsychosocial model"
matching_indices = [i for i, sentence in enumerate(sentences) if sub in sentence]

n_sent_padding = 1
displayed_sentences = [
    # max(0, ...) keeps a match near the start from producing a negative slice start
    ' '.join(sentences[max(0, i-n_sent_padding):i+n_sent_padding+1])
    for i in matching_indices
]

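This finds the index of every sentence that contains the substring (stored in matching_indices). displayed_sentences then holds each matching sentence together with the sentences before and after it; n_sent_padding sets how many sentences are included on each side.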

displayed_sentences is then:

['The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder.']
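
The same steps can be wrapped in a small helper with the padding as a parameter, so the two sentences on each side asked for in the question become padding=2. This is only a sketch: the name sentences_around and its parameters are hypothetical, the matching is made case-insensitive as an assumption, and the input list is made up for illustration (in the answer above it would come from sent_tokenize(text)). The slice start is clamped at 0 because a negative start index in Python counts from the end of the list.

```python
def sentences_around(sentences, sub, padding=2):
    """For each sentence containing `sub`, return that sentence joined with
    up to `padding` sentences before and after it."""
    sub_lower = sub.lower()
    results = []
    for i, sentence in enumerate(sentences):
        if sub_lower in sentence.lower():
            start = max(0, i - padding)  # clamp: a negative start would count from the end
            end = i + padding + 1        # slicing past the end of a list is safe
            results.append(' '.join(sentences[start:end]))
    return results


# Hypothetical pre-split input, standing in for the output of a sentence tokenizer.
sentences = ["One.", "Two.", "Three has the Target.", "Four.", "Five.", "Six."]
context = sentences_around(sentences, "target", padding=2)
first_only = sentences_around(sentences, "one", padding=2)
```

Here context holds the matching sentence with two neighbours on each side, while first_only shows the clamped case: the match is the first sentence, so only the following sentences are added.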

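Pay attention to how nltk splits the sentences: it can sometimes be a bit odd (for example, splitting on the period in "Mr."). This post explains how to tweak the sentence tokenizer.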
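
One way such tweaking can look is to build a Punkt tokenizer with an explicit abbreviation list. This is a sketch under assumptions, not necessarily what the linked post recommends: the abbreviation set and the sample text are made up, and the tokenizer built here is untrained, so it knows only the abbreviations it is given and is not the pretrained English model that sent_tokenize loads.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Abbreviations are given in lowercase and without the trailing period.
params = PunktParameters()
params.abbrev_types = {"mr", "mrs", "dr"}  # illustrative list, extend as needed

# Passing the parameters object yields a tokenizer without any training step.
tokenizer = PunktSentenceTokenizer(params)

sample = "Mr. Smith met Dr. Jones. They talked for an hour."
sentences = tokenizer.tokenize(sample)
```

With these parameters the periods after "Mr" and "Dr" are treated as part of an abbreviation instead of a sentence boundary, so the sample comes back as two sentences.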