Question

I use the following code to open a text file, strip the HTML, and search for the words before and after a certain keyword:

import nltk
import re

text = nltk.clean_html(open('file.txt').read())
text = text.lower()

pattern = re.compile(r'''(?x) ([^\(\)0-9]\.)+ | \w+(-\w+)* |  \.\.\. ''')
text = nltk.regexp_tokenize(text, pattern)

#remove the digits from text
text = [i for i in text if not i.isdigit()]

# Text is now a list of words from file.txt
# I now loop over the Text to find all words before and after a specific keyword

keyword = ['foreign']
for i, w in enumerate(text):  # enumerate pairs each list item with its index
    if w in keyword:
        # up to five words before the keyword, clamped at the start of the list
        before_word = ' '.join(text[max(0, i-5):i])
        # up to five words after the keyword
        after_word = ' '.join(text[i+1:i+6])
        print("%s <%s> %s" % (before_word, w, after_word))

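This code works well if keyword is a single word. But what if I want to find the 5 words before and after 'foreign currency'? The problem is that in text, every whitespace-separated word is a separate item in the text list, so I can't just write keyword = ['foreign currency']. How can I solve this?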

A sample .txt file is here.

Answer 1

Have you considered a regular expression?

This will match and capture the five words before, and the five words after, 'foreign currency':

((\w+ ){5})foreign currency(( \w+){5})

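Edit: this regex breaks on things like tabs, quotes, commas, brackets, etc. Also, the sample provided contains an occurrence that isn't followed by 5 words, so it won't match there.

Here is an updated regex that takes 5 words before the phrase and 1-5 words after it, treating a word as a run of 'non-space' characters separated by a 'non-word' character. It captures everything, including the search text, as one group: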

((\S+\W){5}foreign currency(\W\S+){1,5})
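
As a rough sketch of how the updated pattern might be used from Python (the sample sentence below is made up for illustration and is not from the asker's file; the text is assumed to be lowercased already, as in the question):

```python
import re

# Updated pattern from above: five "words" (runs of non-space characters,
# each followed by one non-word character) before the phrase, then one to
# five words after it. Group 1 is the whole context, phrase included.
pattern = re.compile(r'((\S+\W){5}foreign currency(\W\S+){1,5})')

# Hypothetical sample text, only for demonstration.
sample = ("the company reported that gains on foreign currency "
          "contracts were lower than expected this year")

contexts = [m.group(1) for m in pattern.finditer(sample)]
for context in contexts:
    print(context)
```

Note that this still requires a full five words before the phrase, so an occurrence near the very start of the text would be skipped.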

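Otherwise, you could try:

Join the text into a single line, with no line breaks
Use something = text.find('foreign currency') to find the first position of that phrase
Work backwards from there, character by character, looking for spaces, until you have counted 5 words
Work forwards from the end of the phrase, character by character, looking for spaces, until you have counted 5 words
Loop over all of this with something = text.find('foreign currency', previous_end_pos), which tells it to start looking after the end of the previous match, to find the next occurrence.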
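
The steps above could look roughly like the sketch below. It takes one shortcut: instead of counting spaces character by character, it uses str.split() on the text before and after each hit, which gives the same words with less bookkeeping. The function name phrase_contexts and the sample string are made up for illustration.

```python
def phrase_contexts(text, phrase, n=5):
    """Yield (before, after) word lists around each occurrence of phrase."""
    # Join everything into one line: collapse line breaks and repeated spaces.
    text = ' '.join(text.split())
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        before = text[:start].split()[-n:]  # last n words before the hit
        after = text[end:].split()[:n]      # first n words after the hit
        yield before, after
        # Resume the search after the end of the previous hit.
        start = text.find(phrase, end)


# Hypothetical sample text, only for demonstration.
sample = ("we hedge our exposure to foreign currency risk with forward "
          "contracts\nand foreign currency options")

results = list(phrase_contexts(sample, 'foreign currency'))
for before, after in results:
    print("%s <foreign currency> %s" % (' '.join(before), ' '.join(after)))
```

Be aware that find() is a plain substring search, so it would also hit the phrase inside longer words; if that matters, a regex with word boundaries is probably the safer choice.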

Answer 2

Have you considered using a variable for the number of words in the keyword, and iterating over the text that many items at a time?
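
One way to read that suggestion, as a minimal sketch: split the keyword into its own list of words, then slide a window of that length over the token list and compare slices. The helper name keyword_contexts and the sample tokens are assumptions for illustration; in the question, tokens would be the text list produced by nltk.regexp_tokenize.

```python
def keyword_contexts(tokens, keyword, n=5):
    """Return (before, match, after) for each place keyword occurs in tokens.

    tokens and keyword are both lists of words; n is the context size.
    """
    k = len(keyword)  # number of words in the keyword phrase
    results = []
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == keyword:
            before = tokens[max(0, i - n):i]    # up to n words before
            after = tokens[i + k:i + k + n]     # up to n words after
            results.append((before, tokens[i:i + k], after))
    return results


# Hypothetical token list, only for demonstration.
tokens = ("profits fell because of foreign currency losses "
          "in the third quarter").split()
keyword = 'foreign currency'.split()

for before, match, after in keyword_contexts(tokens, keyword):
    print("%s <%s> %s" % (' '.join(before), ' '.join(match), ' '.join(after)))
```

This keeps the list-of-words representation from the question and works unchanged for single-word keywords, since a one-item keyword list is just a window of length 1.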

Python: searching for words before and after a pair of keywords
