Question

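I'm trying to find a string in a long text extracted from a PDF file, get the position of the string in the text, and then return the 100 words before it and the 100 words after it. The problem is that the extraction is not perfect, so I run into cases like this:

The query string is "test text"

The text might look like this:

This is atest textwith a problem

As you can see, the word "test" is joined to the letter "a", and the word "text" is joined to the word "with".

So the only function that works for me is __contains__, which doesn't return the position of the word.

Any ideas on how to find all occurrences of a word in text like this?

Thank you very much.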

Answer 1

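You haven't specified all of your requirements, but this works for your current problem. The program prints 9 and 42, which are the starting positions of the two occurrences of test text.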

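import re

# Compile the literal query once so it can be reused.
filt = re.compile("test text")

# finditer yields a match object for every non-overlapping occurrence.
for match in filt.finditer('This is atest textwith a problem. another test text'):
    print(match.start())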
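
The pattern above is hard-coded and requires the space between the two words. A minimal sketch of building the pattern from the query instead, assuming the extraction may have dropped the whitespace between the query's words; the helper name `find_positions` is hypothetical:

```python
import re

def find_positions(query, text):
    """Return the start index of every occurrence of query in text.

    Whitespace between the query's words is treated as optional, so
    'test text' also matches 'testtext' (an assumption about how the
    PDF extraction damaged the text).
    """
    # Escape each word so regex metacharacters in the query stay literal.
    pattern = r'\s*'.join(re.escape(word) for word in query.split())
    return [m.start() for m in re.finditer(pattern, text)]

print(find_positions('test text', 'This is atest textwith a problem. another testtext'))
# prints [9, 42]
```

Because the words are joined with `\s*`, the second occurrence is found even though its space is missing.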

Answer 2

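You could take the following approach. It first splits the whole text into words and records the index of each word.

Next, it iterates over the text looking for test text, with zero or more whitespace characters in between. For each match it records the start and end, then uses Python's bisect library to locate the corresponding entries in the words list and builds the lists of words found before and after that point.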

import bisect
import re

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"

# (start index, word) for every word in the text, in order of position.
words = [(w.start(), w.group(0)) for w in re.finditer(r'(\b\w+?\b)', test)]

adjacent_words = 2

for match in re.finditer(r'(test\s*?text)', test):
    start, end = match.span()

    # Index of the first word starting at or after the match start / match end.
    words_start = bisect.bisect_left(words, (start, ''))
    words_end = bisect.bisect_right(words, (end, ''))

    # Clamp at 0 so a match near the beginning does not give a negative slice index.
    first = max(words_start - adjacent_words, 0)
    last = words_end + adjacent_words

    words_before = [w for i, w in words[first:words_start]]
    words_after = [w for i, w in words[words_end:last]]

    # Adjacent words as a list
    print(words_before, match.group(0), words_after)

    # Or, surrounding text as is (up to the end of the text if too few words follow).
    text_end = words[last][0] if last < len(words) else len(test)
    print(test[words[first][0]:text_end])

    print()

So for this example, with 2 adjacent words, you would get the following output:

['cc', 'dd'] test text ['ee', 'ff']
cc dd test text ee ff 

['ff', 'gg'] testtext ['hh', 'ii']
ff gg testtextwith hh ii
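
If exact word offsets are not needed, a simpler alternative to the bisect lookup is to split the text on either side of each match. This is a sketch of a different technique from the one above, with a hypothetical helper name `context_words`; it assumes words are whitespace-separated chunks, so letters glued to the match (such as the `with` in `testtextwith`) show up as a neighbouring word.

```python
import re

def context_words(text, pattern, n=100):
    """Yield (words_before, matched_text, words_after) for each match."""
    for m in re.finditer(pattern, text):
        before = text[:m.start()].split()
        after = text[m.end():].split()
        # Keep the last n words before the match and the first n after it.
        yield before[max(len(before) - n, 0):], m.group(0), after[:n]

test = "aa bb cc dd test text ee ff gg testtextwith hh ii jj"
for before, found, after in context_words(test, r'test\s*text', n=2):
    print(before, found, after)
# prints:
# ['cc', 'dd'] test text ['ee', 'ff']
# ['ff', 'gg'] testtext ['with', 'hh']
```

This re-splits the text for every match, so for a very long document with many matches the precomputed word list with bisect is likely to scale better.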

Answer 3

If you are looking for the position of a piece of text within a string, you can use str.find().

>>> query_string = 'test text'
>>> text = 'This is atest textwith a problem'
>>> if query_string in text:
...     print(text.find(query_string))
...
9
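
Note that str.find() returns only the first match, or -1 when there is none. A minimal sketch of collecting every position by passing its optional start argument; the helper name `find_all` is hypothetical, and the match is exact, so an occurrence with the space missing would not be found:

```python
def find_all(text, query):
    """Return the start index of every occurrence of query in text."""
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found.
        start = text.find(query, start + 1)
    return positions

print(find_all('This is atest textwith a problem. another test text', 'test text'))
# prints [9, 42]
```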

Answer 4

You could look at the regex module, which allows "fuzzy" matching:

>>> import regex
>>> s='This is atest textwith a problem'
>>> regex.search(r'(?:text with){e<2}', s)
<regex.Match object; span=(14, 22), match='textwith', fuzzy_counts=(0, 0, 1)>
>>> regex.search(r'(?:test text){e<2}', s)
<regex.Match object; span=(8, 18), match='atest text', fuzzy_counts=(0, 1, 0)>

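You can match text that contains insertions, deletions, and other errors. The returned match object has the span (the start and end index) of the match.

You can use regex.findall to find all potential target matches.

Perfect for what you are describing.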

Find all occurrences of a string in imperfect text
