Question

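I have a string in which I want to search for a keyword or phrase and return only part of the text before and after that keyword or phrase. What Google does in its search results is exactly what I mean.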

Here is a string I grabbed from the web:

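“This filter truncates words like the original truncate words Django filter, but instead of being based on the number of words, it's based on the number of characters. I found the need for this when building a website where i'd have to show labels on really small text boxes and truncating by words didn't always give me the best results (and truncating by character is...well...not that elegant).”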

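Now let's say I want to search this string for the phrase building a website and then output something like this:

“... the need for this when building a website where i'd have to show ...”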

Edit: I should have been more explicit about this. It has to work for multiple strings/phrases, not just this one.

Answer 1

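Use a method that gets the index of the desired phrase, then slice the string up to N characters before and after that index. You can get fancy by looking for the whitespace nearest to N characters out on each side of that index, so that you end up with whole words.

Python's string functions can find the exact string you need: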

http://docs.python.org/py3k/library/strings.html
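
The index-and-slice idea above could be sketched roughly like this. The function name `snippet` and the `width` parameter are made up for illustration, and snapping to a plain space character (rather than any whitespace) is a simplifying assumption:

```python
def snippet(text, phrase, width=25):
    """Return phrase with roughly `width` characters of context per side, or None."""
    start = text.find(phrase)  # str.find returns -1 when the phrase is absent
    if start == -1:
        return None
    end = start + len(phrase)

    # Tentative cut points, clamped to the bounds of the string.
    left = max(0, start - width)
    right = min(len(text), end + width)

    # If the left cut lands mid-word, move it forward to just after the next space.
    if left > 0 and text[left - 1] != " ":
        space = text.find(" ", left, start)
        left = space + 1 if space != -1 else start

    # If the right cut lands mid-word, move it back to the previous space.
    if right < len(text) and text[right] != " ":
        space = text.rfind(" ", end, right)
        right = space if space != -1 else end

    # Simplification: the ellipses are always added, even at the ends of the text.
    return "..." + text[left:right] + "..."
```

Since the phrase is just an argument, the same function can be called in a loop for several phrases. It only reports the first occurrence; finding all of them would need repeated calls to `str.find` with a start offset.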

Answer 2

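Building on the other answers (especially cababunga's), I like a function that takes up to 25 (or however many) characters of context, stops at the last word boundary, and gives you a nice match: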

import re

def find_with_context(haystack, needle, context_length, escape=True):
    if escape:
        needle = re.escape(needle)
    return re.findall(r'\b(.{,%d})\b(%s)\b(.{,%d})\b' % (context_length, needle, context_length), haystack)

# Returns a list of three-tuples, (context before, match, context after).

Usage:

>>> find_with_context(s, 'building a website', 25)
[(' the need for this when ', 'building a website', " where i'd have to show ")]
>>> # Compare this to what it would be without making sure it ends at word boundaries:
... # [('d the need for this when ', 'building a website', " where i'd have to show l")]
...
>>> for match in find_with_context(s, 'building a website', 25):
...     print('<p>...%s<strong>%s</strong>%s...</p>' % match)
... 
<p>... the need for this when <strong>building a website</strong> where i'd have to show ...</p>

Answer 3

>>> re.search(r'((?:\S+\s+){,5}\bbuilding a website\b(?:\s+\S+){,5})', s).groups()
("the need for this when building a website where i'd have to show",)
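
To make the word-based pattern above work for any phrase, it could be wrapped in a small function along these lines. The name `words_around` and the `n_words` parameter are hypothetical, and the sketch assumes the phrase starts and ends with a word character so that `\b` can match on both sides:

```python
import re

def words_around(text, phrase, n_words=5):
    """Return the first match of phrase with up to n_words words per side, or None."""
    # re.escape keeps any regex metacharacters in the phrase literal.
    pattern = r'(?:\S+\s+){0,%d}\b%s\b(?:\s+\S+){0,%d}' % (
        n_words, re.escape(phrase), n_words)
    mo = re.search(pattern, text)
    return mo.group(0) if mo else None
```

For several phrases you would call it once per phrase, e.g. `[words_around(s, p) for p in phrases]`, and skip the `None` results.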

Answer 4

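Something like this, maybe:

import re
# .{25} needs exactly 25 characters on each side, so a phrase closer than
# that to either end of the text will not match.
mo = re.search(r"(.{25})\bbuilding a website\b(.{25})", text)
if mo:
    print(mo.group(1), "<b>building a website</b>", mo.group(2))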
