Extracting the surrounding words from a string position in Python

Asked: 2015-05-07 15:55:07

Tags: python regex string search

Let's assume I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

I need to extract a few words before and a few words after each position. How can I do this with Python and regular expressions?

E.g.:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

So that:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thanks!

2 answers:

Answer 0 (score: 0)

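I first thought of using the word-boundary metacharacters, but that isn't quite right: they don't consume any of the string, and \B doesn't match exactly what I want anyway.

Instead, I suggest using the underlying definition of a word boundary, i.e. the boundary between \W and \w. On either side of the substring you are searching for, look for one or more word characters (\w) together with one or more non-word characters (\W), in the appropriate order, repeated as many times as you need.

For example: (?:\w+\W+){,3}some string(?:\W+\w+){,3}

This finds up to three words before and up to three words after "some string".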
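As a minimal sketch of how that pattern could be applied, the helper below builds it for an arbitrary phrase and word count. The name words_around and the sample sentence are made up for illustration, and {0,n} is spelled out instead of {,n}. Note that \w and \W treat tag names such as p or em as words, so this assumes the HTML has already been stripped (for example with BeautifulSoup, as in the question).

```python
import re

def words_around(text, phrase, n=3):
    """Return each occurrence of phrase with up to n words on either side."""
    # re.escape keeps any regex metacharacters in the phrase literal
    pattern = r"(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}" % (n, re.escape(phrase), n)
    return [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical sample text, already free of HTML tags
sample = "This article tells you how to write HTML where text is mixed."
print(words_around(sample, "tells you", 2))
# ['This article tells you how to']
```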

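Answer 1 (score: 0)

You want a "concordance" of your regex hits: say, two words before and after the place where your regex matched. The easiest way is to break your string at that point and anchor your searches to the end points of the pieces. For example, to get two words before and after index 263 (your first m.start()), you could do: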

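import re
pos = 263  # first m.start(); text is the string being searched
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:pos])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[pos:])
# m_right.end() is relative to text[pos:], so add pos back
print(text[m_left.start():pos + m_right.end()])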

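The first expression should be read backwards from the end of the string: it is anchored at the end ($), possibly skips a partial word if the split falls in the middle of a word (\S*), skips some whitespace (\s+), and then matches up to two ({,2}) whitespace-plus-word sequences (\s+\S+). It is "up to two" rather than exactly two because, if we reach the beginning of the string, we want to return a shorter match.

The second regex does the same thing in the opposite direction.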
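To illustrate the "up to two" point, here is a small sketch comparing the left-hand pattern with a variant that demands exactly two words. The prefixes are made-up strings standing in for text[:pos], and the names left_re and exact_re are illustrative only.

```python
import re

# Left-context pattern with "up to two" words, written as {0,2}
left_re = re.compile(r"(?:\s+\S+){0,2}\s+\S*$")
# Same pattern, but insisting on exactly two words
exact_re = re.compile(r"(?:\s+\S+){2}\s+\S*$")

long_prefix = "one two three four "   # plenty of words before the split
short_prefix = " four "               # only one word before the split

print(repr(left_re.search(long_prefix).group(0)))   # ' three four '
print(repr(left_re.search(short_prefix).group(0)))  # ' four '
print(exact_re.search(short_prefix))                # None
```

With the exact count, a hit near the beginning of the text produces no match at all, so the context would have to be handled as a special case.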

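For a concordance, you will probably want to start reading right after the end of the regex match rather than at its beginning. In that case, use m.end() as the start of the second string.

I think it's pretty obvious how to use this with a list of regex matches.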
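For completeness, one way this might look for every match is sketched below. It is a variant, not the exact patterns above: each context word is matched as word-then-whitespace on the left and whitespace-then-word on the right, so both patterns can match the empty string and a hit at the very start or end of the text still yields a result. The function name concordance and the sample text are assumptions for illustration.

```python
import re

def concordance(text, pattern, n=2):
    """Return each match of pattern with up to n whitespace-delimited words of context."""
    # Up to n "word + whitespace" runs, then an optional partial word, at the end
    left_re = re.compile(r"(?:\S+\s+){0,%d}\S*$" % n)
    # An optional partial word, then up to n "whitespace + word" runs, at the start
    right_re = re.compile(r"^\S*(?:\s+\S+){0,%d}" % n)
    results = []
    for m in re.finditer(pattern, text):
        left = left_re.search(text[:m.start()])
        right = right_re.search(text[m.end():])
        # right.end() is relative to text[m.end():], so offset it by m.end()
        results.append(text[left.start():m.end() + right.end()])
    return results

# Hypothetical plain-text input (HTML already stripped)
sample = ("This article tells you how to write. "
          "A companion article tells you how to use markup.")
for line in concordance(sample, "tells you"):
    print(line)
```

Both patterns split on whitespace only, so punctuation stays attached to the neighbouring word.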