Regular expression that ignores HTML tags

Date: 2018-06-13 08:54:29

Tags: python html regex

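I need to match a regular expression against the text of an HTML document. Assume that 1) the HTML is well-formed, and 2) no '<' or '>' symbol appears except as part of an HTML tag. The problem is that I need the index of each match in the original HTML document, because I have to turn the matches into links in that document. This means I can't use Beautiful Soup or another parser to extract the text, because matches in the parsed result will have different indexes. I can't match on the raw HTML either, because some tags appear literally in the middle of a word and break the regex. I need a way to either: 1) map the match indexes in the parsed document to positions in the original document, or 2) have my regex ignore any tags and continue searching.

I'm using the Python re flavor. I've seen this question: skip over HTML tags in Regular Expression patterns, but it is different because the OP there wanted to ignore whitespace in the context of tags. The answers there didn't give me a solution.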

Here is a (very simplified) example. Is there a way to match:

r'(hello world)'

in the string:

string = "<p>hell</p>o world"

and have match.start() return 3?

Thanks!

1 answer:

Answer 0 (score: 0)

OK, I came up with a solution myself:

import re 

test_html = r'font></font><font face="Tahoma"><font size="4"> alleging that </font></font><font face="Tahoma"><font size="4">soldiers of the Uganda Peoples <span class="scayt-misspell" data-scayt_word="Defence" data-scaytid="32">Defence</span>'


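# Matches each run of text outside tags, including text before the first
# tag and after the last one. Assumes '<' and '>' occur only as tag delimiters.
NOT_TAG_REGEX = re.compile(r'(?:\A|(?<=>))[^<>]+(?=<|\Z)')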


def create_chunks(html: str):
    """
    Divides an html string into the
    text chunks between tags, while
    storing the start and end indexes of the text
    in both the original html string, and in the string
    that will be formed by concatenating the text in
    all the chunks.
    """
    matches = NOT_TAG_REGEX.finditer(html)

    text_cursor = 0
    chunks = []
    for match in matches:
        chunk = {
            "text": match.group(),
            "html_start": match.start(),
            "html_end": match.end(),
            "txt_start": text_cursor
        }
        text_cursor += match.end() - match.start()
        chunk["txt_end"] = text_cursor
        chunks.append(chunk)
    return chunks


def to_html_indx(txt_indx, chunks):
    """
    Given the index of a character in the string formed from the
    html text chunks, returns the index of that same character in
    the original html document.
    """
    for chunk in chunks:
        # Half-open range: an index on a chunk boundary belongs to the
        # chunk that starts there, not to the one that ends there.
        if chunk["txt_start"] <= txt_indx < chunk["txt_end"]:
            return txt_indx - chunk["txt_start"] + chunk["html_start"]
    raise ValueError(f"text index {txt_indx} is outside every chunk")


def main():
    chunks = create_chunks(test_html)
    text = "".join(chunk["text"] for chunk in chunks)
    print(text)
    example_regex = re.compile(r'that soldiers of')
    matches = example_regex.finditer(text)

    for match in matches:
        print("text match: " + match.group())
        txt_start = match.start()
        txt_end = match.end()
        html_start = to_html_indx(txt_start, chunks)
        # match.end() is exclusive, so map the last matched character
        # and step one past it (assumes a non-empty match).
        html_end = to_html_indx(txt_end - 1, chunks) + 1
        print("html match: " + test_html[html_start: html_end])

if __name__ == "__main__":
    main()

After printing the concatenated text, this produces:

text match: that soldiers of

html match: that </font></font><font face="Tahoma"><font size="4">soldiers of
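
For a plain literal search string, option 2 from the question (letting the regex skip over tags) can also be sketched directly. This is only a minimal sketch under the question's assumptions (well-formed HTML, '<' and '>' used only as tag delimiters); the names TAGS and tag_tolerant_pattern are made up for illustration, and the approach does not extend to arbitrary regex syntax, since it works character by character on an escaped literal.

```python
import re

# Zero or more complete tags; assumes '<' and '>' occur only as tag delimiters.
TAGS = r'(?:<[^<>]*>)*'


def tag_tolerant_pattern(literal: str) -> re.Pattern:
    """Compile a pattern that matches `literal` even when tags split its characters."""
    # Escape each character on its own, then allow tags between neighbours.
    return re.compile(TAGS.join(re.escape(ch) for ch in literal))


html = "<p>hell</p>o world"
match = tag_tolerant_pattern("hello world").search(html)
print(match.start(), match.end())  # 3 18
print(match.group())               # hell</p>o world
```

Because the pattern runs against the original HTML, match.start() and match.end() are already indexes into that document and no mapping step is needed. The trade-off is that the matched span still contains the interleaved tags, just like the chunk-based output above.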