Using Python and BeautifulSoup, select only text nodes not wrapped in <a>

Time: 2015-10-03 19:07:41

Tags: python beautifulsoup

I am trying to parse some text so that I can urlize (wrap in <a> tags) links that are not yet formatted. Here's some example text:

text = '<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">https://google.com</a>, and this is a link too but not formatted: https://google.com</p>'

Here's what I have so far, from here:

from django.utils.html import urlize
from bs4 import BeautifulSoup

...

def urlize_html(text):

    soup = BeautifulSoup(text, "html.parser")

    # Collect every text node in the document, including those inside <a> tags
    textNodes = soup.findAll(text=True)
    for textNode in textNodes:
        urlizedText = urlize(textNode)
        textNode.replaceWith(urlizedText)

    return str(soup)

But this also catches the middle link in the example, so it ends up double-wrapped in <a> tags. The result looks like this:

<p>This is a <a href="https://google.com">link</a>, this is also a link where the text is the same as the link: <a href="https://google.com">&lt;a href="https://google.com"&gt;https://google.com&lt;/a&gt;</a>, and this is a link too but not formatted: &lt;a href="https://google.com"&gt;https://google.com&lt;/a&gt;</p>

What can I do with textNodes = soup.findAll(text=True) so that it only includes text nodes that are not already wrapped in <a> tags?

1 Answer:

Answer 0: (score: 5)

Text nodes keep a reference to their parent, so you can simply test whether that parent is an a tag:

for textNode in textNodes:
    if textNode.parent and getattr(textNode.parent, 'name') == 'a':
        continue  # skip links
    urlizedText = urlize(textNode)
    textNode.replaceWith(urlizedText)
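
A fuller sketch of the same idea, under a few assumptions that are not in the original post. simple_urlize is a hypothetical regex stand-in for Django's urlize, used only so the example runs without Django. find_parent("a") replaces the direct-parent check so that text nested deeper inside a link (for example <a><b>...</b></a>) is skipped as well. The generated markup is also re-parsed before insertion, since inserting it as a plain string appears to be what produces the escaped &lt;a ...&gt; text in the question's output.

```python
import html
import re

from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for django.utils.html.urlize (an assumption made so
# the sketch runs without Django). It is naive: it only matches bare
# http(s) URLs and does not trim trailing punctuation.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def simple_urlize(text):
    """Return HTML with bare URLs wrapped in <a> tags; other text is escaped."""
    parts, pos = [], 0
    for match in URL_RE.finditer(text):
        parts.append(html.escape(text[pos:match.start()], quote=False))
        url = html.escape(match.group(0))
        parts.append('<a href="{0}">{0}</a>'.format(url))
        pos = match.end()
    parts.append(html.escape(text[pos:], quote=False))
    return "".join(parts)


def urlize_html(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # Copy the result list first, because the tree is modified in the loop.
    for text_node in list(soup.find_all(string=True)):
        # Skip comments and other special string types.
        if type(text_node) is not NavigableString:
            continue
        # Skip text that sits anywhere inside an existing <a> tag.
        if text_node.find_parent("a") is not None:
            continue
        urlized = simple_urlize(str(text_node))
        if "<a " not in urlized:
            continue  # no URL found, leave the node untouched
        # Parse the generated markup so it is inserted as tags, not as
        # escaped text, then move the parsed nodes into place.
        fragment = BeautifulSoup(urlized, "html.parser")
        for child in list(fragment.contents):
            text_node.insert_before(child)
        text_node.extract()
    return str(soup)
```

Calling urlize_html(text) on the example text from the question should leave the two existing links alone and wrap only the trailing bare URL. With the real urlize, the same loop should apply by swapping it in for simple_urlize.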