I need to extract all of the text plus the <a> tags from a page, but I'm not sure how to do it.
Here is what I have so far:
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html)  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'

cleaned = cleanMe(testhtml)
print(cleaned)
Output:
THIS IS AN EXAMPLE I need this text with this link captured.
The output I want:
THIS IS AN EXAMPLE I need this text with this <a href="http://example.com/">link</a> captured.
Answer 0: (score: 0)
Consider the following:
import re
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, 'html.parser')  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            repl = link.get_text()
            href = link.attrs['href']
            link.clear()               # remove the tag's contents
            link.attrs = {}            # drop every attribute except href
            link.attrs['href'] = href
            link.append(repl)          # put the plain link text back inside
            text = re.sub(repl + '(?!= *?</a>)', str(link), text, count=1)
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
What we added is:

for link in soup.find_all('a'):
    text = re.sub(link.get_text() + '(?!= *?</a>)', str(link), text, count=1)

For each anchor tag, this replaces the anchor's text (link) in the extracted text with the whole anchor itself. Note that the substitution is done only once, at the first occurrence of the link text.
The regex link.get_text() + '(?!= *?</a>)' makes sure we only substitute the link text where it has not already been replaced: (?!= *?</a>) is a negative lookahead that skips any occurrence of the link text that is already followed by an appended </a>.
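To make the attribute-stripping step concrete, here is a small standalone sketch (the sample markup is made up just for illustration) of what clear(), attrs and append() do to a single tag:

from bs4 import BeautifulSoup

snippet = 'this <a href="http://example.com/" class="call" id="x1">link</a> here'
soup = BeautifulSoup(snippet, 'html.parser')
link = soup.find('a')
repl = link.get_text()        # 'link'
href = link.attrs['href']
link.clear()                  # remove the tag's children (the text node)
link.attrs = {}               # drop class, id and every other attribute
link.attrs['href'] = href     # keep only href
link.append(repl)             # put the plain text back inside
print(str(link))              # should print: <a href="http://example.com/">link</a>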
But this is not the simplest way. The simplest way is to walk over each tag yourself and emit its text as you go.
See the working code here.
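For reference, a rough sketch of that simpler "walk the tree yourself" idea (the function name and details here are my own, not taken from the linked code) could look like this:

from bs4 import BeautifulSoup, NavigableString, Tag

def text_and_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(["script", "style"]):  # drop javascript and stylesheet code first
        tag.decompose()
    parts = []
    for node in soup.descendants:
        if isinstance(node, Tag) and node.name == 'a':
            parts.append(str(node))  # keep the whole anchor, markup included
        elif isinstance(node, NavigableString) and node.find_parent('a') is None:
            parts.append(str(node))  # plain text that does not live inside an <a>
    return ' '.join(part.strip() for part in parts if part.strip())

print(text_and_links(testhtml))  # using the testhtml from the question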
Answer 1: (score: 0)
Consider using a library other than BeautifulSoup. I use this one:
from bleach import clean

def strip_html(src, allowed=['a']):
    return clean(src, tags=allowed, strip=True, strip_comments=True)
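A quick usage sketch (the sample string below is made up for illustration; note that bleach strips disallowed tags but keeps the text inside them, so script/style bodies are not removed the way decompose() removes them):

html = 'I need this text with this <a href="http://example.com/">link</a> <b>captured</b>.'
print(strip_html(html))
# with tags=['a'] and strip=True this should keep the <a> markup intact (href is in
# bleach's default allowed attributes) and drop the <b> tags while keeping their text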