Question

I'm using BeautifulSoup. When I come across an <a href> tag, I want the whole tag line removed, but that isn't what happens.

For example, if I have:

<a href="/psf-landing/">
This is a test message
</a>

What I actually get is:

<a>
This is a test message
</a>

So how can I get just:

This is a test message

Here is my code:

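from bs4 import BeautifulSoup, Comment

# content_driver is assumed to hold the page HTML obtained earlier
soup = BeautifulSoup(content_driver, "html.parser")
# Strip HTML comments
for element in soup(string=lambda text: isinstance(text, Comment)):
    element.extract()
# This only deletes the href attribute; the <a> tag itself stays in the tree
for titles in soup.find_all('a'):
    del titles['href']
tree = soup.prettify()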

Answer 1

Try the .extract() method. In your code you are only deleting the attribute, not the tag:

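for titles in soup.find_all('a'):
    # .get() avoids a KeyError on <a> tags that have no href
    if titles.get('href') is not None:
        # Removes the tag together with the text inside it
        titles.extract()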
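
To make the difference concrete, here is a minimal sketch comparing attribute deletion with extract(). The sample HTML string and the variable names attr_removed and tag_extracted are made up for illustration; only the <a href="/psf-landing/"> tag comes from the question. Note that extract() also discards the link text, which may not be what the question is after.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup built around the link from the question
html = '<p>before <a href="/psf-landing/">This is a test message</a> after</p>'

# Deleting the attribute keeps the <a> tag itself in the tree.
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    del link["href"]
attr_removed = str(soup)

# extract() detaches the whole tag, including the text inside it.
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", href=True):
    link.extract()
tag_extracted = str(soup)

print(attr_removed)
print(tag_extracted)
```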

Answer 2

You can find a detailed example here: Dzone NLP examples

What you need is:

text = soup.get_text(strip=True)

Here is an example:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
# "html5lib" must be installed separately; "html.parser" ships with Python
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
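
The same call can be tried offline on the snippet from the question. This is only a sketch: it uses the built-in "html.parser" instead of html5lib, and the second sample string (two_paragraphs) is invented to show that strip=True joins text nodes with no separator unless one is passed.

```python
from bs4 import BeautifulSoup

html = '''
<a href="/psf-landing/">
This is a test message
</a>'''

soup = BeautifulSoup(html, "html.parser")
# strip=True trims each text node and drops the empty ones
text = soup.get_text(strip=True)
print(text)

# Hypothetical second sample: several text nodes in a row
two_paragraphs = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")
joined = two_paragraphs.get_text(strip=True)
spaced = two_paragraphs.get_text(separator=" ", strip=True)
print(joined)
print(spaced)
```

Keep in mind that get_text() returns the text of the entire document, so it fits when no markup needs to survive.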

Answer 3

You are looking for the unwrap() method. See the snippet below:

from bs4 import BeautifulSoup

html = '''
<a href="/psf-landing/">
This is a test message
</a>'''

soup = BeautifulSoup(html, 'html.parser')
for el in soup.find_all('a', href=True):
    el.unwrap()

print(soup)
# This is a test message

Passing href=True matches only tags that have an href attribute.
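
As a small illustration of that filter, the sketch below mixes a link that has an href with an anchor that does not. The sample markup and the name result are hypothetical; only find_all(..., href=True) and unwrap() come from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> with href, one without
html = '<p><a href="/psf-landing/">linked</a> and <a name="top">anchor</a></p>'

soup = BeautifulSoup(html, "html.parser")
for el in soup.find_all("a", href=True):
    # unwrap() removes the tag but leaves its contents in place
    el.unwrap()

result = str(soup)
print(result)
```

Only the first tag is unwrapped; the anchor without an href is left untouched.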

BeautifulSoup and removing the whole tag
