How can I get the text for d2 and d3 in the following case?
import bs4
htmldoc = '''<html>
<div class="a">
<div class="b">abc def1</div>
<div class="c">123</div>
</div>
<div class="a">
<div class="b">abc def<sup>2</sup></div>
<div class="c">456</div>
</div>
<div class="a">
<div class="b">abc <!-- -->def3</div>
<div class="c">789</div>
</div>
</html>'''
soup = bs4.BeautifulSoup(htmldoc, 'html.parser')
d1 = soup.find('div', class_ = 'b', text = 'abc def1').findNext('div').text
print(d1)
d2 = soup.find('div', class_ = 'b', text = 'abc def2').findNext('div').text
print(d2)
d3 = soup.find('div', class_ = 'b', text = 'abc def3').findNext('div').text
print(d3)
This only works for d1; for d2 and d3 it raises an error...
Answer 0 (score: 0)
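This is because your text-matching rule does not apply to that text: the text argument is compared against the tag's .string, which is None when the div contains more than one child, as it does with the <sup> tag or the comment.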
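To make the failure concrete, here is a minimal sketch that inspects the three div elements from the question. The shortened html string and the variable names strings and texts are illustrative assumptions, not part of the original post.

```python
import bs4

# Shortened copy of the question's markup (only the class="b" divs).
html = '''
<div class="b">abc def1</div>
<div class="b">abc def<sup>2</sup></div>
<div class="b">abc <!-- -->def3</div>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_='b')

# .string is only set when the tag has a single child;
# the <sup> tag and the comment both add extra children.
strings = [div.string for div in divs]

# get_text() concatenates the text descendants and skips comments.
texts = [div.get_text() for div in divs]

print(strings)
print(texts)
```

Only the first div has a usable .string, which is why only d1 is found.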
My solution is more of a workaround, but you can try it anyway.
You should first use the unwrap and extract methods to remove the unwanted tags and comments:
import bs4
htmldoc = '''<html>
<div class="a">
<div class="b">abc def1</div>
<div class="c">123</div>
</div>
<div class="a">
<div class="b">abc def<sup>2</sup></div>
<div class="c">456</div>
</div>
<div class="a">
<div class="b">abc <!-- -->def3</div>
<div class="c">789</div>
</div>
</html>'''
def get_new_soup():
    soup = bs4.BeautifulSoup(htmldoc, 'html.parser')
    divs_b = soup.find_all('div', {'class': 'b'})
    for div in divs_b:
        # Remove unwanted tag (like <sup> here), keeping its text
        if div.sup:
            div.sup.unwrap()
        # Remove comments
        for element in div(text=lambda text: isinstance(text, bs4.Comment)):
            element.extract()
    # Re-parse so the text fragments left behind merge into a single string
    soup_str = str(soup)
    return bs4.BeautifulSoup(soup_str, 'html.parser')
soup = get_new_soup()
d1 = soup.find('div', class_ = 'b', text = 'abc def1').findNext('div').text
print(d1)
d2 = soup.find('div', class_ = 'b', text = 'abc def2').findNext('div').text
print(d2)
d3 = soup.find('div', class_ = 'b', text = 'abc def3').findNext('div').text
print(d3)
Output:
123
456
789
Edit:
As requested in the comments, the only way I see to get the data is:
import bs4
htmldoc = '''<html>
<div class="a">
<div class="b">abc def1</div>
<div class="c">123</div>
</div>
<div class="a">
<div class="b">abc def<sup>2</sup></div>
<div class="c">456</div>
</div>
<div class="a">
<div class="b">abc <!-- -->def3</div>
<div class="c">789</div>
</div>
</html>'''
def get_new_soup():
    soup = bs4.BeautifulSoup(htmldoc, 'html.parser')
    divs_b = soup.find_all('div', {'class': 'b'})
    for div in divs_b:
        # Remove comments
        for element in div(text=lambda text: isinstance(text, bs4.Comment)):
            element.extract()
    # Re-parse so the text fragments around the removed comment merge
    soup_str = str(soup)
    return bs4.BeautifulSoup(soup_str, 'html.parser')
soup = get_new_soup()
# Search strings include the inner markup, e.g. the <sup> tag
search_text = ['abc def1', 'abc def<sup>2</sup>', 'abc def3']
divs = soup.find_all('div', class_ = 'b')
for div in divs:
    # Rebuild the inner HTML of the div and compare it as a plain string
    content = ''.join(str(c) for c in div.contents)
    if content in search_text:
        print(div.findNext('div').text)
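The reason get_new_soup serializes the tree and parses it again can be sketched on a single div. This is an illustrative example; the variable names below are assumptions and do not come from the answer.

```python
import bs4

soup = bs4.BeautifulSoup('<div class="b">abc def<sup>2</sup></div>', 'html.parser')
div = soup.div

# unwrap() removes the <sup> tag but keeps its text as a separate node.
div.sup.unwrap()
children_after_unwrap = len(div.contents)
string_after_unwrap = div.string

# Serializing and re-parsing merges the adjacent text nodes into one string.
reparsed = bs4.BeautifulSoup(str(soup), 'html.parser')
string_after_reparse = reparsed.div.string

print(children_after_unwrap, string_after_unwrap, string_after_reparse)
```

Without the re-parse, the div still has two text children, so a text search for 'abc def2' would keep failing. Recent Beautiful Soup releases also provide a smooth() method on tags that merges adjacent strings in place, which may be an alternative to re-parsing.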
Answer 1 (score: 0)
I may be missing something, but doing this:
d1 = soup.find_all('div', class_ = 'c')
for i in d1:
    print(i.text)
gives the following output:
123
456
789
Is this what you are looking for?