The HTML looks like this:
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
I want to print the text of every div and, if a div contains a link, also extract the link text and its URL and show them in the output.
The expected output looks like this:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
I tried nesting a find_all('a') loop inside the find_all('div') loop, but it messed up my output.
Answer 0 (score: 1)
I don't know what your code looks like, but the basic idea is this:
data = soup.find_all('div')
for div in data:
    # Look for links only inside the current div
    links = div.find_all('a')
    for a in links:
        print(a['href'])
        print(a.text)
This prints the URL and the display text of each link.
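The loop above prints only the links. A minimal sketch of how it could be extended to also print each div's own text, as the question asks, is shown here. The function name `div_lines` is hypothetical, and the sketch assumes the `html.parser` backend and that the div's own text sits in its direct text children.

```python
from bs4 import BeautifulSoup

html = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""


def div_lines(markup):
    """Return div text, link text and href for every div, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for div in soup.find_all("div"):
        # Direct text children only, so the link text is not repeated here.
        own_text = "".join(div.find_all(string=True, recursive=False)).strip()
        lines.append(own_text)
        # Only <a> tags that actually carry an href attribute.
        for a in div.find_all("a", href=True):
            lines.append(a.get_text())
            lines.append(a["href"])
    return lines


print("\n".join(div_lines(html)))
```

Returning a list instead of printing inside the loop keeps the extraction separate from the display, which makes the result easier to check.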
Answer 1 (score: 1)
You can iterate over the divs and print the elements of each div's contents:
s = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a> .
</div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a> .
</div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
from bs4 import BeautifulSoup as soup
for _text, *_next in map(lambda x:x.contents, soup(s, 'html.parser').find_all('div')):
    print(_text)
    if _next:
        print(_next[0].text)
        print(_next[0]['href'])
Output:
Text of AAA
Display text of URL A
......AAA/url
Text of BBB
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
Answer 2 (score: 1)
from bs4 import BeautifulSoup
html="""
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="BBB">Text of BBB<a href="......BBB/url">Display text of URL B</a></div>
<div class="CCC">Text of CCC</div>
<div class="DDD">Text of DDD</div>
"""
soup = BeautifulSoup(html, "lxml")
for div in soup.findAll('div'):
    print(div.text)
    try:
        print(div.find('a').text)
        print(div.find('a')["href"])
    except AttributeError:
        pass
Output:
Text of AAADisplay text of URL A
Display text of URL A
......AAA/url
Text of BBBDisplay text of URL B
Display text of URL B
......BBB/url
Text of CCC
Text of DDD
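In the output above, div.text joins the div's own text and the link text into one line ("Text of AAADisplay text of URL A"). One possible way to avoid that, sketched here as an assumption rather than as part of this answer, is to iterate over `stripped_strings`, which yields each text node separately. The function name `lines_with_links` is hypothetical, and `html.parser` is used in place of lxml so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="AAA">Text of AAA<a href="......AAA/url">Display text of URL A</a></div>
<div class="CCC">Text of CCC</div>
"""


def lines_with_links(markup):
    """Collect every text node of each div, followed by the link URL if any."""
    soup = BeautifulSoup(markup, "html.parser")
    out = []
    for div in soup.find_all("div"):
        # stripped_strings yields each text node on its own, whitespace-trimmed,
        # so the div text and the link text arrive as two separate items.
        out.extend(div.stripped_strings)
        link = div.find("a")
        # An explicit check replaces the try/except AttributeError pattern.
        if link is not None and link.has_attr("href"):
            out.append(link["href"])
    return out


print("\n".join(lines_with_links(html)))
```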
Answer 3 (score: 0)
This is easier to read, and it also gives you the expected output:
divs = soup.find_all('div')
for div in divs:
    print(div.contents[0])  # Text of AAA
    link = div.find('a')
    if link:
        print(link.text)  # Display text of URL A
        print(link['href'])  # ......AAA/url
Answer 4 (score: 0)
Thanks, I worked out a solution:
for h in ans_kin:
    link = h.find('a')
    if link:
        # Append the URL to the element's text (no inner loop is needed)
        links = h.text + link.get('href')
    else:
        links = h.text
    answer_kin.append(links)
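The snippet above does not show where ans_kin and answer_kin come from. A self-contained sketch of the same idea follows, assuming ans_kin is the result of find_all('div') and answer_kin starts as an empty list. Both are guesses, and the function name `collect_answers` is hypothetical.

```python
from bs4 import BeautifulSoup


def collect_answers(markup):
    """Return one string per div: its text, plus the link URL when present."""
    soup = BeautifulSoup(markup, "html.parser")
    ans_kin = soup.find_all("div")  # assumption: the elements being scanned
    answer_kin = []                 # assumption: starts out empty
    for h in ans_kin:
        link = h.find("a")
        # link.get("href") returns None when the attribute is missing,
        # so this guard also avoids concatenating str + None.
        if link is not None and link.get("href"):
            answer_kin.append(h.get_text() + link["href"])
        else:
            answer_kin.append(h.get_text())
    return answer_kin


sample = '<div>Text of AAA<a href="x/url">Link A</a></div><div>Text of CCC</div>'
print(collect_answers(sample))
```

As in the original loop, the text and the URL are joined without a separator. Adding one (a space or a newline) would make the entries easier to read.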