For example, from the list below I want to extract only Child1, Child2, and Child3, which sit after the first h3 and before the next h3 tag:
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
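Answer 0 (score: 2)
This should work. A regex slices out the markup between the first <h3> and the next one, and BeautifulSoup then parses only that slice.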
import re
from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)

html_doc = '<h3>HeaderName1</h3><ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li><li class="lev1">Child2</li><li class="lev1">Child3</li></ul> <h3>HeaderName2</h3><ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li><li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'

# Lazily match from the first <h3> up to the opening tag of the next <h3>.
m = re.search(r'<h3>.*?<h3>', html_doc, re.DOTALL)
s = m.start()
e = m.end() - len('<h3>')  # drop the trailing '<h3>' from the slice
target_html = html_doc[s:e]

new_bs = BeautifulSoup(target_html, 'html.parser')
ul_eles = new_bs.find_all('ul', attrs={'class': 'prodoplist'})
for ul_ele in ul_eles:
    # Search inside the current <ul> rather than the whole slice.
    li_eles = ul_ele.find_all('li', attrs={'class': 'lev1'})
    for li_ele in li_eles:
        print(li_ele.text)
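If you would rather not slice the HTML with a regex, the same selection can probably be done with BeautifulSoup's sibling navigation alone. This is only a sketch under a few assumptions: it uses BeautifulSoup 4 with the built-in 'html.parser', the lists are direct siblings of the h3 headers (as in the question), and the helper name children_after_first_h3 is made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li>'
    '<li class="lev1">Child1</li><li class="lev1">Child2</li>'
    '<li class="lev1">Child3</li></ul>'
    '<h3>HeaderName2</h3>'
    '<ul class="prodoplist"><li>Parent2</li>'
    '<li class="lev1">Child4</li><li class="lev1">Child5</li>'
    '<li class="lev1">Child6</li></ul>'
)


def children_after_first_h3(markup):
    """Hypothetical helper: text of li.lev1 items between the first <h3> and the next <h3>."""
    soup = BeautifulSoup(markup, 'html.parser')
    first_h3 = soup.find('h3')
    if first_h3 is None:
        return []
    names = []
    # Walk the tag siblings that follow the first <h3>; stop at the next <h3>.
    for sibling in first_h3.find_next_siblings(True):
        if sibling.name == 'h3':
            break
        if sibling.name == 'ul':
            names.extend(li.get_text() for li in sibling.find_all('li', class_='lev1'))
    return names


result = children_after_first_h3(html_doc)
print(result)
```

Because it stops at the next h3 instead of matching text, this variant does not depend on how the tags are spelled or spaced in the raw string.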
Answer 1 (score: 2)
Use findChildren, for example:
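# `soup` is a BeautifulSoup object built from the HTML in the question
for ul in soup.find_all('ul'):
    print('ul start')
    for idx, li in enumerate(ul.findChildren('li')):
        if idx < 3:  # keep only the first three <li> elements
            print(li)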
Output:
ul start
<li>Parent</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
ul start
<li>Parent2</li>
<li class="lev1">Child4</li>
<li class="lev1">Child5</li>
However, in most cases lxml with XPath is a better solution:
from lxml import html
doc = html.parse('input.html')
print([ul.xpath('li[1] | li[2] | li[3]') for ul in doc.xpath('//ul')])
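To get only the lev1 items under the first header, a single XPath expression might be enough. The sketch below assumes the markup from the question, parsed from a string with lxml.html.fromstring instead of a file, and assumes the first h3 in the document is also the first h3 among its siblings. The names EXPR and lev1_after_first_h3 are made up for illustration.

```python
from lxml import html

html_doc = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li>'
    '<li class="lev1">Child1</li><li class="lev1">Child2</li>'
    '<li class="lev1">Child3</li></ul>'
    '<h3>HeaderName2</h3>'
    '<ul class="prodoplist"><li>Parent2</li>'
    '<li class="lev1">Child4</li><li class="lev1">Child5</li>'
    '<li class="lev1">Child6</li></ul>'
)

# From the first <h3>, take following <ul> siblings that have exactly one
# <h3> before them (so they sit before the second <h3>), then li.lev1 text.
EXPR = (
    '(.//h3)[1]/following-sibling::ul[count(preceding-sibling::h3) = 1]'
    '/li[@class="lev1"]/text()'
)


def lev1_after_first_h3(markup):
    """Hypothetical helper: run EXPR against an HTML string and return plain strings."""
    root = html.fromstring(markup)
    return [str(text) for text in root.xpath(EXPR)]


result = lev1_after_first_h3(html_doc)
print(result)
```

The @class="lev1" test is an exact string match, so an item with several classes would need contains() or a similar test.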
Answer 2 (score: 1)
import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)

children = []
url = "http://someurl.html"
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
# Collect the text of every li.lev1 inside each ul.prodoplist.
for uls in bs.find_all('ul', 'prodoplist'):
    lis = uls.find_all('li', 'lev1')
    for li in lis:
        children.append(li.text)
print(children)