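I have an HTML document like the one below, and page_soup is a BeautifulSoup object. I am trying to scrape the data inside the list elements. The elements look like this: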
<ul class>
<li>
<a href="http://..." class=" ttip"> ...</a>
<ul class="name">
<li class="title ellipsis">
<span class="display-name ">
<a href="http://add_name" class=" ttip">Name</a>
</span>
</li>
<li class="job ellipsis">
"job A"
<span class="delimiter"> | <\span>
"job B"
<span class="delimiter"> | <\span>
"job C"
</li>
<li class="contribution ellipsis">
<span class="display-title">
<a href=add_title_A" class=" ttip">Contribution A</a>
<span class="year">(2000)</span>
</span>
<span class="delimiter"> | <\span>
<span class="display-title">
<a href=add_title_B" class=" ttip">Contribution B</a>
<span class="year">(2002)</span>
</span>
</li>
</ul>
</li>
<li>...
</li>
I need:
{'Name', 'add_name', 'job A', 'job B', 'job C', 'add_title_A', 'Contribution A', 'year', 'add_title_B', 'Contribution B', 'year'}
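I tried the code below to get "add_name", but the output comes back empty even though no errors are raised, so I am not sure the code is correct. I suspect the problem may be caused by the site's registration process (the site requires registration before you can search results). My main concern, though, is this: assuming registration is not the issue, how should I go on to get the remaining elements?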
# html parsing (page_html and uClient come from the earlier download step, not shown)
from bs4 import BeautifulSoup as soup
page_soup = soup(page_html, 'html.parser')
uClient.close()
for li_tag in page_soup.find_all('ul', {'class': 'name'}):
    for span_tag in li_tag.find_all('li', {'class': 'title ellipsis'}):
        spans = span_tag.find_all('span', {'class': 'display-name '})
        for span in spans:
            links = span.find_all('a')
            for link in links:
                print(link['href'])
Answer 0 (score: 0)
How about this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = r'''<ul class>
<li>
<a href="http://..." class=" ttip"> ...</a>
<ul class="name">
<li class="title ellipsis">
<span class="display-name ">
<a href="http://add_name" class=" ttip">Name</a>
</span>
</li>
<li class="job ellipsis">
"job A"
<span class="delimiter"> | <\span>
"job B"
<span class="delimiter"> | <\span>
"job C"
</li>
<li class="contribution ellipsis">
<span class="display-title">
<a href=add_title_A" class=" ttip">Contribution A</a>
<span class="year">(2000)</span>
</span>
<span class="delimiter"> | <\span>
<span class="display-title">
<a href=add_title_B" class=" ttip">Contribution B</a>
<span class="year">(2002)</span>
</span>
</li>
</ul>
</li>
'''
doc = SimplifiedDoc(html.replace(r'<\span>', '</span>'))  # change <\span> to </span>
ul = doc.getElement('ul', attr='class', value='name')
lis = ul.lis
# first <li>: the name link
a = lis[0].a
print(a.text, a.href)
# second <li>: the jobs, separated by '|'
print(lis[1].text.split('|'))
# third <li>: one span per contribution
spans = lis[2].spans
for span in spans:
    a = span.a
    if not a: continue  # skip the delimiter spans, which have no link
    print(a.text, a.href)
    s = span.span
    print(s['class'], s.text)
Result:
Name http://add_name
['"job A" ', ' "job B" ', ' "job C"']
Contribution A add_title_A
year (2000)
Contribution B add_title_B
year (2002)
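For comparison, here is a minimal sketch of the same extraction with BeautifulSoup itself, since that is what the question uses. It runs on a cleaned-up copy of the markup, which is an assumption on my part: `<\span>` is treated as `</span>` and the contribution hrefs are quoted. The names `records`, `jobs` and `contributions` are just illustrative. One thing that may explain the empty output in the question, assuming the page really is downloaded: the search uses 'display-name ' with a trailing space, while BeautifulSoup splits the class attribute into individual tokens, so searching for the bare token 'display-name' is the safer choice.

```python
from bs4 import BeautifulSoup, NavigableString

# Cleaned-up copy of the question's markup (assumption: <\span> was meant
# to be </span> and the contribution hrefs were meant to be quoted).
html = """<ul class="name">
<li class="title ellipsis">
<span class="display-name ">
<a href="http://add_name" class=" ttip">Name</a>
</span>
</li>
<li class="job ellipsis">
"job A"
<span class="delimiter"> | </span>
"job B"
<span class="delimiter"> | </span>
"job C"
</li>
<li class="contribution ellipsis">
<span class="display-title">
<a href="add_title_A" class=" ttip">Contribution A</a>
<span class="year">(2000)</span>
</span>
<span class="delimiter"> | </span>
<span class="display-title">
<a href="add_title_B" class=" ttip">Contribution B</a>
<span class="year">(2002)</span>
</span>
</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
records = []
for ul in soup.find_all("ul", class_="name"):
    # class_ matches a single class token, so no trailing space is used here.
    name_link = ul.find("span", class_="display-name").find("a")

    # The jobs are the bare text nodes directly under <li class="job">;
    # the delimiter text lives inside <span> tags and is skipped.
    job_li = ul.find("li", class_="job")
    jobs = [
        child.strip().strip('"')
        for child in job_li.children
        if isinstance(child, NavigableString) and child.strip()
    ]

    # Each contribution sits in its own <span class="display-title">.
    contributions = []
    for span in ul.find_all("span", class_="display-title"):
        link = span.find("a")
        year = span.find("span", class_="year")
        contributions.append(
            (link["href"], link.get_text(strip=True), year.get_text(strip=True))
        )

    records.append(
        {
            "name": name_link.get_text(strip=True),
            "href": name_link["href"],
            "jobs": jobs,
            "contributions": contributions,
        }
    )

print(records)
```

Each `<ul class="name">` becomes one dictionary, so the same loop should carry over to a page with many entries. If the real page is only served to logged-in users, the parsing may still find nothing, and that has to be solved separately from the parsing.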