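Python web scraping: getting content from li and span tags

Time: 2019-12-28 05:46:15

Tags: python html web-scraping beautifulsoup

I have an HTML document like the one below, and page_soup is a BeautifulSoup object. I am trying to scrape the data inside the list elements. The elements look like this: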

<ul class>
<li>
    <a href="http://..." class=" ttip"> ...</a>
    <ul class="name">
        <li class="title ellipsis">
            <span class="display-name ">
                <a href="http://add_name" class=" ttip">Name</a>
            </span>
        </li>
        <li class="job ellipsis">
            "job A"
            <span class="delimiter"> | <\span>
            "job B"
            <span class="delimiter"> | <\span>
            "job C"
        </li>
        <li class="contribution ellipsis">
            <span class="display-title">
                <a href=add_title_A" class=" ttip">Contribution A</a>
                <span class="year">(2000)</span>
            </span>
            <span class="delimiter"> | <\span>
            <span class="display-title">
                <a href=add_title_B" class=" ttip">Contribution B</a>
                <span class="year">(2002)</span>
            </span>
        </li>
    </ul>
</li>


<li>...
</li>

I need

{'Name', 'add_name', 'job A', 'job B', 'job C', 'add_title_A', 'Contribution A', 'year', 'add_title_B', 'Contribution B', 'year'}

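I tried the code below to get "add_name", but the output is empty even though no error is raised, and I am not sure the code is correct. I suspect the problem may come from the site's registration process (the site requires registration before it shows search results). My main concern, though, is how to go on to extract the remaining elements, assuming registration is not the issue.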

# HTML parsing (assumes `from bs4 import BeautifulSoup as soup`; page_html and
# uClient come from the earlier request code, which is not shown here)
page_soup = soup(page_html, 'html.parser')
uClient.close()


for li_tag in page_soup.find_all('ul', {'class': 'name'}):
    for span_tag in li_tag.find_all('li', {'class': 'title ellipsis'}):
        spans = span_tag.find_all('span', {'class': 'display-name '})
        for span in spans:
            links = span.find_all('a')
            for link in links:
                print(link['href'])

1 answer:

Answer 0 (score: 0)

How about this:

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = r'''<ul class>
<li>
    <a href="http://..." class=" ttip"> ...</a>
    <ul class="name">
        <li class="title ellipsis">
            <span class="display-name ">
                <a href="http://add_name" class=" ttip">Name</a>
            </span>
        </li>
        <li class="job ellipsis">
            "job A"
            <span class="delimiter"> | <\span>
            "job B"
            <span class="delimiter"> | <\span>
            "job C"
        </li>
        <li class="contribution ellipsis">
            <span class="display-title">
                <a href=add_title_A" class=" ttip">Contribution A</a>
                <span class="year">(2000)</span>
            </span>
            <span class="delimiter"> | <\span>
            <span class="display-title">
                <a href=add_title_B" class=" ttip">Contribution B</a>
                <span class="year">(2002)</span>
            </span>
        </li>
    </ul>
</li>
'''
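# The sample uses <\span> where </span> was intended, so normalize it first
doc = SimplifiedDoc(html.replace(r'<\span>', '</span>'))
ul = doc.getElement('ul', attr='class', value='name')
lis = ul.lis

# First <li>: the name link
a = lis[0].a
print(a.text, a.href)

# Second <li>: the jobs, separated by '|'
print(lis[1].text.split('|'))

# Third <li>: each contribution link and its year
spans = lis[2].spans
for span in spans:
    a = span.a
    if not a:
        continue  # skip spans without a link (the delimiters, for example)
    print(a.text, a.href)
    s = span.span
    print(s['class'], s.text)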

Result:

Name http://add_name
['"job A" ', ' "job B" ', ' "job C"']
Contribution A add_title_A
year (2000)
Contribution B add_title_B
year (2002)
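
If you would rather stay with BeautifulSoup, as in the question, here is a minimal sketch of the same extraction. It rests on a few assumptions: the sample markup is cleaned up first (`<\span>` written as `</span>`, and the missing opening quote on the contribution `href` values restored), and the names `records`, `jobs`, and `contributions` are only illustrative. One likely reason the question's loop prints nothing is the lookup `{'class': 'display-name '}`: BeautifulSoup splits the `class` attribute on whitespace, so a value with a trailing space does not match, while the plain class name `display-name` does.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, cleaned up as described above (assumption).
html = '''
<ul class="name">
    <li class="title ellipsis">
        <span class="display-name ">
            <a href="http://add_name" class=" ttip">Name</a>
        </span>
    </li>
    <li class="job ellipsis">
        "job A"
        <span class="delimiter"> | </span>
        "job B"
        <span class="delimiter"> | </span>
        "job C"
    </li>
    <li class="contribution ellipsis">
        <span class="display-title">
            <a href="add_title_A" class=" ttip">Contribution A</a>
            <span class="year">(2000)</span>
        </span>
        <span class="delimiter"> | </span>
        <span class="display-title">
            <a href="add_title_B" class=" ttip">Contribution B</a>
            <span class="year">(2002)</span>
        </span>
    </li>
</ul>
'''

page_soup = BeautifulSoup(html, 'html.parser')

records = []
for ul in page_soup.find_all('ul', class_='name'):
    # Name link: match on a single class name, without surrounding whitespace.
    name_link = ul.find('li', class_='title').find('a')

    # Jobs: take the text of the <li>, split on '|', then strip the
    # whitespace and the literal double quotes around each job.
    job_li = ul.find('li', class_='job')
    jobs = [part.strip().strip('"') for part in job_li.get_text().split('|')]

    # Contributions: one (href, title, year) tuple per display-title span.
    contributions = []
    contribution_li = ul.find('li', class_='contribution')
    for span in contribution_li.find_all('span', class_='display-title'):
        link = span.find('a')
        year = span.find('span', class_='year')
        contributions.append(
            (link['href'], link.get_text(strip=True), year.get_text(strip=True))
        )

    records.append({
        'name': name_link.get_text(strip=True),
        'name_href': name_link['href'],
        'jobs': jobs,
        'contributions': contributions,
    })

print(records)
```

Matching on one class name at a time with `class_` keeps the lookups independent of the order and spacing of the values in the `class` attribute. On the real site, any of the `find` calls may return `None` if the page served without logging in lacks these elements, so it is worth checking for that before chaining further calls.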