Unable to use BeautifulSoup

Posted: 2017-07-11 22:52:06

Tags: python beautifulsoup bs4

I'm new to Python, but I'm trying to use BeautifulSoup to build a web scraper. I have a spreadsheet with a list of names that I use to generate a URL, which takes me to a site with a table of data. I then try to grab some of that data and fill the spreadsheet with it. Using the developer tools in Chrome, I can see that the information I want sits under <tr> tags. With soup.select('tr') I'm trying to build a list of those tags that I can then iterate over to pull out the information I want.

However, this call returns an empty list every time. When I navigate to the URL my code generates, I'm taken to the correct page on the site, where I can find the tags and the information I'm interested in. But when I print(soup.prettify()), I get a very stripped-down version of the HTML that contains none of the tags or information I'm after.
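
For reference, this reproduces the symptom on its own (a sketch only; the URL is just one example of the kind of query my loop builds, using the Sciurus carolinensis record shown further down):

import requests

url = ('http://portal.vertnet.org/search?q=specificepithet:carolinensis'
       '+genus:Sciurus+hastissue:1')
res = requests.get(url)
res.raise_for_status()

# If this prints False, the row markup is simply not in the HTML the
# server sends back, which is why soup.select('tr') finds nothing.
print('<tr' in res.text)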

Below I've posted the relevant part of my code, a snippet of the HTML I'm trying to get at, and the stripped-down version I actually receive. Sorry for the long post, but I sincerely appreciate any help.

import bs4
import requests
import openpyxl

# wb, sheet, other_sheet and list_length come from the spreadsheet setup
# (not shown here): wb = openpyxl.load_workbook(...), etc.

base_url = 'http://portal.vertnet.org/search?q=specificepithet:'
for x in range(1, list_length):
    genus = sheet.cell(row=x, column=2).value
    epithet = sheet.cell(row=x, column=3).value
    url = base_url + str(epithet) + '+genus:' + str(genus) + '+hastissue:1'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('tr')   # always comes back empty
    print(len(table_rows))
    tot_entries = min(5, len(table_rows))
    ents = 0
    prev_museums = []
    while ents < tot_entries:
        for y in range(2, tot_entries + 2):
            for row in table_rows:
                first_cell = row.select('td')[0]   # first cell of each row
                museum = first_cell.getText()
                if museum not in prev_museums:
                    other_sheet.cell(row=x, column=y).value = museum
                    prev_museums.append(museum)
                    ents += 1
wb.save('completetissuelist.xlsx')

What I'm trying to capture is the first td tag from within each of a number of tr tags.

<tr>

<!--
<td>CUMV Mammal specimens 21200</td>
-->
<td> CUMV Mammal specimens 21200</td>
<td>Mammalia: Sciurus carolinensis</td>
<td> United States, New York, Tompkins County: Ithaca, 505 Hector Street</td>
<td>Collector(s): Margaret Terrell; Preparator(s): Michi T. Schulenberg</td>
<td>female</td>
<!--<td> 2006</td>-->
<td>2006-03-29</td>
<td style="text-align:center">
        <span class="glyphicon glyphicon-map-marker"></span>
    </td>   
<td style="text-align:center"></td> </tr>

Finally, here is what BeautifulSoup actually seems to be parsing, minus the disclaimers.

 <body>
  <div id="holder">
   <div id="main-spinner">
   </div>
   <div id="header">
    <!-- 
DISCLAIMER
-->
   </div>
   <div id="content">
   </div>
   <div id="footer">
    <!-- 
  DISCLAIMER
-->
    <footer class="footer">
     <div class="container">
      <p>
       VertNet | Funding by
       <a href="http://nsf.gov" target="_blank">
        <img src="https://www.nsf.gov/images/logos/nsf2.gif" width="30px"/>
       </a>
      </p>
     </div>
    </footer>
   </div>
  </div>
  <script data-main="/js/main.js" src="/js/lib/require.js">
  </script>
  <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
      ga('create', 'UA-41203333-1', 'vertnet.org');
      ga('send', 'pageview');
  </script>
  <script>
   var $buoop = {c:2}; 
    function $buo_f(){ 
     var e = document.createElement("script"); 
     e.src = "//browser-update.org/update.min.js"; 
     document.body.appendChild(e);
    };
    try {document.addEventListener("DOMContentLoaded", $buo_f,false)}
    catch(e){window.attachEvent("onload", $buo_f)}
  </script>
 </body>

Again, sorry for the length, but I really appreciate any help I can get.

1 Answer:

Answer 0 (score: 0)

The search results are loaded via an XHR POST request to the http://portal.vertnet.org/service/rpc/record.search endpoint. Mimic that request in your code and parse the JSON response (no HTML parsing involved):

import json
import requests


specific_epithet = "cedrorum"
genus = "Bombycilla"
url = 'http://portal.vertnet.org/service/rpc/record.search'
payload = {
    "limit": 100,
    "q": json.dumps(
        {"keywords": ["specificepithet:" + specific_epiphet, "genus:" + genus, "hastissue:1"]}
    )
}

res = requests.post(url,
                    json=payload,
                    headers={'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"})

data = res.json()
for item in data["items"]:
    item_data = json.loads(item["json"])
    print(item["id"] + " " + item_data["title"] + " " + item_data["scientificname"])

Prints:

amnh/birds/dot-15423 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15937 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15938 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15939 AMNH Bird Collection Bombycilla cedrorum
...
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179106-seid-1065589 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179116-seid-928935 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179307-seid-1242383 MVZ Bird Collection (Arctos) Bombycilla cedrorum
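
If you want to feed this back into the spreadsheet workflow from the question, a rough sketch with openpyxl could look like the following. The workbook file and sheet names here are placeholders; the column layout mirrors the question (genus in column 2, specific epithet in column 3), and the "title" field is the collection name seen in the printed output above.

import json
import openpyxl
import requests

URL = 'http://portal.vertnet.org/service/rpc/record.search'
UA = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36")

wb = openpyxl.load_workbook('tissuelist.xlsx')   # placeholder file name
sheet = wb['names']                              # placeholder sheet names
other_sheet = wb['results']

for row in range(1, sheet.max_row + 1):
    genus = sheet.cell(row=row, column=2).value
    epithet = sheet.cell(row=row, column=3).value
    payload = {
        "limit": 5,   # mirrors the min(5, ...) cap in the question
        "q": json.dumps({"keywords": [
            "specificepithet:" + str(epithet),
            "genus:" + str(genus),
            "hastissue:1",
        ]})
    }
    res = requests.post(URL, json=payload, headers={'User-Agent': UA})
    res.raise_for_status()
    # "title" holds the collection name (e.g. "AMNH Bird Collection"),
    # roughly what the first table cell shows on the web page.
    for col, item in enumerate(res.json().get("items", []), start=2):
        other_sheet.cell(row=row, column=col).value = json.loads(item["json"])["title"]

wb.save('completetissuelist.xlsx')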