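I'm new to Python, but I'm trying to build a web scraper with BeautifulSoup. I have a spreadsheet with a list of names, which I use to generate a URL that takes me to a page containing a table of data. I then try to grab some of that data and fill in the spreadsheet with it. Using the developer tools in Chrome, I can see that the information I want sits under tr tags. With soup.select('tr') I am trying to build a list of those tags, which I can then iterate over to get the information I want.
However, this call produces an empty list every time. When I navigate to the URL my code generates, I land on the correct page of the site, where I can find the tags and information I'm interested in. But when I print(soup.prettify()), I get a heavily condensed version of the HTML that contains none of the tags or information I'm after.
Below I've posted the relevant part of my code, a snippet of the HTML I'm trying to get, and the condensed version I actually get. Sorry for the long post, but I sincerely appreciate any help.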
base_url = 'http://portal.vertnet.org/search?q=specificepithet:'
for x in range(1,list_length):
    genus = sheet.cell(row = x, column = 2).value
    epithet = sheet.cell(row = x, column = 3).value
    url = base_url + str(epithet) + '+genus:' + str(genus) + '+hastissue:1'
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('tr')
    print(len(table_rows))
    tot_entries = min(5, len(table_rows))
    ents = 0
    prev_museums = []
    while ents < tot_entries:
        for y in range(2, tot_entries+2):
            for x in len(table_rows):
                first_cell = soup.select('td')[0]
                museum = first_cell.getText()
                if museum not in prev_museums:
                    other_sheet.cell(row = x, column = y).value = first_cell
                    prev_museums += first_cell[0:5]
                    ents +=1
r.save('completetissuelist.xlsx')
I'm trying to capture the first td tag in each of several tr tags.
<tr>
    <!--
    <td>CUMV Mammal specimens 21200</td>
    -->
    <td> CUMV Mammal specimens 21200</td>
    <td>Mammalia: Sciurus carolinensis</td>
    <td> United States, New York, Tompkins County: Ithaca, 505 Hector Street</td>
    <td>Collector(s): Margaret Terrell; Preparator(s): Michi T. Schulenberg</td>
    <td>female</td>
    <!--<td> 2006</td>-->
    <td>2006-03-29</td>
    <td style="text-align:center">
        <span class="glyphicon glyphicon-map-marker"></span>
    </td>
    <td style="text-align:center"></td>
</tr>
Finally, here is what BeautifulSoup appears to be parsing, minus the disclaimers.
<body>
<div id="holder">
<div id="main-spinner">
</div>
<div id="header">
<!--
DISCLAIMER
-->
</div>
<div id="content">
</div>
<div id="footer">
<!--
DISCLAIMER
-->
<footer class="footer">
<div class="container">
<p>
VertNet | Funding by
<a href="http://nsf.gov" target="_blank">
<img src="https://www.nsf.gov/images/logos/nsf2.gif" width="30px"/>
</a>
</p>
</div>
</footer>
</div>
</div>
<script data-main="/js/main.js" src="/js/lib/require.js">
</script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-41203333-1', 'vertnet.org');
ga('send', 'pageview');
</script>
<script>
var $buoop = {c:2};
function $buo_f(){
var e = document.createElement("script");
e.src = "//browser-update.org/update.min.js";
document.body.appendChild(e);
};
try {document.addEventListener("DOMContentLoaded", $buo_f,false)}
catch(e){window.attachEvent("onload", $buo_f)}
</script>
</body>
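Again, sorry for the length, but I really appreciate any help I can get.
Answer 0 (score: 0)
The search results are loaded by an XHR POST request to the http://portal.vertnet.org/service/rpc/record.search endpoint, which is why they are missing from the HTML that requests downloads.
Mimic that request in your code and parse the JSON response (no HTML parsing involved):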
import json

import requests

specific_epiphet = "cedrorum"
genus = "Bombycilla"

url = 'http://portal.vertnet.org/service/rpc/record.search'
payload = {
    "limit": 100,
    "q": json.dumps(
        {"keywords": ["specificepithet:" + specific_epiphet, "genus:" + genus, "hastissue:1"]}
    )
}

res = requests.post(url,
                    json=payload,
                    headers={'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"})
data = res.json()

for item in data["items"]:
    item_data = json.loads(item["json"])
    print(item["id"] + " " + item_data["title"] + " " + item_data["scientificname"])
Prints:
amnh/birds/dot-15423 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15937 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15938 AMNH Bird Collection Bombycilla cedrorum
amnh/birds/dot-15939 AMNH Bird Collection Bombycilla cedrorum
...
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179106-seid-1065589 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179116-seid-928935 MVZ Bird Collection (Arctos) Bombycilla cedrorum
mvz/bird-specimens/http-arctos-database-museum-guid-mvz-bird-179307-seid-1242383 MVZ Bird Collection (Arctos) Bombycilla cedrorum
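To connect this back to the spreadsheet in the question, here is a rough sketch of how the payload could be built per species and how the first few distinct collections could be picked out of the response. The helper names build_payload and first_unique_titles are made up for illustration, and the sample dictionary is a hand-written stand-in for res.json() that only mirrors the "items", "json" and "title" fields used above. Treating "title" as the museum or collection name is an assumption based on the printed output.

```python
import json


def build_payload(epithet, genus, limit=100):
    """Build the request body in the same shape as the payload shown above."""
    keywords = ["specificepithet:" + epithet, "genus:" + genus, "hastissue:1"]
    return {"limit": limit, "q": json.dumps({"keywords": keywords})}


def first_unique_titles(data, max_entries=5):
    """Return up to max_entries distinct collection titles, in response order."""
    seen = []
    for item in data.get("items", []):
        record = json.loads(item["json"])  # each item carries a JSON string
        title = record.get("title")
        if title and title not in seen:
            seen.append(title)
        if len(seen) == max_entries:
            break
    return seen


# Hand-made stand-in for res.json(); a real response has many more fields.
sample = {
    "items": [
        {"id": "a", "json": json.dumps({"title": "AMNH Bird Collection"})},
        {"id": "b", "json": json.dumps({"title": "AMNH Bird Collection"})},
        {"id": "c", "json": json.dumps({"title": "MVZ Bird Collection (Arctos)"})},
    ]
}

print(build_payload("cedrorum", "Bombycilla"))
print(first_unique_titles(sample))
```

In the real loop you would pass res.json() instead of sample, then write each returned title into its own cell the same way the question does, for example other_sheet.cell(row=x, column=y).value = title.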