I want to scrape the content of a website using the library called BeautifulSoup.
Code:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the page and parse the returned HTML
html_http_response = urlopen("http://www.airlinequality.com/airport-reviews/jeddah-airport/")
data = html_http_response.read()
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Output:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body style="margin:0px;height:100%">
<iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=9-57435048-0%200NNN%20RT%281512733380259%202%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&incident_id=466002040110357581-305794245507288265&edet=12&cinfo=04000000" width="100%">
Request unsuccessful. Incapsula incident ID: 466002040110357581-305794245507288265
</iframe>
</body>
</html>
The body contains an iFrame tag rather than the content that is shown when I inspect the page in the browser.
Answer 0 (score: 4)
This site uses cookies to validate requests. When you visit it for the first time you have to tick the I'm not a robot option; after that, the incap_ses_415_965359, PHPSESSID, visid_incap_965359, _ga and _gid values are passed in the request headers on every request. So I took those cookies from the Chrome dev tools and stored them in a dictionary.
from bs4 import BeautifulSoup
import requests

# Cookie values copied from Chrome DevTools; replace them with the values from your own browser session.
cookies = {
    'incap_ses_415_965359': 'djRha9OqhshstDcXvPV8cmHCBQGBKloAAAAAN3/D9dvoqwEc7GPEwefkhQ==',
    'PHPSESSID': 'fjmr7plc0dmocm8roq7togcp92',
    'visid_incap_965359': 'akteT8lDT1iyST7XJO7wdQGBKloAAAns;aAAQkIPAAAAAACAWbWAAQ6Ozzrln35KG6DhLXMRYnMjxOmY',
    '_ga': 'GA1.2.894579844.151uus2734989',
    '_gid': 'GA1.2.1055878562.1598994989',
}

# Send the cookies with the request so Incapsula treats it as an already-verified session
html_http_response = requests.get("http://www.airlinequality.com/airport-reviews/jeddah-airport", cookies=cookies)
data = html_http_response.text
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Grab the cookie values from your own browser and update them in the dictionary above.
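If you would rather not paste cookies by hand on every run, a requests.Session at least keeps any cookies the server sets across requests. This is a minimal sketch using the URL from the question; whether it alone satisfies the Incapsula check depends on the site:

from bs4 import BeautifulSoup
import requests

# A Session stores cookies returned by the server (e.g. PHPSESSID) and resends them automatically.
session = requests.Session()
# Plain python-requests user agents are often rejected, so mimic a browser.
session.headers.update({"User-Agent": "Mozilla/5.0"})

response = session.get("http://www.airlinequality.com/airport-reviews/jeddah-airport/")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())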
Answer 1 (score: 0)
The data you are looking for is not in the initial response because the page content is generated by JavaScript. You will have to learn the Selenium library to get at it (it is quite easy): the data you want is only created once the page is actually loaded in a real browser and you click, for example, the search button. (Remember that you first have to select the elements inside the iframe.) A sketch of that approach follows below.
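A minimal Selenium sketch along those lines, assuming Chrome and a matching chromedriver are installed; the wait condition and the iframe handling are illustrative rather than taken from the actual page:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # needs chromedriver on your PATH
driver.get("http://www.airlinequality.com/airport-reviews/jeddah-airport/")

# Wait until the browser has rendered the page (including JavaScript-generated content).
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

# If the content you need sits inside an iframe, switch into it first, e.g.:
# driver.switch_to.frame(driver.find_element(By.TAG_NAME, "iframe"))

# Hand the fully rendered HTML to BeautifulSoup as before.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()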