Question

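I am scraping data from a site. I am in Russia, and when I open the URL from my normal IP address the page shows an error and no data. With a UK proxy, however, it works fine.

That is why I have to use a proxy when scraping, but I have run into a strange problem. When I open http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=1000 in a browser, it works (the page contains data). When I request the same URL from a script, the response is different.

For some reason my parser does not get the pages from http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=1000 onward the way I see them in the browser.

For example, here is the HTML of http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950 at the point where the difference begins:

In the browser (what I need):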

<div id="pagination">Page:<a class="instl confirm-nav previous" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900">« Previous</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=850">18</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900">19</a><span class="current_page">20</span><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1000">21</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1050">22</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1100">23</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1150">24</a><a class="instl confirm-nav next" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1000">Next »</a></div><div id="footer" class=""><p id="footer_nav" class="footer_nav">

The same place as seen by the parser (wrong):

</div><div id="pagination">Page:<a class="instl confirm-nav previous" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900" rel="nofollow">< Previous</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=850" rel="nofollow">18</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900" rel="nofollow">19</a><span class="current_page">20</span></div><div class="" id="footer"><p class="footer_nav" id="footer_nav">
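The two snippets differ in one visible way: the browser version has links to pages 21 to 24 and a link with the class "next", while the parser version stops at the current page. A minimal sketch of how a script could detect that it received the truncated listing is given below. It uses only the standard library's html.parser so that it runs without extra installs; the names PaginationProbe and has_next_link are made up for this illustration and are not part of the original code.

```python
from html.parser import HTMLParser


class PaginationProbe(HTMLParser):
    """Collects the class names of <a> tags inside <div id="pagination">."""

    def __init__(self):
        super().__init__()
        self._depth = 0         # <div> nesting depth inside the pagination block
        self.link_classes = []  # one list of class names per pagination link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pagination block, or track a nested <div> inside it.
            if self._depth or attrs.get("id") == "pagination":
                self._depth += 1
        elif tag == "a" and self._depth:
            self.link_classes.append((attrs.get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def has_next_link(html):
    """Return True if the pagination block contains a link with class 'next'."""
    probe = PaginationProbe()
    probe.feed(html)
    probe.close()
    return any("next" in classes for classes in probe.link_classes)
```

Calling has_next_link(req.text) after each request would let the scraper stop, or report a problem, as soon as the site starts returning the shortened listing instead of failing silently.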

I am on Windows 7, using Python 3 and BeautifulSoup.

Code:

from bs4 import BeautifulSoup
import requests

# UK proxy; with only an "http" key, only plain-HTTP requests go through it
proxy = {"http": "http://134.213.145.228:8080"}
# Browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page_url = 'http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950'
req = requests.get(page_url, proxies=proxy, headers=headers)
req.encoding = 'utf-8'
main = BeautifulSoup(req.text, 'html.parser')
# Collect every <a> tag with class "app_link"
profile_urls_tag = main.find_all('a', class_="app_link")

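Edit 1:

One interesting observation that may point to the cause: when I use the same proxy in Mozilla I can see only 20 pages, but in Chrome I can see 40.

Edit 2: The problem is solved. It seems I have to register and log in to see the full results.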
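Since the full listing appears to require a logged-in user, the script has to send the same login cookies the browser sends. A rough sketch of one way to do that with requests.Session follows. It assumes the cookies are copied by hand from a logged-in browser; the function name build_session and the cookie name in the usage comment are hypothetical placeholders, not values taken from the site. Registering the proxy for "https" as well is also an assumption that the proxy can tunnel HTTPS.

```python
import requests


def build_session(proxy_url, user_agent, browser_cookies):
    """Create a session that reuses the proxy, headers and login cookies."""
    session = requests.Session()
    # Route both schemes through the proxy so a redirect to https
    # does not silently bypass it.
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    session.headers.update({"User-Agent": user_agent})
    # browser_cookies is a plain dict of cookie name -> value copied from
    # a logged-in browser session (the names depend on the site).
    session.cookies.update(browser_cookies)
    return session


# Usage (placeholder cookie name and value):
# session = build_session("http://134.213.145.228:8080",
#                         "Mozilla/5.0 ...",
#                         {"example_login_cookie": "value-from-browser"})
# req = session.get(page_url)
```

A session also keeps any cookies the site sets during the crawl, which may explain why two browsers with different cookie state showed 20 and 40 pages through the same proxy.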

Scraper cannot get the last pages through a proxy

0 answers: