I want to scrape all the exhibitors from this page:
https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
but Scrapy doesn't load the content, so what I'm doing for now is loading the page with Selenium and searching the links with Scrapy:
from selenium import webdriver
from scrapy.http import TextResponse

url = 'https://greenbuildexpo.com/Attendee/Expohall/Exhibitors'
driver_1 = webdriver.Firefox()
driver_1.get(url)
content = driver_1.page_source
# wrap the rendered page in a TextResponse so Scrapy selectors can run on it
response = TextResponse(url=url, body=content, encoding='utf-8')
print len(set(response.xpath('//*[contains(@href,"Attendee/")]//@href').extract()))
The site doesn't seem to issue any new request when the "Next" button is pressed, so I was hoping to get all the links in one go, but I only get 43 links. There should be around 500.
Now I'm trying to scrape the remaining pages by pressing the "Next" button:
for i in range(10):
    # li[15] is assumed to be the "Next" control in the paging list
    xpath = '//*[@id="pagingNormalView"]/ul/li[15]'
    driver_1.find_element_by_xpath(xpath).click()
But I get an error:
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//*[@id=\"pagingNormalView\"]/ul/li[15]"}
Stacktrace:
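For reference, a NoSuchElementException here usually means the control hasn't rendered yet or the XPath is brittle; a minimal sketch with an explicit wait, assuming the same li[15] XPath really does match the "Next" control (the error above suggests it may not):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

xpath = '//*[@id="pagingNormalView"]/ul/li[15]'
for i in range(10):
    # wait up to 10 seconds for the paging control to become clickable
    next_button = WebDriverWait(driver_1, 10).until(
        EC.element_to_be_clickable((By.XPATH, xpath)))
    next_button.click()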
Answer 0 (score: 3)
You don't need selenium: there is an XHR request that returns all of the exhibitors. Simulate it; a demo from the Scrapy shell:
$ scrapy shell https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
In [1]: fetch("https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors")
2016-10-13 12:45:46 [scrapy] DEBUG: Crawled (200) <GET https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors> (referer: None)
In [2]: import json
In [3]: data = json.loads(response.body)
In [4]: len(data["Data"])
Out[4]: 541
# printing booth number for demonstration purposes
In [5]: for item in data["Data"]:
...: print(item["BoothNumber"])
...:
2309
2507
...
1243
2203
943
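Wired into a spider, the same request could look like this (a minimal sketch; the spider name and the yielded dict are illustrative, and only the "Data" and "BoothNumber" fields are confirmed by the shell session above):

import json
import scrapy

class ExhibitorsSpider(scrapy.Spider):
    name = 'exhibitors'  # hypothetical name
    # hit the JSON endpoint directly instead of the HTML page
    start_urls = ['https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors']

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['Data']:
            # "BoothNumber" is the only field shown above; each item
            # presumably carries more exhibitor details worth yielding
            yield {'booth_number': item['BoothNumber']}

Save it as exhibitors.py and run scrapy runspider exhibitors.py -o exhibitors.json; that should give all 541 items with no browser involved.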