I'm trying to scrape TripAdvisor's website. I've tried two approaches: the first used CrawlSpider with rules, but I wasn't happy with the results, so I'm now using Selenium to walk through each link. The only problem is pagination. I want the Selenium browser to open the page, follow every link in the start URL, and then click "next" at the bottom. So far, the code I've written only extracts the required content:
# needed at the top of the spider module:
import re
import time
from selenium.common.exceptions import NoSuchElementException

self.driver.get(response.url)
div_val = self.driver.find_elements_by_xpath('//div[@class="tab_contents"]')
for link in div_val:
    l = link.find_element_by_tag_name('a').get_attribute('href')
    if re.match(r'http://www\.tripadvisor\.com/Hotels-g[\d]*-Dominican_Republic-Hotels\.html', l):
        link.click()
        time.sleep(5)
        try:
            hotel_links = self.driver.find_elements_by_xpath('//div[@class="listing_title"]')
            for hotel_link in hotel_links:
                lnk = hotel_link.find_element_by_class_name('property_title').get_attribute('href')
        except NoSuchElementException:
            print 'element not found'
I'm now stuck on the pagination part with Selenium.
Answer 0 (score: 1)
I think a mix of CrawlSpider and Selenium will work for you:
for click in range(0, 15):  # click the "next" button for pagination
    button = self.driver.find_element_by_xpath(
        "/html/body/div[3]/div[7]/div[2]/div[7]/div[2]/div[1]/div[3]/div[2]/div/div/div[41]/div[2]/div/a")
    button.click()
    time.sleep(10)
    for i in range(0, 10):  # range depends on the number of listings; change it as needed
        # enter each individual listing URL from the response
        item['url'] = response.xpath('//a[contains(@class, "property_title")]/@href').extract()[i]
        if item['url']:
            if 'http://' not in item['url']:
                item['url'] = urljoin(response.url, item['url'])
            yield scrapy.Request(item['url'],
                                 meta={'item': item},
                                 callback=self.anchor_page)
def anchor_page(self, response):
    old_item = response.request.meta['item']
    # ... extract the data you want to scrape into old_item ...
    yield old_item
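As a side note, the absolute XPath above is brittle and will break whenever TripAdvisor changes its layout. Below is a minimal sketch of a sturdier pagination loop using Selenium's explicit waits instead of fixed sleeps; the a.nav.next CSS locator and the paginate helper are assumptions based on TripAdvisor's markup at the time and need checking against the live page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def paginate(driver, max_pages=15):
    """Yield hotel listing hrefs from each results page, clicking "next" to advance."""
    for _ in range(max_pages):
        # collect the hotel links on the current page
        for title in driver.find_elements_by_css_selector('div.listing_title a'):
            yield title.get_attribute('href')
        try:
            # wait until the "next" button is clickable (locator is an assumption)
            next_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.nav.next')))
        except TimeoutException:
            break  # no more pages
        next_button.click()

The explicit wait replaces the fixed time.sleep(10), so the loop advances as soon as the button is ready and stops cleanly when no next page exists.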