Python, Scrapy: scrape links, then iterate over those links to scrape more links

Time: 2015-05-31 14:10:17

Tags: javascript python html selenium scrapy

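I'm trying to create a Scrapy spider focused on a site called weedmaps.com. Weedmaps uses the Google Maps API to generate information about dispensaries based on state and regional location. Ultimately, I want the spider to start at the top level, dive into each state, scrape the region links within that state, enter the region links one at a time, scrape the dispensary links, and then enter the dispensary links one at a time to scrape specific information about each individual dispensary. Since the site is dynamic, I've been using Selenium to render the JavaScript. Thanks to help from this site, I've been able to scrape the region links and the dispensary links separately. When I try to combine the two, the spider collects only the first region link, goes straight into that region link to collect dispensary information, and then finishes. I'd also like to build a list of region links and a list of dispensary links that are populated as the spider runs.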

Any insight into how to accomplish this would be great. Below is the code I have so far; I can't seem to figure out the problem. Thanks in advance!

import scrapy
import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from scrapybot import __init__

class scrapybot_spider(scrapy.Spider):
    name = "scrapybot_spider"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['weedmaps.com']
    regionlinks = []
    dispensarylinks = []
    start_urls = ["https://weedmaps.com/dispensaries/in/united-states/colorado"]

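    # Initialize the Selenium webdriver via Firefox
    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding the browser
        super(scrapybot_spider, self).__init__(*args, **kwargs)
        self.browser = webdriver.Firefox()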

    # Scrape the region links within a state
    def parse(self, response):
        self.browser.get('https://weedmaps.com/dispensaries/in/united-states/colorado')
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.subregion a.region-ajax-link")))
        for region in self.browser.find_elements(By.CSS_SELECTOR, "div.subregion a.region-ajax-link"):
            time.sleep(5)
            region = region.get_attribute("href")
            self.regionlinks.append(region)
            print(self.regionlinks)


    # Scrape the dispensary links within the region links from above
    def dispensaryparse(self, response):
        # NOTE: no module-level `region` is defined, and parse() never schedules this method
        global region
        self.browser.get(region)
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.dispensary div.name a")))
        for dispensary in self.browser.find_elements(By.CSS_SELECTOR, "div.dispensary div.name a"):
            dispensary = dispensary.get_attribute("href")
            self.dispensarylinks.append(dispensary)
            print(dispensary)
            # NOTE: returning inside the loop ends the method after the first link
            return dispensary
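
The traversal described above (collect every region link, then visit each region for its dispensary links) can be sketched independently of Scrapy and Selenium. In this minimal sketch, `crawl`, `fetch_links`, `fake_fetch_links`, `FAKE_SITE`, and the example.com URLs are hypothetical names introduced for illustration; only the two CSS selectors come from the spider. `fetch_links(url, css_selector)` stands in for the Selenium wait-and-find code and is assumed to return a list of `href` strings.

```python
from typing import Callable, Dict, List

# CSS selectors taken from the spider in the question.
REGION_SELECTOR = "div.subregion a.region-ajax-link"
DISPENSARY_SELECTOR = "div.dispensary div.name a"


def crawl(start_url: str,
          fetch_links: Callable[[str, str], List[str]]) -> Dict[str, List[str]]:
    """Collect every region link first, then visit each region in turn.

    fetch_links(url, css_selector) is an assumed helper that loads `url`
    and returns the href of every element matching `css_selector`.
    """
    # Step 1: gather ALL region links as plain strings before navigating
    # anywhere else, so leaving the page cannot invalidate them.
    region_links = list(fetch_links(start_url, REGION_SELECTOR))

    # Step 2: visit each region and accumulate its dispensary links.
    # Nothing returns inside the loops, so every region is processed.
    dispensary_links = []
    seen = set()
    for region_url in region_links:
        for link in fetch_links(region_url, DISPENSARY_SELECTOR):
            if link not in seen:  # skip links listed under several regions
                seen.add(link)
                dispensary_links.append(link)

    return {"regions": region_links, "dispensaries": dispensary_links}


# Hypothetical stand-in for the Selenium-backed helper, using made-up pages.
FAKE_SITE = {
    ("https://example.com/colorado", REGION_SELECTOR): [
        "https://example.com/denver", "https://example.com/boulder"],
    ("https://example.com/denver", DISPENSARY_SELECTOR): [
        "https://example.com/d/one", "https://example.com/d/two"],
    ("https://example.com/boulder", DISPENSARY_SELECTOR): [
        "https://example.com/d/two", "https://example.com/d/three"],
}


def fake_fetch_links(url: str, css_selector: str) -> List[str]:
    return FAKE_SITE.get((url, css_selector), [])


result = crawl("https://example.com/colorado", fake_fetch_links)
print(result["regions"])
print(result["dispensaries"])
```

In a Scrapy spider the same shape is usually expressed by yielding one `scrapy.Request(url, callback=self.dispensaryparse)` per collected region link from `parse`, instead of returning from inside the loop.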

0 answers:

No answers