I'm trying to build a Scrapy spider focused on a site called weedmaps.com. Weedmaps uses the Google Maps API to generate dispensary information based on state and regional location data. What I ultimately want the spider to do is start at the top level, drill down into the states, scrape the region links within those states, follow the region links one at a time to scrape the dispensary links, and then follow the dispensary links one at a time to scrape specific details about each individual dispensary. Since the site is dynamic, I've been using Selenium to interpret the JavaScript. Thanks to help from this site, I've been able to scrape the region links and the dispensary links separately. But when I try to combine the two, the spider only collects the first region link, then goes straight into that region link to collect the dispensary information, and then ends. I'd also like to build a list of region links, plus a list of dispensary links, that gets populated as the spider runs.
Any insight into how to accomplish this would be great. My code so far is below; I can't seem to figure this out. Thanks in advance!
import scrapy
import time
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class scrapybot_spider(scrapy.Spider):
    name = "scrapybot_spider"
    allowed_domains = ['weedmaps.com']  # domain only, no scheme
    regionlinks = []
    dispensarylinks = []
    start_urls = ["https://weedmaps.com/dispensaries/in/united-states/colorado"]

    # initialize the Selenium webdriver via Firefox
    def __init__(self):
        self.browser = webdriver.Firefox()

    # scraping regional links in states
    def parse(self, response):
        self.browser.get('https://weedmaps.com/dispensaries/in/united-states/colorado')
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "div.subregion a.region-ajax-link")))
        for region in self.browser.find_elements_by_css_selector(
                "div.subregion a.region-ajax-link"):
            time.sleep(5)
            region = region.get_attribute("href")
            self.regionlinks.append(region)
        print(self.regionlinks)

    # scraping dispensary links within regional links from above
    def dispensaryparse(self, response):
        global region  # this doesn't actually share `region` between methods
        self.browser.get(region)
        wait = WebDriverWait(self.browser, 10)
        wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "div.dispensary div.name a")))
        for dispensary in self.browser.find_elements_by_css_selector(
                "div.dispensary div.name a"):
            dispensary = dispensary.get_attribute("href")
            self.dispensarylinks.append(dispensary)
            print(dispensary)
        return dispensary
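To make the control flow I'm after clearer, here is a minimal, framework-free simulation of the chaining I think I need: `parse` should schedule one follow-up request per region (instead of driving the browser immediately), and a second callback should then be invoked once per region. Everything here is a hypothetical stand-in: `FAKE_SITE`, `fetch`, `SpiderSketch`, and `crawl` are made-up names replacing Scrapy/Selenium, and only the callback-chaining structure is the point — not the real site or APIs.

```python
# Hypothetical stand-in for the site: URL -> list of link hrefs on that page.
FAKE_SITE = {
    "/colorado": ["/colorado/denver", "/colorado/boulder"],   # region links
    "/colorado/denver": ["/dispensary/a", "/dispensary/b"],   # dispensary links
    "/colorado/boulder": ["/dispensary/c"],
}

def fetch(url):
    """Stand-in for browser.get() + find_elements(): returns hrefs on a page."""
    return FAKE_SITE.get(url, [])

class SpiderSketch:
    def __init__(self):
        self.regionlinks = []
        self.dispensarylinks = []

    def parse(self, url):
        # Collect every region link first, then yield one follow-up
        # (url, callback) request per region -- do not recurse immediately.
        for region in fetch(url):
            self.regionlinks.append(region)
            yield (region, self.dispensaryparse)

    def dispensaryparse(self, url):
        # Called once per region by the scheduler below.
        for dispensary in fetch(url):
            self.dispensarylinks.append(dispensary)
            yield (dispensary, None)  # terminal: no further callback

def crawl(spider, start_url):
    """Tiny breadth-first scheduler over (url, callback) requests."""
    queue = [(start_url, spider.parse)]
    while queue:
        url, callback = queue.pop(0)
        if callback is not None:
            queue.extend(callback(url))

spider = SpiderSketch()
crawl(spider, "/colorado")
print(spider.regionlinks)      # both regions collected, not just the first
print(spider.dispensarylinks)  # dispensaries from every region
```

In real Scrapy terms I assume this corresponds to `parse` yielding `scrapy.Request(region, callback=self.dispensaryparse)` for each region rather than calling `self.browser.get(region)` directly, so the region URL travels with the request instead of through a `global` variable.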