This is my first time trying to use CrawlSpider to scrape a website, and to my dismay my spider isn't returning any results. I'm also new to Python, so please bear with me if I've made any obvious mistakes.
Here is my code:
from scrapy.settings import Settings
from scrapy.settings import default_settings
from selenium import webdriver
from urlparse import urlparse
import csv
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log

default_settings.DEPTH_LIMIT = 3

class MySpider(CrawlSpider):
    def __init__(self, url, ofile, csvWriter):
        self.url = url
        self.driver = webdriver.PhantomJS('/usr/local/bin/phantomjs')
        self.ofile = ofile
        self.csvWriter = csvWriter
        self.name = "jin"
        self.start_urls = [url]
        self.rules = [Rule(SgmlLinkExtractor(), callback='parse_website', follow=True)]

    def parse_website(self, response):
        url = self.url
        driver = self.driver
        csvWriter = self.csvWriter
        ofile = self.ofile
        self.log('A response from %s just arrived!' % response.url)
        driver.get(url)
        htmlSiteUrl = self.get_site_url(driver)
        htmlImagesList = self.get_html_images_list(driver, url)

    def get_site_url(self, driver):
        url = driver.current_url
        return url

    def get_html_images_list(self, driver, url):
        listOfimages = driver.find_elements_by_tag_name('img')
        return listOfimages

    driver.close()

with open('/Users/hyunjincho/Desktop/BCorp_Websites.csv') as ifile:
    website_batch = csv.reader(ifile, dialect=csv.excel_tab)
    ofile = open('/Users/hyunjincho/Desktop/results.csv', 'wb')
    csvWriter = csv.writer(ofile, delimiter=' ')
    for website in website_batch:
        url = ''.join(website)
        aSpider = MySpider(url, ofile, csvWriter)
    ofile.close()
Why doesn't my spider scrape anything? What am I doing wrong in my code? Can someone help me?
Answer 0 (score: 1)
You shouldn't launch a spider this way; see how it's done in the excellent scrapy tutorial:

scrapy crawl jin
Also, if you want to read the URL(s) from an external file, see Scrapy read list of URLs from file to scrape?
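As a minimal sketch of that idea (assuming, as in the question, a tab-dialect CSV where each row's cells join into a single URL; the file name here is a throwaway stand-in for the real path), the URLs can be loaded up front and then passed to the spider as `start_urls`:

```python
import csv

def load_start_urls(path):
    """Read one URL per row from a tab-delimited CSV file.

    Each row's cells are joined into a single URL, as in the
    original code; blank rows are skipped.
    """
    urls = []
    with open(path) as ifile:
        for row in csv.reader(ifile, dialect=csv.excel_tab):
            url = ''.join(row).strip()
            if url:
                urls.append(url)
    return urls

# Demo with a throwaway file standing in for the real CSV:
with open('websites_demo.csv', 'w') as f:
    f.write('http://example.com\nhttp://example.org\n')

print(load_start_urls('websites_demo.csv'))
# → ['http://example.com', 'http://example.org']
```

The point is to do the file reading once, outside the spider, instead of constructing a spider object per row by hand.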
Finally, output is produced by creating items and processing them with the configured pipelines; if you want to write them to a csv file, use the csv item exporter.