Pass an input file to Scrapy containing the list of domains to scan

Date: 2013-12-20 11:33:01

Tags: python scrapy

I came across this question: [Pass Scrapy Spider a list of URLs to crawl via .txt file]. That approach changes the list of start URLs. What I want is to crawl the pages of each domain (read from a file) and put the results for each domain into a separate file (named after the domain). I have already scraped data for one site, but there I hard-coded the start URL and allowed_domains in the spider itself. How do I change this so the spider takes them from an input file?

Update 1:

Here is the code I tried:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class AppleItem(Item):
    reference_link = Field()
    rss_link = Field()

class AppleSpider(CrawlSpider):

    name = 'apple'
    allowed_domains = []
    start_urls = []

    def __init__(self):
        for line in open('./domains.txt', 'r').readlines():
            self.allowed_domains.append(line)
            self.start_urls.append('http://%s' % line)

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self, response):
        sel = HtmlXPathSelector(response)
        rsslinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        items = []
        for rss in rsslinks:
          item = AppleItem()
          item['reference_link'] = response.url
          item['rss_link'] = rsslinks
          items.append(item)
        filename = response.url.split("/")[-2]
        open(filename+'.csv', 'wb').write(items)

When I run it, I get this error: AttributeError: 'AppleSpider' object has no attribute '_rules'
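The error seems to come from overriding __init__ without ever calling CrawlSpider's own initializer, which is what compiles the rules attribute into self._rules. Below is a minimal sketch of that change, reusing the class, rules, and domains.txt from the code above; the super() call and the .strip() on each line are the additions:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AppleSpider(CrawlSpider):

    name = 'apple'
    allowed_domains = []
    start_urls = []

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def __init__(self, *args, **kwargs):
        # CrawlSpider.__init__ compiles self._rules; call it before our own setup
        super(AppleSpider, self).__init__(*args, **kwargs)
        for line in open('./domains.txt', 'r').readlines():
            domain = line.strip()  # drop the trailing newline
            if domain:
                self.allowed_domains.append(domain)
                self.start_urls.append('http://%s' % domain)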

1 Answer:

Answer 0 (score: 4)

You can read the file in the spider's __init__ method and overwrite start_urls and allowed_domains there.

Suppose our file domains.txt contains:

example1.com
example2.com
...

Example:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        # let BaseSpider run its own initialization first
        super(MySpider, self).__init__(*args, **kwargs)
        for line in open('./domains.txt', 'r').readlines():
            domain = line.strip()  # drop the trailing newline
            self.allowed_domains.append(domain)
            self.start_urls.append('http://%s' % domain)

    def parse(self, response):
        # here you will get data parsing the page
        # then put your data into a single file
        # from the scrapy tutorial http://doc.scrapy.org/en/latest/intro/tutorial.html
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(your_data)
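For the your_data placeholder, one possible way to get one output file per domain is to write the matched links with the standard csv module and name the file after the host part of the URL. This is only a sketch under the assumptions of the question (PDF links per page, Python 2 / Scrapy 0.x imports); the column layout is not part of the original answer:

import csv
from urlparse import urlparse  # Python 2, matching the Scrapy 0.x imports above

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "myspider"
    # allowed_domains / start_urls are filled from domains.txt in __init__ as shown above

    def parse(self, response):
        sel = HtmlXPathSelector(response)
        pdflinks = sel.select('//a[contains(@href, "pdf")]/@href').extract()
        # name the output file after the domain, e.g. example1.com.csv
        filename = urlparse(response.url).netloc + '.csv'
        with open(filename, 'ab') as f:  # append so results from several pages accumulate
            writer = csv.writer(f)
            for href in pdflinks:
                writer.writerow([response.url, href])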