Running a Scrapy spider from a main function

Date: 2017-02-13 12:00:17

Tags: python scrapy

I already have a working spider in Scrapy, but I want to start the crawling from a main method.

import sys, getopt
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request
import re

class TutsplusItem(scrapy.Item):
  title = scrapy.Field()



class MySpider(Spider):
  name = "tutsplus"
  allowed_domains   = ["bbc.com"]
  start_urls = ["http://www.bbc.com/"]

  def __init__(self, *args):
      try:
          opts, args = getopt.getopt(args, "hi:o:", ["ifile=", "ofile="])
      except getopt.GetoptError:
          print('test.py -i <inputfile> -o <outputfile>')
          sys.exit(2)

      super(MySpider, self).__init__(*args)



  def parse(self, response):
    links = response.xpath('//a/@href').extract()


    # We stored already crawled links in this list
    crawledLinks = []

    # Pattern to check proper link
    # I only want to get the tutorial posts
    # linkPattern = re.compile("^\/tutorials\?page=\d+")


    for link in links:
      # If it is a proper link and is not checked yet, yield it to the Spider
      #if linkPattern.match(link) and link not in crawledLinks:
      if link not in crawledLinks:
        link = "http://www.bbc.com" + link
        crawledLinks.append(link)
        yield Request(link, callback=self.parse)

    titles = response.xpath('//a[contains(@class, "media__link")]/text()').extract()
    for title in titles:
      item = TutsplusItem()
      item["title"] = title
      print("Title is : %s" % title)
      yield item

Instead of running `scrapy runspider Crawler.py arg1 arg2`, I want to have a separate class with a main function and start Scrapy from there. How can I do that?
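As a side note, the `getopt` parsing done in the spider's `__init__` can be exercised on its own before wiring it into Scrapy. The sketch below uses only the standard library; the option names (`-i`/`-o`) mirror the ones in the spider, and the helper function name is illustrative:

```python
import getopt

def parse_args(argv):
    """Parse -i/-o options the same way the spider's __init__ does."""
    try:
        opts, remaining = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        # In the spider this prints a usage message and exits.
        return None
    inputfile = outputfile = None
    for opt, arg in opts:
        if opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg
    return inputfile, outputfile

print(parse_args(["-i", "in.txt", "-o", "out.txt"]))  # ('in.txt', 'out.txt')
```

Note that `getopt.GetoptError` is raised for any unrecognized option, which is why the spider wraps the call in a `try`/`except`.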

1 Answer:

Answer 0 (score: 0)

There are different ways to approach this, but I would suggest the following:

Have a main.py file in the same directory that opens a new process and starts the spider with the arguments you need.

The main.py file would contain the following:

import subprocess

scrapy_command = 'scrapy runspider {spider_name} -a param_1="{param_1}"'.format(spider_name='your_spider', param_1='your_value')

process = subprocess.Popen(scrapy_command, shell=True)
# Block until the spider finishes, otherwise main.py exits immediately.
process.wait()

With this code, you only need to run the main file:

python main.py
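To see the `Popen` pattern in isolation, here is a minimal sketch that launches a subprocess and waits for its output. The spawned command is a harmless placeholder (`python -c "print(42)"`) standing in for the `scrapy runspider` invocation, so it runs without Scrapy installed:

```python
import subprocess
import sys

# Same pattern as main.py, but with a placeholder command in place of
# the scrapy invocation; the argument-list form avoids shell quoting issues.
process = subprocess.Popen(
    [sys.executable, "-c", "print(42)"],
    stdout=subprocess.PIPE,
)
output, _ = process.communicate()  # waits for the child to finish
print(output.decode().strip())
```

`communicate()` both collects the child's stdout and waits for it to exit, so the parent does not terminate before the spider is done.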

Hope it helps.