Passing domain into Scrapy Web crawler

时间:2017-04-09 23:25:25

标签: python scrapy scrapy-spider

I am building a web crawler which the user will enter the URL into a script that they run first, then said script runs crawler with the domain entered into it. I have some cleaning to do, however I need to get the protoype going. I have made the code and what happens is that the crawler script keeps asking for the URL. I have tried using the terminal commands to enter it however I don't think my code is compatible with that. Is there a better way to pass domains entered by an end users from another script?

# First script
import os

def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider crawler_prod.py")

# Crawler Script

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider

from run_first import userInput

userInput()

class InputSpider(CrawlSpider):
        name = "Input"
        user_input = ""
        allowed_domains = [user_input]
        start_urls = ["http://" + user_input + "/"]

        # allow=() is used to match all links
        rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
        ]

        def parse_item(self, response):
            x = HtmlXPathSelector(response)
            filename = "output.txt"
            open(filename, 'ab').write(response.url + "\n")

I run it just by running the first script through the terminal. Some help figuring out how to pass a domain as a variable would be good.

1 个答案:

答案 0 :(得分:1)

use the start_requests method instead of start_urls:

def start_requests(self):
    yield Request(url=self.user_input)

...

Also remove the allowed_domains class variable so the spider can allow all domains it needs.

That way you can just call the spider with scrapy crawl myspider -a user_input="http://example.com"