I am building a web crawler where the user enters the URL into a script that they run first; that script then runs the crawler with the domain that was entered. I still have some cleaning up to do, but I need to get the prototype going. I have written the code, and what happens is that the crawler script keeps asking for the URL. I have tried entering it via terminal commands, but I don't think my code is compatible with that. Is there a better way to pass a domain entered by an end user from one script to another?
# First script
import os
def userInput():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    os.system("scrapy runspider crawler_prod.py")
# Crawler Script
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from run_first import userInput

userInput()

class InputSpider(CrawlSpider):
    name = "Input"
    user_input = ""
    allowed_domains = [user_input]
    start_urls = ["http://" + user_input + "/"]

    # allow=() is used to match all links
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)
        filename = "output.txt"
        open(filename, 'ab').write(response.url + "\n")
I run it just by running the first script from the terminal. Some help figuring out how to pass the domain as a variable would be appreciated.
Answer 0 (score: 1)
Use the start_requests method instead of start_urls:
def start_requests(self):
    # Request comes from scrapy.http
    yield Request(url=self.user_input)
...
Also remove the allowed_domains class variable so the spider can allow all the domains it needs.
That way you can just call the spider with scrapy crawl myspider -a user_input="http://example.com".
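For reference, here is a minimal sketch of how the two pieces could fit together, assuming the same old scrapy.contrib imports used in the question (newer Scrapy versions use scrapy.spiders and scrapy.linkextractors instead). Arguments passed with -a reach the spider's __init__ as keyword arguments, so the spider can store user_input there; the launcher script and the subprocess call are illustrative assumptions, not the only way to run it.

# crawler_prod.py (sketch)
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class InputSpider(CrawlSpider):
    name = "Input"

    # allow=() is used to match all links
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')
    ]

    def __init__(self, user_input=None, *args, **kwargs):
        super(InputSpider, self).__init__(*args, **kwargs)
        # Filled in by: scrapy crawl Input -a user_input="http://example.com"
        self.user_input = user_input

    def start_requests(self):
        # Start from the user-supplied URL instead of a hard-coded start_urls list
        yield Request(url=self.user_input)

    def parse_item(self, response):
        # Append each crawled URL to output.txt, as in the original spider
        with open("output.txt", "ab") as f:
            f.write(response.url + "\n")

# run_first.py (sketch)
import subprocess

def main():
    user_input = raw_input("Please enter URL. Please do not include http://: ")
    # Hand the URL to the spider as a -a argument instead of importing the
    # prompt into the crawler script.
    subprocess.call([
        "scrapy", "crawl", "Input",
        "-a", "user_input=http://" + user_input,
    ])

if __name__ == "__main__":
    main()

Note that scrapy crawl only finds the spider inside a Scrapy project; for a standalone file, scrapy runspider crawler_prod.py -a user_input="http://example.com" does the same job.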