Scrapy (Python): taking spider arguments

Time: 2017-09-28 14:40:19

Tags: python scrapy

I want to pass a URL to my spider as an argument. For example:

scrapy crawl test -a url="https://example.com"

After that I want to take start_urls and automatically convert it into the allowed domains. For example:

domain_allowed = ['example.com']
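One way to derive the domain from the URL is the standard library's URL parser. This is a minimal sketch, assuming Python 3; the helper name `domain_from_url` is made up for illustration, and stripping a leading "www." is an assumption about what the spider should allow.

```python
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the host part of a URL, e.g. 'example.com' (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # Assumption: drop a leading "www." so the bare domain is what gets allowed.
    if host.startswith("www."):
        host = host[4:]
    return host


# The -a argument arrives as one string; split it on commas like the spider does.
start_urls = "https://example.com".split(",")
allowed_domains = [domain_from_url(u) for u in start_urls]
print(allowed_domains)
```

Unlike splitting on "/" and taking the last piece, this still works when the URL has a path or a trailing slash.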

After that I want to pass the word "example" to the MySQL pipeline, which should use it (taken from domain_allowed) to create the table.

This is what I have so far:

class SeekerSpider(BaseSpider):
    name = 'seeker'

    def __init__(self, *args, **kwargs):
        urls = kwargs.pop('urls', [])
        if urls:
            self.start_urls = urls.split(',')
        self.logger.info(self.start_urls)

        # take the arg "urls" and convert it to allowed_domains
        url = "".join(urls)
        self.allowed_domains = [url.split('/')[-1]]

        super(SeekerSpider, self).__init__(*args, **kwargs)


    # I have to use "domain" here, not inside parse_page or __init__
    domain = domain_allowed.replace(".", "_")
    # create folder with the domain name

    def parse_page(self, response):
        ...

Basically I need to use self.allowed_domains outside of the functions... that is my problem... the variable is not accepted there.
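The class body runs once, when the class is defined, so there is no `self` (and no `-a` argument) at that point. A possible workaround is to compute the derived value inside `__init__` and keep it as an instance attribute. The sketch below assumes Python 3 and uses a small stand-in `BaseSpider` so it runs without Scrapy installed; the attribute name `domain` is illustrative.

```python
from urllib.parse import urlparse


class BaseSpider(object):
    """Stand-in for Scrapy's spider base class (assumption, for a runnable demo)."""

    def __init__(self, name=None, **kwargs):
        # Store any leftover keyword arguments as attributes.
        self.__dict__.update(kwargs)


class SeekerSpider(BaseSpider):
    name = 'seeker'

    def __init__(self, *args, **kwargs):
        urls = kwargs.pop('urls', '')
        self.start_urls = [u for u in urls.split(',') if u]
        # Instance attributes can only be computed here, once "self" exists.
        self.allowed_domains = [urlparse(u).hostname for u in self.start_urls]
        # Derived value that the class body could not compute.
        self.domain = (self.allowed_domains[0].replace('.', '_')
                       if self.allowed_domains else '')
        super(SeekerSpider, self).__init__(*args, **kwargs)


spider = SeekerSpider(urls='https://example.com')
print(spider.allowed_domains, spider.domain)
```

With this constructor the argument name on the command line has to match the key that is popped, i.e. `-a urls=...`; any method such as parse_page can then read `self.domain`.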

This is part of my pipelines.py:

class MySQLPipeline(object):
    def __init__(self, *args, **kwargs):
        self.connect = pymysql.connect(...)
        self.cursor = self.connect.cursor()
        # print "Input the name of the table: "  <-- its commented
        # tablename = raw_input(" ")   <-- its commented
        date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
        self.tablename = kwargs.pop('tbl', '')
        self.newname = self.tablename + "_" + date
        print self.newname
        # create a different way to create a tablename
        # importing the "allowed_domain" and strip it 
        # and give tablename

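This is what I did for the pipeline... but it is not good... I want to take allowed_domain from the spider, pass it in here, and split it so that only the name of the domain is left, without the .com or .whatever.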
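Building the table name from the domain could look roughly like this. It is a sketch under stated assumptions: Python 3, a hypothetical helper `table_name_for`, and a naive rule that takes the label just before the last dot, which gives the wrong result for suffixes such as .co.uk.

```python
import datetime
import re


def table_name_for(domain, now=None):
    """Build '<name>_<yy_mm_dd_HH_MM>' from a domain (hypothetical helper)."""
    now = now or datetime.datetime.now()
    labels = domain.split('.')
    # Naive assumption: the label before the last dot is the site name.
    name = labels[-2] if len(labels) >= 2 else labels[0]
    # Keep only characters that are safe in a table name.
    name = re.sub(r'\W', '_', name)
    return name + "_" + now.strftime("%y_%m_%d_%H_%M")


print(table_name_for('example.com', datetime.datetime(2017, 9, 28, 14, 40)))
# prints: example_17_09_28_14_40
```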

Thanks in advance.

1 Answer:

Answer 0: (score: 0)

Sorry about the formatting, I am on my phone...

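I would use the spider object in the process_item function. Since allowed_domains is a list, take its first element:

def process_item(self, item, spider):
    spider.allowed_domains[0].replace(".", "_")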
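To flesh that idea out a little: Scrapy item pipelines also have an `open_spider(self, spider)` hook that is called once when the spider starts, which may be a better place than `process_item` to pick the table name. The following is only a sketch; the database calls are left out, and `FakeSpider` is a made-up stand-in so the example runs without Scrapy or MySQL.

```python
class MySQLPipeline(object):
    def open_spider(self, spider):
        # The spider instance, and therefore allowed_domains, is available here.
        domain = spider.allowed_domains[0]
        self.tablename = domain.replace('.', '_')

    def process_item(self, item, spider):
        # A real pipeline would INSERT the item into self.tablename here.
        return item


class FakeSpider(object):
    """Hypothetical stand-in for a running spider, for demonstration only."""
    allowed_domains = ['example.com']


pipeline = MySQLPipeline()
pipeline.open_spider(FakeSpider())
print(pipeline.tablename)
```

This way the pipeline does not need a `tbl` keyword argument at all; it reads what it needs from the spider it is given.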