Question

I want to pass arguments to a spider in the form of a URL. For example:

scrapy crawl test -a url="https://example.com"

After that, I want it to be picked up as start_urls automatically and converted into domain_allowed. For example:

domain_allowed = ['example.com']

Then I want to pass the word "example" to the MySQL pipeline, where the "example" taken from domain_allowed is used to create the table.

This is what I have so far:

    class Spider(BaseSpider):
        name = 'seeker'

        def __init__(self, *args, **kwargs):
            urls = kwargs.pop('urls', [])
            if urls:
                self.start_urls = urls.split(',')
                self.logger.info(self.start_urls)
            # take the arg "urls" and convert it to allowed_domains
            url = "".join(urls)
            self.allowed_domains = [url.split('/')[-1]]
            super(SeekerSpider, self).__init__(*args, **kwargs)

        # i have to use "domain" here and not inside the function parge_page or __init__
        domain = domain_allowed.replace(".", "_")
        # create folder with the domain name

        def parse_page(self, response):
            ...

Basically, I need to use self.allowed_domains outside of a function... and that is my problem: the variable domain does not accept it.

This is part of my pipelines.py:

    class MySQLPipeline(object):
        def __init__(self, *args, **kwargs):
            self.connect = pymysql.connect(...)
            self.cursor = self.connect.cursor()
            # print "Input the name of the table: "  <-- its commented
            # tablename = raw_input(" ")  <-- its commented
            date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
            self.tablename = kwargs.pop('tbl', '')
            self.newname = self.tablename + "_" + date
            print self.newname
            # create a different way to create a tablename
            # importing the "allowed_domain" and strip it
            # and give tablename

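I have already written the pipeline this way, but it is not good. I want to take allowed_domain from the spider, pass it in here, and split it so that only the name of the domain is kept, without the .com or .whatever part.

Thanks in advance.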

Answer 1

Sorry about the formatting, I'm on my phone...

I would use the spider object in the process_item function:

    def process_item(self, item, spider):
        # allowed_domains is a list, so take its first entry
        spider.allowed_domains[0].replace(".", "_")
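To make that idea a bit more concrete, here is a rough sketch rather than a definitive implementation. It assumes Python 3 and uses only the standard library. The helper names `domain_from_url` and `site_name` and the class `TableNamePipeline` are made up for illustration; `open_spider` and `process_item` are the pipeline hooks Scrapy calls, as far as I know. Taking the first dot-separated label as the site name is a naive assumption that gives the wrong result for hosts like blog.example.com.

```python
import datetime
from urllib.parse import urlparse


def domain_from_url(url):
    """Return the host of a URL without a leading 'www.' (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def site_name(domain):
    """Naive assumption: the first dot-separated label is the site name."""
    return domain.split(".")[0]


class TableNamePipeline:
    """Hypothetical pipeline that derives its table name from the spider."""

    def open_spider(self, spider):
        # Runs once when the spider starts, so spider.allowed_domains
        # has already been set by the spider's __init__.
        domain = spider.allowed_domains[0]
        date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
        self.tablename = site_name(domain) + "_" + date

    def process_item(self, item, spider):
        # self.tablename would be used here when building the INSERT statement.
        return item
```

On the spider side, the class-level `domain = ...` line cannot work, because the class body runs before any instance (and therefore any argument) exists. Inside `__init__` you could instead set something like `self.allowed_domains = [domain_from_url(u) for u in self.start_urls]` and read it later from the pipeline, as above.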
