I want to pass a URL to the spider as an argument. For example:

    scrapy crawl test -a url="https://example.com"

After that, I want to take start_urls automatically and convert it into the allowed domain. For example:

    domain_allowed = ['example.com']

Then I want to pass the word "example" to the MySQL pipeline and use the entry from domain_allowed there to create the table.
This is what I have so far:
    class SeekerSpider(BaseSpider):
        name = 'seeker'

        def __init__(self, *args, **kwargs):
            urls = kwargs.pop('urls', [])
            if urls:
                self.start_urls = urls.split(',')
                self.logger.info(self.start_urls)
            # take the arg "urls" and convert it to allowed_domains
            url = "".join(urls)
            self.allowed_domains = [url.split('/')[-1]]
            super(SeekerSpider, self).__init__(*args, **kwargs)

        # I have to use "domain" here, not inside parse_page or __init__
        domain = domain_allowed.replace(".", "_")  # <-- this is the line that fails
        # create a folder with the domain name

        def parse_page(self, response):
            ...
Basically I need to use self.allowed_domains outside of the functions... that is my problem... the variable domain won't accept it.
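For reference, a more robust way to get the host out of the URL than splitting on `/` is `urlparse`; this is a minimal sketch (the helper name `domain_from_url` is mine, not from the question):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def domain_from_url(url):
    """Return the host part of a URL, e.g. 'example.com'."""
    netloc = urlparse(url).netloc
    return netloc or url  # fall back if a bare domain was passed

# inside __init__ you could then do:
# self.allowed_domains = [domain_from_url(u) for u in self.start_urls]
# self.domain = self.allowed_domains[0].replace(".", "_")
```

Computing the value in `__init__` and storing it on `self` sidesteps the class-level-variable problem, since nothing is evaluated before the spider instance exists.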
Here is part of my pipelines.py:

    class MySQLPipeline(object):
        def __init__(self, *args, **kwargs):
            self.connect = pymysql.connect(...)
            self.cursor = self.connect.cursor()
            # print "Input the name of the table: "   <-- commented out
            # tablename = raw_input(" ")              <-- commented out
            date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
            self.tablename = kwargs.pop('tbl', '')
            self.newname = self.tablename + "_" + date
            print self.newname
            # TODO: build the table name a different way:
            # import the spider's "allowed_domain", strip it,
            # and use that as the table name
In the pipeline I already did it this way... but it's not good... I want to take allowed_domains from the spider, pass it in here, and split it so that only the domain name remains, without the .com or .whatever.
Thanks in advance.
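A side note on `kwargs.pop('tbl', '')` above: Scrapy does not forward spider arguments to a pipeline's `__init__`. If the pipeline needs configuration, the usual route is the `from_crawler` classmethod, which receives the crawler and its settings. A minimal sketch, assuming a hypothetical `TABLE_PREFIX` setting (not part of Scrapy itself):

```python
class MySQLPipeline(object):
    def __init__(self, table_prefix):
        self.table_prefix = table_prefix

    @classmethod
    def from_crawler(cls, crawler):
        # Pipelines don't receive spider kwargs in __init__;
        # from_crawler lets you read project settings instead.
        return cls(table_prefix=crawler.settings.get('TABLE_PREFIX', 'scrapy'))
```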
Answer 0 (score: 0)
Sorry for the formatting, I'm on my phone...
I would use the spider object in the process_item function:

    def process_item(self, item, spider):
        spider.allowed_domains[0].replace(".", "_")

(Note that allowed_domains is a list, so take an element before calling replace.)
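Building on that idea, every pipeline method that Scrapy calls (`open_spider`, `close_spider`, `process_item`) receives the spider, so the pipeline can read `spider.allowed_domains` directly. A minimal sketch of deriving the table name once per crawl, with the actual MySQL calls left as comments since they depend on your schema:

```python
import datetime

class MySQLPipeline(object):
    def open_spider(self, spider):
        # Build the table name from the spider's allowed_domains.
        domain = spider.allowed_domains[0]   # e.g. 'example.com'
        base = domain.split('.')[0]          # 'example' (drops .com/.whatever)
        date = datetime.datetime.now().strftime("%y_%m_%d_%H_%M")
        self.tablename = base + "_" + date
        # here you would run CREATE TABLE IF NOT EXISTS using self.tablename

    def process_item(self, item, spider):
        # INSERT the item into self.tablename ...
        return item
```

Doing this in `open_spider` rather than `process_item` means the name is computed once instead of on every scraped item.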