I'm trying to run Scrapy from within a Python script. Here is the relevant code:
import scrapy
from scrapy.crawler import CrawlerProcess

class PostSpider(scrapy.Spider):
    name = "post crawler"
    allowed_domains = ['test.com']

    def __init__(self, **kwargs):
        super(PostSpider, self).__init__(**kwargs)
        url = kwargs.get('url')
        print(url)
        self.start_urls = ['https://www.test.com/wp-json/test/2.0/posts' + url]

    def parse(self, response):
        post = json.loads(response.body_as_unicode())
        post = post["content"]
        return post

posts = GA.retrieve(TIA.start_date, TIA.end_date, "content type auto")

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

for post in posts:
    post_url = post[2]
    process.crawl(PostSpider(url=post_url))

process.start()
I tried to roughly follow the guides here and here, but I can't get it to work. This is the error message I get:
Unhandled error in Deferred:
2016-03-25 20:49:43 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "text_analysis.py", line 48, in <module>
process.crawl(PostSpider(url=post_url))
File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 163, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 167, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/Users/terence/TIA/lib/python3.5/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
File "/Users/terence/TIA/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
result = g.send(result)
File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 71, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 94, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
spider = cls(*args, **kwargs)
File "text_analysis.py", line 17, in __init__
self.start_urls = ['https://www.techinasia.com/wp-json/techinasia/2.0/posts' + url]
builtins.TypeError: Can't convert 'NoneType' object to str implicitly
2016-03-25 20:49:43 [twisted] CRITICAL:
/xiaomi-still-got-it-bitches
2016-03-25 20:49:43 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']
I can't seem to figure out what's going wrong.
Answer 0 (score: 2)
The call to process.crawl() must be

    process.crawl(PostSpider, url=post_url)

because its definition is

    crawl(crawler_or_spidercls, *args, **kwargs)

It expects the spider class (not an instantiated object) as its first argument. All following positional and keyword arguments (*args, **kwargs) are passed on to the spider's __init__.
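The traceback bears this out: when an already-instantiated spider is passed in, Scrapy still calls from_crawler on the class without the url keyword, so kwargs.get('url') returns None and the string concatenation in __init__ fails. A minimal sketch of the corrected scheduling loop, reusing the posts list and the PostSpider class from the question:

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    for post in posts:
        post_url = post[2]
        # Pass the spider class itself; Scrapy instantiates it internally
        # and forwards url=post_url to PostSpider.__init__
        process.crawl(PostSpider, url=post_url)

    # start() runs the Twisted reactor and blocks until all scheduled crawls finish
    process.start()

Note that process.start() is still called only once, after every crawl has been scheduled.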
Answer 1 (score: 1)