I have a working spider, but my JSON file is empty. My spider is supposed to collect all of the articles from The New York Times.
class ParseSpider(CrawlSpider):
    name = "new"
    allowed_domains = ["www.nytimes.com"]
    start_urls = ["https://www.nytimes.com/section/world?WT.nav=page&action=click&contentCollection=World&module=HPMiniNav&pgtype=Homepage&region=TopBar"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="story"]/div[3]/div[1]',)), callback="parse_items", follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = parseLoader(parse_item(), hxs)
        l.add_value('url', response.url)
        l.add_xpath('name', '//*[@id="headline"]' % u"Название статьи:")
        l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]' % u"Текст:")
I have since changed the spider. EDIT:
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]')), callback='parse_item'),
)

def parse_item(self, response):
    l = parseLoader(response=response)
    l.add_value('url', response.url)
    l.add_xpath('name', '//*[@id="headline"]' % u"Название статьи:")
    l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]' % u"Текст:")
    yield l.load_item()
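(For context: the empty JSON file is presumably being produced with Scrapy's feed export, i.e. a run along these lines, where items.json is an assumed output filename:)

    $ scrapy crawl new -o items.json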
Answer 0 (score: 0)
It looks like you have infinite recursion in your parse_item method: you call parse_item() inside itself when constructing the loader. You don't need a selector there at all, and you shouldn't even be using HtmlXPathSelector, which is deprecated. (Note also that the '...' % u"..." string formatting in your add_xpath calls is invalid and would raise a TypeError; the XPath string alone is enough.) Try:
def parse_item(self, response):
    l = parseLoader(response=response)
    l.add_value('url', response.url)
    l.add_xpath('name', '//*[@id="headline"]')  # article title
    l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]')  # article text
    yield l.load_item()
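For this to run, a loader class named parseLoader has to be defined somewhere; it isn't shown in the question. A minimal sketch of what it might look like, assuming Scrapy 1.x-style item loaders (ArticleItem and its field names are assumptions inferred from the add_value/add_xpath calls above):

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst, Join

    class ArticleItem(scrapy.Item):
        # fields matching the loader calls in parse_item
        url = scrapy.Field()
        name = scrapy.Field()
        text = scrapy.Field()

    class parseLoader(ItemLoader):
        default_item_class = ArticleItem
        default_output_processor = TakeFirst()  # keep the first match per field
        text_out = Join()  # concatenate all text fragments into one string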
EDIT: It seems your link extractor rule isn't extracting anything. First, you shouldn't use SgmlLinkExtractor; it is deprecated. Second, your XPath isn't capturing anything: it is too specific and, in some cases, incorrect. Try:
LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]',))
You can debug this and try it out in the scrapy shell command:
$ scrapy shell "https://www.nytimes.com/section/..."
from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]',))
le.extract_links(response)
# 20 results will be printed
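Putting the pieces together, a corrected spider might look like the sketch below. This is a sketch, not a drop-in fix: it assumes the parseLoader class outlined earlier (its import path is hypothetical), and the XPaths are the ones discussed here, which may still need tuning against the live page.

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    # from myproject.items import parseLoader  # hypothetical import path for the loader sketched above

    class ParseSpider(CrawlSpider):
        name = "new"
        allowed_domains = ["www.nytimes.com"]
        start_urls = ["https://www.nytimes.com/section/world?WT.nav=page&action=click&contentCollection=World&module=HPMiniNav&pgtype=Homepage&region=TopBar"]

        # follow links found under any element whose id contains "story"
        rules = (
            Rule(LinkExtractor(restrict_xpaths=('//*[contains(@id,"story")]',)),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            l = parseLoader(response=response)
            l.add_value('url', response.url)
            l.add_xpath('name', '//*[@id="headline"]')
            l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]')
            yield l.load_item()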