Scrapy: I get no results

Posted: 2017-04-10 18:11:25

Tags: python scrapy

I have a program that runs, but my JSON output file is empty. The program is supposed to collect all of the articles from The New York Times.

class ParseSpider(CrawlSpider):
    name = "new"
    allowed_domains = ["www.nytimes.com"]
    start_urls = ["https://www.nytimes.com/section/world?WT.nav=page&action=click&contentCollection=World&module=HPMiniNav&pgtype=Homepage&region=TopBar"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="story"]/div[3]/div[1]',)), callback="parse_items", follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = parseLoader(parse_item(), hxs)

        l.add_value('url', response.url)
        l.add_xpath('name', '//*[@id="headline"]' % u"Название статьи:")
        l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]' % u"Текст:")
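For reference, the question never shows how parseLoader or the item fields are defined; presumably parseLoader is an ItemLoader subclass. A minimal sketch of what those definitions might look like (ArticleItem and the TakeFirst processor are assumptions inferred from the add_value/add_xpath calls above, not code from the question):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst

class ArticleItem(scrapy.Item):
    # field names match the loader calls in parse_item
    url = scrapy.Field()
    name = scrapy.Field()
    text = scrapy.Field()

class parseLoader(ItemLoader):
    default_item_class = ArticleItem
    default_output_processor = TakeFirst()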

Edit: I changed the program:

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]')), callback='parse_item'),
    )

    def parse_item(self, response):
        l = parseLoader(response=response)

        l.add_value('url', response.url)
        l.add_xpath('name', '//*[@id="headline"]' % u"Название статьи:")
        l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]' % u"Текст:")
        yield l.load_item()

1 Answer:

Answer 0 (score: 0)

It looks like you have infinite recursion in your parse_item method: the loader is built with parseLoader(parse_item(), hxs), so parse_item is invoked from inside itself. You also don't need a selector there, and you shouldn't be using HtmlXPathSelector at all (it is deprecated). Try:

def parse_item(self, response):
    l = parseLoader(response=response)

    l.add_value('url', response.url)
    # pass bare XPath expressions; the stray '% u"..."' string formatting
    # in the original code raises a TypeError
    l.add_xpath('name', '//*[@id="headline"]')
    l.add_xpath('text', '//*[@id="story"]/div[3]/div[1]')
    yield l.load_item()
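Note that the callback must yield (or return) the loaded item: the original parse_item never returned anything, and a spider whose callbacks produce no items will always write an empty JSON feed, which matches the symptom in the question.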

Edit: it also seems your link extractor rule isn't extracting anything. First, you shouldn't use SgmlLinkExtractor, since it is deprecated. Second, your XPath doesn't capture anything: it is far too specific and in places simply incorrect. Try:

LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]',))
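The contains() predicate matches any element whose id merely contains "story" (for example story or story-body), which is far more forgiving than the exact nested path used before. Wired into the spider, the corrected rule might look like this (a sketch assuming Scrapy 1.x import paths; name, allowed_domains and start_urls are taken from the question):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParseSpider(CrawlSpider):
    name = "new"
    allowed_domains = ["www.nytimes.com"]
    start_urls = ["https://www.nytimes.com/section/world?WT.nav=page&action=click&contentCollection=World&module=HPMiniNav&pgtype=Homepage&region=TopBar"]

    rules = (
        # follow links found inside any element whose id contains "story"
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]',)),
             callback='parse_item', follow=True),
    )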

You can debug this and try it out in the scrapy shell:

$ scrapy shell "https://www.nytimes.com/section/..."
from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@id,"story")]',))
le.extract_links(response)
# 20 results will be printed
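If extract_links() returns results there, running the spider with a feed export should now produce a non-empty file (the output filename articles.json is arbitrary; -o is Scrapy's standard feed-export flag):

$ scrapy crawl new -o articles.json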