I am writing a scrapy spider to scrape today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate a link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I cannot get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I am at my wit's end. Is it because the homepage does not yield any articles from the parse function? Any help is greatly appreciated.
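
For reference, this is roughly the shell session I used to verify the extractor (the hard-coded date in the allow pattern is illustrative; the spider below builds it from date.today()):

$ scrapy shell http://www.nytimes.com
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> le = LinkExtractor(allow=(r'/2015/08/12/[a-z]+/.*\.html',))
>>> le.extract_links(response)   # returns a non-empty list of Link objects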
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today

class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
Answer 0 (score: 3)
I found the solution to my problem. I was doing two things wrong:

1. I needed to subclass CrawlSpider rather than Spider; the rules attribute is only honored by CrawlSpider.
2. When using CrawlSpider, I needed to use a custom callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality.
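
A minimal sketch of the corrected spider, reusing the rule and item from the question (parse_article is just an arbitrary name; anything other than parse works):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')

class NytSpider(CrawlSpider):  # was scrapy.Spider, which ignores rules
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        # CrawlSpider uses parse() internally to apply these rules,
        # so the callback must have a different name
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        article = NewsArticle()
        article['url'] = response.url
        article['title'] = response.xpath(
            '//h1[@id="story-heading"]/text()').extract()
        # remaining fields (author, published, content) are extracted
        # with the same XPaths as in the question
        yield article

With this change, scrapy crawl nyt -o out.json follows each link matched by the rule and runs parse_article on every article page.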