Update: I was able to get this moving, but it does not go back into the subpage and iterate through the sequence again. The data I want to extract is laid out like the table below:
date_1 | source_1 | link to article_1
date_2 | source_2 | link to article_2
...
I need to collect date_1 and source_1 first, then follow the link into that article, and repeat...
Any help is much appreciated. :)
from scrapy.spiders import BaseSpider
from scrapy.http import Request

from dirbot.items import WebsiteLoader


class DindexSpider(BaseSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = [
        "http://www.newslookup.com/Business/"
    ]

    def parse_subpage(self, response):
        self.log("Scraping: " + response.url)
        # Finish populating the loader passed along from parse()
        il = response.meta['il']
        time = response.xpath('//div[@id="update_data"]//td[@class="stime3"]//text()').extract()
        il.add_value('publish_date', time)
        yield il.load_item()

    def parse(self, response):
        self.log("Scraping: " + response.url)
        # One table cell per article on the index page
        sites = response.xpath('//td[@class="article"]')
        for site in sites:
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            # Note: this requests the index page again rather than the
            # article URL that was just extracted
            yield Request("http://www.newslookup.com/Business/", meta={'il': il}, callback=self.parse_subpage)
Answer 0 (score: 0)
That's just because you need to use the CrawlSpider class instead of the BaseSpider:
from scrapy.spiders import CrawlSpider

class DindexSpider(CrawlSpider):
    # ...
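
For completeness, here is a minimal sketch of what the full spider might look like as a CrawlSpider, with a Rule that follows the article links from the index page and a callback that fills in the item on each article page. The XPath expressions and dirbot.items.WebsiteLoader are carried over from the question; the restrict_xpaths value is an assumption about the layout of newslookup.com, and parse_article is a hypothetical callback name.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from dirbot.items import WebsiteLoader


class DindexSpider(CrawlSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = ["http://www.newslookup.com/Business/"]

    # Follow every link inside the article cells of the index page and
    # hand each fetched article to parse_article. restrict_xpaths is an
    # assumption about where the links live on the index page.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//td[@class="article"]'),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # Hypothetical callback: build the item from the article page
        il = WebsiteLoader(response=response)
        il.add_value('url', response.url)
        il.add_xpath('publish_date',
                     '//div[@id="update_data"]//td[@class="stime3"]//text()')
        yield il.load_item()

Note that a CrawlSpider must not override parse, because CrawlSpider uses that method internally to apply its rules; that is why the callback above has a different name.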