Update: I was able to get this moving, but it does not go back into the subpage and iterate through the sequence again. The data I want to extract is laid out like the table below:
date_1 | source_1 | link to article_1
date_2 | source_2 | link to article_2
...
I need to collect date_1 and source_1 first, then follow the link into that article, and repeat...
Any help is much appreciated. :)
from scrapy.spiders import BaseSpider
from scrapy.http import Request

from dirbot.items import WebsiteLoader


class DindexSpider(BaseSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = [
        "http://www.newslookup.com/Business/"
    ]

    def parse_subpage(self, response):
        self.log("Scraping: " + response.url)
        # Finish populating the loader passed along from parse()
        il = response.meta['il']
        time = response.xpath('//div[@id="update_data"]//td[@class="stime3"]//text()').extract()
        il.add_value('publish_date', time)
        yield il.load_item()

    def parse(self, response):
        self.log("Scraping: " + response.url)
        # One table cell per article on the index page
        sites = response.xpath('//td[@class="article"]')
        for site in sites:
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            # Note: this requests the index page again rather than the
            # article URL that was just extracted
            yield Request("http://www.newslookup.com/Business/", meta={'il': il}, callback=self.parse_subpage)
Answer 0 (score: 0)
That's just because you need to use the CrawlSpider class instead of the BaseSpider:
from scrapy.spiders import CrawlSpider

class DindexSpider(CrawlSpider):
    # ...
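
For completeness, here is a minimal sketch of what the full spider might look like as a CrawlSpider, with a Rule that follows the article links from the index page and a callback that fills in the item on each article page. The XPath expressions and dirbot.items.WebsiteLoader are carried over from the question; the restrict_xpaths value is an assumption about the layout of newslookup.com, and parse_article is a hypothetical callback name.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from dirbot.items import WebsiteLoader


class DindexSpider(CrawlSpider):
    name = "dindex"
    allowed_domains = ["newslookup.com"]
    start_urls = ["http://www.newslookup.com/Business/"]

    # Follow every link inside the article cells of the index page and
    # hand each fetched article to parse_article. restrict_xpaths is an
    # assumption about where the links live on the index page.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//td[@class="article"]'),
             callback='parse_article'),
    )

    def parse_article(self, response):
        # Hypothetical callback: build the item from the article page
        il = WebsiteLoader(response=response)
        il.add_value('url', response.url)
        il.add_xpath('publish_date',
                     '//div[@id="update_data"]//td[@class="stime3"]//text()')
        yield il.load_item()

Note that a CrawlSpider must not override parse, because CrawlSpider uses that method internally to apply its rules; that is why the callback above has a different name.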