Scrapy spider cannot extract web page content with XPath

Date: 2015-10-15 03:15:40

Tags: python xpath web-crawler scrapy

I have a Scrapy spider and I am using XPath selectors to extract the page content. Please check where I have gone wrong.

1 Answer:

Answer 0 (score: 0)

There are quite a few problems with your code, so here is a different approach.

I chose to keep explicit control over the crawling process with a plain scrapy.Spider: the name is scraped from the query page and the story from each detail page.

I tried to simplify the XPath expressions by not diving deep into the (nested) table structure, but by looking for patterns in the content instead. So if you want to extract the stories, there must be a link to each story.
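
For example, on the query page every story is reachable through a link whose href points to display_story. The short sketch below contrasts a layout-dependent XPath with the pattern-based one actually used in the spider; the HTML fragment and the nested path are made up for illustration, only the display_story pattern is taken from the code further down:

from scrapy.selector import Selector

# Made-up fragment resembling a nested table layout on the query page
html = '''
<table><tr><td>
  <table><tr><td>
    <a href="display_story.php?id=1">Alice</a>
    <a href="display_story.php?id=2">Bob</a>
  </td></tr></table>
</td></tr></table>
'''

sel = Selector(text=html)

# Fragile: tied to the exact nesting of the table layout
nested = sel.xpath('//table/tr/td/table/tr/td/a/text()').extract()

# Robust: match the content pattern (any link that leads to a story)
pattern = sel.xpath('//a[contains(@href, "display_story")]/text()').extract()

print(nested)   # ['Alice', 'Bob'] for this fragment
print(pattern)  # ['Alice', 'Bob'] regardless of how deep the links are nested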

Here is the tested code (with comments):

# -*- coding: utf-8 -*-
import scrapy

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):

    name = 'medical'
    allowed_domains = ['yananow.org']
    start_urls = ['http://yananow.org/query_stories.php']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"display_story")]')

        # loop over all links to stories
        for row in rows:
            myItem = MyItem() # Create a new item
            myItem['name'] = row.xpath('./text()').extract() # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0]) # extract url from link
            request = scrapy.Request(url=story_url, callback=self.parse_detail) # create request for detail page with story
            request.meta['myItem'] = myItem # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem'] # extract the item (with the name) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract() # extract the story (text)
        myItem['story'] = ' '.join(t.strip() for t in text_raw) # strip each fragment and assign the joined text to the item
        yield myItem # return the item
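
A minimal sketch of how the spider could be run from a plain Python script, assuming the MySpider class above is defined in (or imported into) the same file; the output filename stories.json and the old-style FEED_URI/FEED_FORMAT settings (current around the time of this post, replaced by the FEEDS setting in newer Scrapy releases) are illustrative choices. Inside a regular Scrapy project, scrapy crawl medical -o stories.json gives the same result.

from scrapy.crawler import CrawlerProcess

# Configure the built-in feed export to write the scraped items to a JSON file
process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'stories.json',
})

process.crawl(MySpider)  # MySpider is the spider class defined above
process.start()          # blocks until the crawl is finished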