Extracting data from a nested XPath

Time: 2016-07-19 09:28:51

Tags: python xpath scrapy web-crawler

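I'm new to XPath, and I want to extract the title, body, link, and publication date of every post from this link.

Everything looks fine except the body, which comes back empty. How do I extract each item with a nested XPath? Thanks in advance :)

Here is my source: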

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from thehack.items import ThehackItem


class MySpider(BaseSpider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One selector per post; the paths below are relative to each post.
        posts = hxs.xpath('//article[@class="post item module"]')
        items = []
        for post in posts:
            item = ThehackItem()
            item['title'] = post.xpath('span/h2/a/text()').extract()
            item['link'] = post.xpath('span/h2/a/@href').extract()
            # This is the field that comes back empty.
            item['body'] = post.xpath('span/div/div/div/div/a/div/text()').extract()
            item['date'] = post.xpath('span/div/span/text()').extract()
            items.append(item)
        return items

Can anyone fix the body extraction? Only the body is failing. Thanks in advance. Here is a screenshot of the page in the element inspector.

1 answer:

Answer 0: (score: 1)

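I think it's the selectors you're struggling with, right? You should take a look at the selectors documentation; there is a lot of good information there. For this particular example, using CSS selectors, I think it would be something like this: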

import scrapy
from thehack.items import ThehackItem


class MySpider(scrapy.Spider):
    name = "thehack"
    allowed_domains = ["thehackernews.com"]
    start_urls = ["http://thehackernews.com/search/label/mobile%20hacking"]

    def parse(self, response):
        # Each query below is scoped to a single article element.
        for article in response.css('article.post'):
            item = ThehackItem()
            item['title'] = article.css('.post-title>a::text').extract_first()
            item['link'] = article.css('.post-title>a::attr(href)').extract_first()
            # Join every text node under the summary element into one string.
            item['body'] = ''.join(article.css('[id^=summary] *::text').extract()).strip()
            item['date'] = article.css('[itemprop="datePublished"]::attr(content)').extract_first()
            yield item

Converting these to XPath selectors would be a good exercise. Also take a look at ItemLoaders; they are very useful.