How do I combine XPaths?

Asked: 2016-10-10 10:17:06

Tags: xpath, scrapy

I have HTML elements that look like this:

(screenshot of the HTML markup)

I want to group the h1, div.article-meta and div.article-content elements together so that I can loop over them row by row and write out the data in my Scrapy project.

I want to collect each of them into a variable and then loop over that variable, but I am not sure how to go about it.

Any suggestions would be appreciated. Thanks.

This is what I have tried so far:

def parse(self, response):
    now = time.strftime('%Y-%m-%d %H:%M:%S')
    hxs = scrapy.Selector(response)

    titles = hxs.xpath('//div[@class="list-article"]/h1')
    images = hxs.xpath('//div[@class="list-article"]/feature-image')
    contents = hxs.xpath('//div[@class="list-article"]/article-content')

    for i, title in titles:
        item = DapnewsItem()
        item['categoryId'] = '1'

        name = titles[i].xpath('a/text()')
        if not name:
            print('DAP => [' + now + '] No title')
        else:
            item['name'] = name.extract()[0]

        description = contents[i].xpath('p/text()')
        if not description:
            print('DAP => [' + now + '] No description')
        else:
            item['description'] = description[1].extract()

        url = titles[i].xpath("a/@href")
        if not url:
            print('DAP => [' + now + '] No url')
        else:
            item['url'] = url.extract()[0]

        imageUrl = images[i].xpath('img/@src')
        if not imageUrl:
            print('DAP => [' + now + '] No imageUrl')
        else:
            item['imageUrl'] = imageUrl.extract()[0]

        yield item

This is the error I get:

(screenshot of the error traceback)

1 Answer:

Answer (score: 1):

Let's use this HTML snippet to illustrate:

<div class="list-article">

    <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
    <div class="article-meta">Something for 1</div>
    <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
    <div class="article-content"><p>Content 1</p></div>

    <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
    <div class="article-meta">Something for 2</div>
    <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
    <div class="article-content"><p>Content 2</p></div>

    <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
    <div class="article-meta">Something for 3</div>
    <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
    <div class="article-content"><p>Content 3</p></div>

</div>

You can loop over each <h1> and use XPath's following-sibling axis to reach the elements that follow it at the same level of the tree, filtering on the first match of each kind: for example, the first following <div class="feature-image"> is selected with following-sibling::div[@class="feature-image"][1].

>>> selector = scrapy.Selector(text='''<div class="list-article">
... 
...     <h1><a href="http://www.example.com/article1.html">Title 1</a></h1>
...     <div class="article-meta">Something for 1</div>
...     <div class="feature-image"><img src="http://www.example.com/image1.jpg"></div>
...     <div class="article-content"><p>Content 1</p></div>
... 
...     <h1><a href="http://www.example.com/article2.html">Title 2</a></h1>
...     <div class="article-meta">Something for 2</div>
...     <div class="feature-image"><img src="http://www.example.com/image2.jpg"></div>
...     <div class="article-content"><p>Content 2</p></div>
... 
...     <h1><a href="http://www.example.com/article3.html">Title 3</a></h1>
...     <div class="article-meta">Something for 3</div>
...     <div class="feature-image"><img src="http://www.example.com/image3.jpg"></div>
...     <div class="article-content"><p>Content 3</p></div>
...     
... </div>''')

>>> for h in selector.css('div.list-article > h1'):
...     item = {
...         'title': h.xpath('a/text()').extract_first(),
...         'image': h.xpath('''
...             following-sibling::div[@class="feature-image"][1]
...                 /img/@src''').extract_first(),
...         'content': h.xpath('''
...             following-sibling::div[@class="article-content"][1]
...                 /p/text()''').extract_first(),
...     }
...     print(item)
... 
{'content': u'Content 1', 'image': u'http://www.example.com/image1.jpg', 'title': u'Title 1'}
{'content': u'Content 2', 'image': u'http://www.example.com/image2.jpg', 'title': u'Title 2'}
{'content': u'Content 3', 'image': u'http://www.example.com/image3.jpg', 'title': u'Title 3'}
>>>
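
For completeness, here is a minimal sketch of how this grouping could be folded back into the parse() callback from the question. The DapnewsItem fields and the div.list-article markup are taken from the question; the import path for the item class and the start URL are placeholders, not taken from the real project:

import time

import scrapy

from myproject.items import DapnewsItem  # hypothetical import path


class DapNewsSpider(scrapy.Spider):
    name = 'dapnews'
    start_urls = ['http://www.example.com/']  # placeholder start URL

    def parse(self, response):
        now = time.strftime('%Y-%m-%d %H:%M:%S')
        # One iteration per article: every <h1> directly under div.list-article
        for h in response.css('div.list-article > h1'):
            item = DapnewsItem()
            item['categoryId'] = '1'

            name = h.xpath('a/text()').extract_first()
            if name:
                item['name'] = name
            else:
                self.logger.info('DAP => [%s] No title', now)

            url = h.xpath('a/@href').extract_first()
            if url:
                item['url'] = url

            # The first matching sibling after this <h1> belongs to the same article
            image_url = h.xpath(
                'following-sibling::div[@class="feature-image"][1]/img/@src'
            ).extract_first()
            if image_url:
                item['imageUrl'] = image_url

            description = h.xpath(
                'following-sibling::div[@class="article-content"][1]/p/text()'
            ).extract_first()
            if description:
                item['description'] = description

            yield item

extract_first() returns None when nothing matches, so the per-field checks stay simple and there is no need to index into parallel lists of titles, images and contents.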