Question

I'm using a Scrapy spider.

I have this HTML:

<div class="sliderContent">
<p>some content, some other content</p>
<p>some content, some other content</p>
<p>some content, some other content</p>
<p>some content, some other content</p>
</div>

My XPath:

item['Description'] = sel.xpath('div[@class="content"]/div/div[@class="sliderContent"]//p').extract()

I want to escape the commas inside the <p> elements and extract all of the content, keeping the HTML. I tried this:

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@class="container"]'):
            item = LuItem()
            item['Description'] = sel.xpath('div[@class="content"]/div/div[@class="sliderContent"]//p').extract()[0].replace(',', '\\,')
            yield item

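This obviously only works for the first <p>, but how can I do the same for all of the <p> elements?

I'm just starting out with Python, so any help is much appreciated!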

Answer 1

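The result of extract() is a list, and you are only modifying its first element (index [0]). You need to go through the whole Description list: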

def parse_dir_contents(self, response):
    for sel in response.xpath('//div[@class="container"]'):
        item = LuItem()
        # extract() returns a list with one HTML string per matched <p>
        paragraphs = sel.xpath('div[@class="content"]/div/div[@class="sliderContent"]//p').extract()
        # Escape the commas in every paragraph, not just the first one
        item['Description'] = [field.replace(',', '\\,') for field in paragraphs]
        yield item
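
The per-element step can also be tried outside Scrapy on a plain list of strings, which is what extract() hands back. The following is only a minimal sketch: the helper name escape_commas and the sample paragraphs list are made up for illustration and are not part of Scrapy or of the original spider.

```python
def escape_commas(fragments):
    """Return a new list with every comma in each fragment backslash-escaped."""
    # str.replace touches every comma in a string; the list comprehension
    # applies it to every fragment instead of only index 0.
    return [fragment.replace(',', '\\,') for fragment in fragments]


# Hypothetical stand-in for the list returned by .extract():
# one HTML string per matched <p>.
paragraphs = [
    '<p>some content, some other content</p>',
    '<p>no commas here</p>',
]

escaped = escape_commas(paragraphs)
print(escaped[0])
```

In the spider, the same helper could presumably be applied to the extracted list before it is assigned to item['Description'].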

Escaping commas in Scrapy XPath
