Concatenating nested text with XPath in Scrapy

Time: 2015-07-20 00:41:52

Tags: python html xpath web-scraping scrapy

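I have been trying to concatenate some nested text with XPath in Scrapy (I believe Scrapy uses XPath 1.0). I have read a lot of other posts, but nothing seems to get quite the result I want.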

Here is the relevant part of the HTML (actual page: http://adventuretime.wikia.com/wiki/List_of_episodes):

<tr>
<td colspan="5" style="border-bottom: #BCD9E3 3px solid">
    Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.
</td>
</tr>

<tr>
<td colspan="5" style="border-bottom: #BCD9E3 3px solid">
Finn must travel to <a href="/wiki/Lumpy_Space" title="Lumpy Space">Lumpy Space</a> to find a cure that will save Jake, who was accidentally bitten by <a href="/wiki/Lumpy_Space_Princess" title="Lumpy Space Princess">Lumpy Space Princess</a> at Princess Bubblegum's annual 'Mallow Tea Ceremony.'
</td>
</tr>

(much more stuff here)

Here is the result I want:

[u'Finn and Princess Bubblegum must protect the Candy Kingdom from a horde of candy zombies they accidentally
    created.\n', u'Finn must travel to Lumpy Space to find a cure that will save Jake, who was accidentally bitten', (more stuff here)]

I tried the answer from "HTML XPath: Extracting text mixed in with multiple tags?":

description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']/parent::tr/td[descendant-or-self::text()]").extract()

But that only gives me back the whole element, markup included:

[u'<td colspan="5" style="border-bottom: #BCD9E3 3px solid">Finn and Princess Bubblegum must protect the <a href="/wiki/
Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.\n</td>',

The string() answer does not seem to work for me either: I only get a list with a single entry, when there should be many more.

The closest I have come is:

description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()

Which gives me back:

[u'Finn and Princess Bubblegum must protect the ', u'Candy Kingdom', u' from a horde of candy zombies they accidentally
created.\n', u'Finn must travel to ', u'Lumpy Space', u' to find a cure that will save Jake, who was accidentally bitten, (more stuff here)]

Does anyone have XPath tips for concatenating these?

Thanks!

Edit: spider code

from scrapy import Selector, Spider


class AT_Episode_Detail_Spider_2(Spider):

    name = "ep_detail_2"
    allowed_domains = ["adventuretime.wikia.com"]
    start_urls = [
        "http://adventuretime.wikia.com/wiki/List_of_episodes"
    ]

    def parse(self, response):
        sel = Selector(response)

        # Returns every text node as its own list entry, across all cells
        description = sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract()
        print(description)

1 answer:

Answer 0 (score: 3)

Concatenate manually with join():

description = " ".join(sel.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan='5']//text()").extract())
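
A note on the separator, offered as a minimal sketch rather than as part of the original answer: the text nodes returned by //text() already carry their own leading and trailing spaces, so joining with " " can double them, and a single join over the whole result produces one string for the entire table. The example below uses hand-written fragment lists (hypothetical data, shaped like the output in the question and grouped per cell by hand) and a hypothetical helper join_cell to show joining with an empty separator and then collapsing whitespace.

```python
# Hypothetical per-cell fragments, copied from the text in the question's HTML.
cells = [
    ["Finn and Princess Bubblegum must protect the ", "Candy Kingdom",
     " from a horde of candy zombies they accidentally created.\n"],
    ["Finn must travel to ", "Lumpy Space",
     " to find a cure that will save Jake, who was accidentally bitten by ",
     "Lumpy Space Princess",
     " at Princess Bubblegum's annual 'Mallow Tea Ceremony.'\n"],
]


def join_cell(fragments):
    """Join one cell's text nodes, then collapse runs of whitespace."""
    return " ".join("".join(fragments).split())


# Joining with " " keeps each fragment's own spaces, so they double up.
spaced = " ".join(cells[0])

# One clean string per cell.
descriptions = [join_cell(fragments) for fragments in cells]

print(repr(spaced))
print(descriptions)
```

Whether the doubled spaces matter depends on what the descriptions are used for; collapsing whitespace afterwards makes the choice of separator mostly irrelevant.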

Or use the Join() processor together with an Item Loader.

Here is sample code that gets a list of episode descriptions, one string per table cell:

def parse(self, response):
    # Join each cell's text nodes separately, skipping text nested inside <sup> elements
    description = [" ".join(row.xpath(".//text()[not(ancestor::sup)]").extract())
                   for row in response.xpath("//table[@class='wikitable']/tr[position()>1]/td[@colspan]")]
    print(description)
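
To try the per-cell idea without a network request, here is a minimal offline sketch. It is a stand-in under stated assumptions, not the answer's code: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, and it wraps the two rows from the question in a table element so they parse as one well-formed document (real pages usually need a proper HTML parser). Joining itertext() for one cell gives the same text as concatenating that cell's .//text() nodes.

```python
import xml.etree.ElementTree as ET

# The two rows from the question, wrapped in a table so they form one document.
HTML = """<table class="wikitable">
<tr>
<td colspan="5" style="border-bottom: #BCD9E3 3px solid">
    Finn and Princess Bubblegum must protect the <a href="/wiki/Candy_Kingdom" title="Candy Kingdom">Candy Kingdom</a> from a horde of candy zombies they accidentally created.
</td>
</tr>
<tr>
<td colspan="5" style="border-bottom: #BCD9E3 3px solid">
Finn must travel to <a href="/wiki/Lumpy_Space" title="Lumpy Space">Lumpy Space</a> to find a cure that will save Jake, who was accidentally bitten by <a href="/wiki/Lumpy_Space_Princess" title="Lumpy Space Princess">Lumpy Space Princess</a> at Princess Bubblegum's annual 'Mallow Tea Ceremony.'
</td>
</tr>
</table>"""

table = ET.fromstring(HTML)

# One string per cell: itertext() walks the cell's own text and the text of
# its nested <a> elements in document order.
descriptions = [
    "".join(cell.itertext()).strip()
    for cell in table.findall("./tr/td[@colspan]")
]

for text in descriptions:
    print(text)
```

This also suggests why the string() attempt in the question returned a single entry: in XPath 1.0, string() applied to a node-set converts only the first node, so it has to be evaluated once per cell, for example with row.xpath("string(.)") inside the same loop as above.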