Joining nested XPath text in Scrapy - v2.0

Time: 2016-09-26 13:56:38

Tags: python-2.7 xpath scrapy

I am trying to extract all the text in this HTML that sits inside the elements with itemprop="ingredients".

I saw this answer, which does exactly what I want, but it specifies a particular element, and my text is not all nested inside one.

Here is the HTML:

<li itemprop="ingredients">Beginning of ingredient
     <a href="some-link" data-ct-category="Other"
     data-ct-action="Site Search"
     data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise"
     data-ct-attr="some_attr">Rest of Ingredient</a>
</li>   
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>

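What I need is the text returned as a list. The first element of that list would be "Beginning of ingredient [insert a space, a join, or whatever here] Rest of Ingredient", and the remaining elements would each be "Another ingredient".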

I got close:

>>> for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'):
...      print row.extract()
...
Beginning of ingredient
Rest of Ingredient

    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient

So when I use extract_first() on each row to put the results into a list, I get this:

 ['Beginning of ingredient', "Rest of Ingredient", 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

But I want this:

 ['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

1 answer:

Answer 0: (score: 0)

You are close. Iterate over each li element, then run a context-specific descendant-or-self query on it and join the text nodes it returns:

In [1]: [" ".join(map(unicode.strip, item.xpath("descendant-or-self::text()").extract())) 
         for item in response.xpath('//li[@itemprop="ingredients"]')]
Out[1]: 
[u'Beginning of ingredient Rest of Ingredient ',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient']
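
Note that the first item above still ends with a trailing space, because the whitespace-only text node after the closing </a> is stripped to an empty string and then joined in. A minimal sketch of one way to avoid that, written for Python 3 (where `unicode.strip` becomes `str.strip`): the helper name `join_text_nodes` is hypothetical and not part of Scrapy, and the sample fragments are only an approximation of the text nodes Scrapy would return for the first li.

```python
def join_text_nodes(fragments, sep=" "):
    """Strip each text fragment, drop the empty ones, and join the rest."""
    cleaned = (fragment.strip() for fragment in fragments)
    return sep.join(fragment for fragment in cleaned if fragment)


# Approximate text nodes of the first <li> in the question: the text before
# the <a>, the text inside it, and the whitespace after it (assumed values).
first_li = ["Beginning of ingredient\n     ", "Rest of Ingredient", "\n"]
print(join_text_nodes(first_li))

# Inside a spider callback this would be used roughly like so (untested here,
# since it needs a live Scrapy response):
# ingredients = [
#     join_text_nodes(item.xpath("descendant-or-self::text()").extract())
#     for item in response.xpath('//li[@itemprop="ingredients"]')
# ]
```

Filtering out the empty fragments before joining is what removes the trailing space. Another option worth trying is to evaluate XPath's normalize-space(.) on each li, which collapses runs of whitespace in the element's string value.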