Question

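I am trying to extract all of the text in this HTML that sits inside elements with itemprop="ingredients".

I saw this answer, which does exactly what I want, but it specifies the element, and my text isn't nested inside one.

Here is the HTML: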

<li itemprop="ingredients">Beginning of ingredient
     <a href="some-link" data-ct-category="Other"
     data-ct-action="Site Search"
     data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise"
     data-ct-attr="some_attr">Rest of Ingredient</a>
</li>   
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>

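What I need is the text returned as a list. The first element would be "Beginning of ingredient Rest of Ingredient" (the two pieces joined with a space or similar), and each of the other elements would be "Another ingredient".

I got close: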

>>> for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'):
...      print row.extract()
...
Beginning of ingredient
Rest of Ingredient

    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient

So when I call extract_first() on each row and put the results into a list, I get this:

 ['Beginning of ingredient', "Rest of Ingredient", 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

But I want this:

 ['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

Answer 1

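You were close. Loop over each li element, then call descendant-or-self::text() relative to that element, so the text nodes stay grouped per item: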

In [1]: [" ".join(map(unicode.strip, item.xpath("descendant-or-self::text()").extract())) 
         for item in response.xpath('//li[@itemprop="ingredients"]')]
Out[1]: 
[u'Beginning of ingredient Rest of Ingredient ',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient']
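Two details of the session above are worth a note. `unicode.strip` only exists on Python 2; on Python 3 the equivalent is `str.strip`. Also, the trailing space in the first item of `Out[1]` comes from the whitespace-only text node after the closing `</a>`: it strips down to an empty string, which still gets joined in. A minimal sketch of a helper that drops those empty pieces follows; the name `join_text_nodes` is hypothetical, and the sample list is an assumption about what `.extract()` returns for the first `<li>` in the question.

```python
def join_text_nodes(fragments):
    """Strip each text node, drop the empty ones, join the rest with single spaces."""
    cleaned = (fragment.strip() for fragment in fragments)
    return " ".join(piece for piece in cleaned if piece)


# Assumed text nodes of the first <li>: the leading text, the link text,
# and the whitespace-only node after </a>.
first_item = ["Beginning of ingredient\n     ", "Rest of Ingredient", "\n"]
print(join_text_nodes(first_item))

# Inside a Scrapy callback, usage might look like this (not run here):
# ingredients = [
#     join_text_nodes(item.xpath("descendant-or-self::text()").extract())
#     for item in response.xpath('//li[@itemprop="ingredients"]')
# ]
```

If post-processing in Python is not wanted, the standard XPath 1.0 function `normalize-space(.)`, evaluated relative to each `li`, should give a similar result, since it trims the ends and collapses internal runs of whitespace.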

Joining XPath nested text in Scrapy
