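I'm trying to extract all of the text in this HTML that sits inside elements with itemprop="ingredients".
I saw this answer, which does exactly what I want, but it specifies an element, and my text isn't nested inside one.
Here is the HTML: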
<li itemprop="ingredients">Beginning of ingredient
<a href="some-link" data-ct-category="Other"
data-ct-action="Site Search"
data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise"
data-ct-attr="some_attr">Rest of Ingredient</a>
</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
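What I need is the text returned as a list. The first element of this list would be "Beginning of ingredient [insert space, join, or whatever here] Rest of Ingredient", and the other elements would be "Another ingredient".
I got close: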
>>> for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'):
... print row.extract()
...
Beginning of ingredient
Rest of Ingredient
Another ingredient
Another ingredient
Another ingredient
Another ingredient
Another ingredient
So when I call extract_first() on each row and put the results into a list, I get this:
['Beginning of ingredient', "Rest of Ingredient", 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']
But I want this:
['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']
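Answer 0 (score: 0)
You are close. Iterate over each li element, then call a context-specific descendant-or-self on it. Because the inner XPath is evaluated relative to each li, the text nodes stay grouped per ingredient instead of being flattened into one list: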
In [1]: [" ".join(map(unicode.strip, item.xpath("descendant-or-self::text()").extract()))
for item in response.xpath('//li[@itemprop="ingredients"]')]
Out[1]:
[u'Beginning of ingredient Rest of Ingredient ',
u'Another ingredient',
u'Another ingredient',
u'Another ingredient',
u'Another ingredient',
u'Another ingredient']
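Note that the first result above keeps a trailing space, because the whitespace-only text node after the link strips down to an empty string that still gets joined, and that unicode.strip only exists on Python 2. The following is a minimal sketch of the same per-li grouping that drops empty pieces before joining. It deliberately swaps Scrapy for the standard library's xml.etree.ElementTree so it runs without a response object; the sample markup is a shortened, hypothetical version of the question's HTML wrapped in a ul so it parses as XML, and ingredient_texts is an illustrative name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the question's markup, wrapped in <ul>
# so that it is well-formed XML for the standard-library parser.
HTML = """<ul>
<li itemprop="ingredients">Beginning of ingredient
<a href="some-link" data-ct-category="Other">Rest of Ingredient</a>
</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
</ul>"""


def ingredient_texts(markup):
    """Return one joined string per <li itemprop="ingredients"> element."""
    root = ET.fromstring(markup)
    result = []
    for li in root.iter("li"):
        if li.get("itemprop") != "ingredients":
            continue
        # itertext() yields the element's own text and all descendant text,
        # in document order, much like descendant-or-self::text().
        parts = (text.strip() for text in li.itertext())
        # Skip pieces that were whitespace only, so no trailing space remains.
        result.append(" ".join(part for part in parts if part))
    return result


print(ingredient_texts(HTML))
```

In a Scrapy session the same filtering idea should carry over to the comprehension above: strip each extracted text node, keep only the non-empty ones, then join. On Python 3, str.strip takes the place of unicode.strip.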