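I am using Scrapy, a Python-based framework, to scrape a website, but I can't figure out how to select the text of the div with the class value ellipsis ph. Sometimes that div contains strong tags. So far I have only managed to extract the text when there are no strong child tags.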
<div class="right">
<div class="attrs">
<div class="attr">
<span class="name">Main Products:</span>
<div class="value ellipsis ph">
// Below, I need to select the text while ignoring the strong tags
<strong>Shoes</strong>
(Sport
<strong>Shoes</strong>
,Casual
<strong>Shoes</strong>
,Hiking
<strong>Shoes</strong>
,Skate
<strong>Shoes</strong>
,Football
<strong>Shoes</strong>
)
</div>
</div>
</div>
</div>
<div class="right">
<div class="attrs">
<div class="attr">
<span class="name">Main Products:</span>
<div class="value ellipsis ph">
Cap, Shoe, Bag // could select this
</div>
</div>
</div>
</div>
Starting from the root of the selected node, the following works, but it only selects the text that is not inside a strong tag:
"/div[@class='right']/div[@class='attrs']/div[@class='attr']/div/text()").extract()
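Answer 0 (score: 2)
As @splash58 wrote in a comment, the XPath
//div[@class="value ellipsis ph"]//text()
gets the text content of both divs. For the first div the result is a list of text nodes, and it contains the text inside the <strong> tags as well as the text outside them. That is because //text() selects every text node in the subtree, including those nested in child tags, whereas a bare text() step selects only the direct child text nodes.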
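Answer 1 (score: 2)
Assuming you want the text representation of the div elements with the value ellipsis ph class, you can either join the output of .//text() or apply XPath's string() function to the div elements. Here are the two options: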
>>> import scrapy
>>> selector = scrapy.Selector(text="""<div class="right">
... <div class="attrs">
... <div class="attr">
... <span class="name">Main Products:</span>
... <div class="value ellipsis ph">
... <!-- // Here below i needed to select it ignoring the strong tag -->
... <strong>Shoes</strong>
... (Sport
... <strong>Shoes</strong>
... ,Casual
... <strong>Shoes</strong>
... ,Hiking
... <strong>Shoes</strong>
... ,Skate
... <strong>Shoes</strong>
... ,Football
... <strong>Shoes</strong>
... )
... </div>
... </div>
... </div>
... </div>
...
...
... <div class="right">
... <div class="attrs">
... <div class="attr">
... <span class="name">Main Products:</span>
... <div class="value ellipsis ph">
... Cap, Shoe, Bag <!-- // could select this -->
...
... </div>
... </div>
... </div>
... </div>""")
>>> for div in selector.css('div.value.ellipsis.ph'):
...     print("---")
...     print("".join(div.xpath('.//text()').extract()))
...
---
Shoes
(Sport
Shoes
,Casual
Shoes
,Hiking
Shoes
,Skate
Shoes
,Football
Shoes
)
---
Cap, Shoe, Bag
>>> for div in selector.css('div.value.ellipsis.ph'):
...     print("---")
...     print(div.xpath('string()').extract_first())
...
---
Shoes
(Sport
Shoes
,Casual
Shoes
,Hiking
Shoes
,Skate
Shoes
,Football
Shoes
)
---
Cap, Shoe, Bag
>>>