How to scrape the text between various tags using scrapy

Date: 2013-07-01 14:33:37

Tags: python scrapy

I am trying to scrape the product description from this link. But how do I scrape the whole text, including the text between the <b> tags? This is my hxs expression: hxs.select('//div[@class="overview"]/div/text()').extract(), and this is the raw HTML:

These classic sneakers from
<b>Puma</b>
are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a
<b>leather and synthetic upper.</b>
A vulcanized non-slip rubber sole that is
<b>abrasion resistant ensures good traction.</b>

If I use the hxs expression mentioned above, I get this:

hxs.select('//div[@class="overview"]/div/text()').extract()
Output: 
[u'These classic sneakers from ',
 u' are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a ',
 u' A vulcanized non-slip rubber sole that is ',
 u' sportswear, jeans and tees.',
 u' Gently brush away dust or dirt using a soft cleaning brush.',
 u'\r\nUse a leather conditioner/wax and a brush for added shine.',
 u'Avoid contact with liquids.\xa0']

What I want is this:

These classic sneakers from Puma are best known for their neat and simple design. These basketball shoes are crafted by novel tooling that brings the sleek retro sneaker look. The pair is equipped with a leather and synthetic upper. A vulcanized non-slip rubber sole that is abrasion resistant ensures good traction.

As you can see, the text that was inside the <b> tags is missing. Can you tell me how to extract the whole text from the page?

1 answer:

Answer 0 (score: 3)

Try getting the entire content of the tag by selecting

 //div[@class="overview"]/div
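For example, with the hxs selector from the question (the old Scrapy HtmlXPathSelector API), that selection returns the raw HTML of the <div>, tags included. A minimal sketch, with an illustrative variable name that is not from the original answer:

 # extract() returns a list of the matching nodes as raw HTML strings
 raw_html = hxs.select('//div[@class="overview"]/div').extract()[0]
 # raw_html still contains the <b>...</b> markup along with the text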

You can then use a regular expression to strip the tags from it, or keep them if they are not a problem.

A regular expression like this:

 re.sub('<[^>]*>', '', mystring)
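Putting both steps together, a rough sketch (the variable names and the final whitespace cleanup are assumptions for illustration, not part of the original answer):

 import re

 # select the whole inner <div>, markup included
 raw_html = hxs.select('//div[@class="overview"]/div').extract()[0]
 # drop every tag such as <b>...</b>, keeping only the text
 text = re.sub('<[^>]*>', '', raw_html)
 # optional: collapse the whitespace left behind by the markup into single spaces
 text = ' '.join(text.split())
 print(text)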