Question

I'm new to Python and am using the Scrapy web crawler. I want to grab the first 10 characters of the description string and use them as the title.

The following Python snippet produces the JSON below it:

item['image'] = img.xpath('@src').extract()
item_desc = img.xpath('@title').extract()
print(item_desc)
item['description'] = item_desc
item['title'] = item_desc[:10]
item['parentUrl'] = response.url

{'description': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com '],
 'image': [u'http://www.sportsauthority.com/graphics/product_images/pTSA-10854895t130.jpg'],
 'parentUrl': 'http://www.sportsauthority.com/category/index.jsp?categoryId=3077576&clickid=topnav_Jerseys+%26+Fan+Shop',
 'title': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']}

What I want is the following. The slice isn't behaving as I expected.

{'description': [u'CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com '],
 'image': [u'http://www.sportsauthority.com/graphics/product_images/pTSA-10854895t130.jpg'],
 'parentUrl': 'http://www.sportsauthority.com/category/index.jsp?categoryId=3077576&clickid=topnav_Jerseys+%26+Fan+Shop',
 'title': [u'CHAR-BROIL']}

Answer 1

item_desc is a list containing one element, and that element is a unicode string. It is not itself a unicode string. The [...] in the output is a big hint.

Get the element, slice it, and put it back in a list:

item['title'] = [item_desc[0][:10]]
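
To see why the original slice changed nothing, here is a minimal sketch that needs no Scrapy at all. It assumes a plain list standing in for the value .extract() returned in the question, and it is written for Python 3, so the strings print without the u prefix:

```python
# Stand-in for the value returned by .extract() in the question:
# a list holding a single string.
item_desc = ['CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']

# Slicing the list keeps up to 10 *elements*. There is only one element,
# so the result equals the original list.
list_slice = item_desc[:10]

# Indexing first and then slicing keeps the first 10 *characters*
# of the string; wrapping it in [] puts it back in a list.
title = [item_desc[0][:10]]

print(list_slice == item_desc)  # True
print(title)                    # ['CHAR-BROIL']
```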

Evidently the .extract() method can return multiple matches; if you only expect one match, you can also just select the first:

item['image'] = img.xpath('@src').extract()[0]
item_desc = img.xpath('@title').extract()[0]
item['description'] = item_desc
item['title'] = item_desc[:10]

If your XPath query doesn't always return a result, test for an empty list first:

img_match = img.xpath('@src').extract()
item['image'] = img_match[0] if img_match else ''
item_desc = img.xpath('@title').extract()
item['description'] = item_desc[0] if item_desc else ''
item['title'] = item_desc[0][:10] if item_desc else ''
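
The repeated "x[0] if x else ''" pattern could also be pulled into a small helper. The sketch below is only an illustration: first_or_default is a hypothetical name, not part of Scrapy, and plain lists stand in for the results of .extract():

```python
def first_or_default(matches, default=''):
    """Return the first item of a list of matches, or default if the list is empty.

    Hypothetical helper for illustration; it is not a Scrapy function.
    """
    return matches[0] if matches else default


# Plain lists standing in for what .extract() might return.
found = ['CHAR-BROIL Tru-Infrared 350 IR Gas Grill - SportsAuthority.com ']
missing = []

# Slicing the default '' is safe, so no second emptiness check is needed.
title_found = first_or_default(found)[:10]
title_missing = first_or_default(missing)[:10]

print(repr(title_found))    # 'CHAR-BROIL'
print(repr(title_missing))  # ''
```

Recent Scrapy releases also provide .extract_first() on selector lists, which covers the same case; whether it is available depends on the version you have installed.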

What is the correct way to slice a unicode string in Python?
