Question

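I am new to Scrapy and also new to Python.

I want to fill item_pub['rating']. The rating comes as a string of the form "rating is 4", but I only want the number. How can I get it?

I tried the two approaches below, but I don't know whether they make any sense. Neither of them works.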

> item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt').re(r'\d+') #to extract only the number since the result with extract() would be "rating is 4"

or

> item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt')[-1:].extract() #to extract only the number since the result with extract() would be "rating is 4"

Thank you very much for your help, and sorry for my English. I hope my question is clear.

Answer 1

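Your thinking is fine: a regular expression is the right tool here. The XPath, however, is poorly written.
Here are some tips:

There is no need for /html/body//; you can simply start with //
There is no need to select every element with //* only to pick a single element afterwards. Select the element you want directly: //div
If you found this XPath with your browser, there is most likely no real tbody element in the page source, because browsers often add one when rendering tables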

Try it like this:

item_pub['rating'] = review.xpath('//div[@class="details"]/table[@class="detailtoptable"]/tr[1]/td/img/@alt').re_first(r'\d+')
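
The tbody tip and the regex step can be checked without running a spider. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset) as a stand-in for Scrapy's selector, and the PAGE fragment, first_alt and first_number are hypothetical names made up for this example, not anything from the asker's site or from Scrapy.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the page in the question.
# The raw markup has no <tbody>; a browser would insert one when rendering.
PAGE = """
<html><body>
  <div class="details">
    <table class="detailtoptable">
      <tr><td><img alt="rating is 4" src="stars.gif"/></td></tr>
      <tr><td>other row</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

def first_alt(path):
    """Return the alt text of the first <img> matched by path, or None."""
    img = root.find(path)
    return None if img is None else img.get("alt")

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text or "")
    return match.group() if match else None

# Path copied from the browser (with tbody) versus path matching the raw source.
with_tbody = first_alt(".//div[@class='details']/table[@class='detailtoptable']/tbody/tr[1]/td/img")
without_tbody = first_alt(".//div[@class='details']/table[@class='detailtoptable']/tr[1]/td/img")

rating = first_number(without_tbody)
print(with_tbody, without_tbody, rating)
```

The path containing tbody matches nothing, while the one without it returns the alt text, from which the regex keeps only the digits. The result is still a string, so wrap it in int() if the item field should hold a number.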

Answer 2

With Beautiful Soup, you can do it like this:

>>> import re
>>> from bs4 import BeautifulSoup
>>> s = r'''<td> <img alt="rating is 4" title="rating is 4" src="/Shared\images\ratingstars_web8.gif"/> </td>'''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [re.search(r'\d+', i['alt']).group() for i in soup.select('td > img[alt*="rating"]')]
['4']

Scrapy selector XPath: extract with a regex match or slice the string
