Question

I'm trying to extract content from a web page. First, I used BeautifulSoup to extract a div named "scores", which contains several images like this:

<img class="sprite-rating_s_fill rating_s_fill s45" src="http://e2.tacdn.com/img2/x.gif" alt="4.5 of 5 stars">

I want to extract the score from this image, which in this case is "4.5". So I tried this:

pattern = re.compile('<img.*?alt="(.*?) of 5 stars">', re.S)
items = re.findall(pattern, scores)

But it doesn't work. I'm new to web scraping, so could someone help me with this?

Answer 1

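BeautifulSoup actually makes it very easy to pull information out of a tag like this! Assuming scores is a BeautifulSoup Tag object for the <img> element (you can read about Tag objects in their documentation), what you want to do is extract the alt attribute from the tag, since that is where the rating text lives (src only holds the image URL):

alt = scores['alt']

For the example you just gave, alt should be '4.5 of 5 stars'. Now you just need to remove ' of 5 stars':

removeIndex = alt.index(' of 5 stars')
score = alt[:removeIndex]

You'll be left with score '4.5'. (If you want to work with it as a number, you'll have to do score = float(score).)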
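
Since the question mentions a div holding several images, here is a minimal sketch of how the same idea might be applied to every <img> inside it. The markup, the "scores" class name, the second image, and the extract_scores helper are all assumptions made up for illustration; only the first <img> comes from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the "scores" div from the question.
# The class name and the second <img> are assumptions for illustration.
html = """
<div class="scores">
  <img class="sprite-rating_s_fill rating_s_fill s45" src="http://e2.tacdn.com/img2/x.gif" alt="4.5 of 5 stars">
  <img class="sprite-rating_s_fill rating_s_fill s30" src="http://e2.tacdn.com/img2/x.gif" alt="3 of 5 stars">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
scores = soup.find("div", class_="scores")

# Matches the leading number in alt text such as "4.5 of 5 stars".
ALT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+of\s+5\s+stars")


def extract_scores(container):
    """Return the rating of every <img> inside container as a float."""
    ratings = []
    for img in container.find_all("img"):
        # .get() avoids a KeyError when an image has no alt attribute.
        match = ALT_PATTERN.match(img.get("alt", ""))
        if match:
            ratings.append(float(match.group(1)))
    return ratings


print(extract_scores(scores))  # prints [4.5, 3.0]
```

Applying the regular expression to the alt text alone, rather than to the raw HTML, keeps the pattern short and means images without a matching alt attribute are simply skipped.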