Question

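For a course, I have an exercise in which I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regular expression to try to find them. But I keep getting a count that I know is wrong. What is wrong with my code?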

import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

Answer 1

Don't use regular expressions to parse HTML; use an HTML parser such as lxml or BeautifulSoup. Here is a working example that gets the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests


def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(response.content, 'html.parser')

    return len(soup.find_all('img'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here is a working example using lxml and requests:

from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets print 106.
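
If installing third-party packages is not an option, the same "use a parser, not a regex" idea can be sketched with the standard library's html.parser module. This is only an illustration: the names ImgCounter and count_img_tags and the sample markup are made up for this example, and it works on an HTML string rather than fetching a page.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags seen while parsing (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == 'img':
            self.count += 1


def count_img_tags(html):
    """Return the number of <img> tags in an HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample markup, not the page from the question.
sample = '<html><body><img src="a.png"><IMG SRC="b.png"/><p>no image</p></body></html>'
print(count_img_tags(sample))
```

To use it on a real page, the decoded text of the response (for example response.text from requests) could be passed to count_img_tags.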

Hope that helps.

Answer 2

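Ah, regular expressions.

Your regex pattern <img.*> says: "find something that starts with <img, followed by any amount of stuff, and make sure it ends with >".

Regular expressions are greedy, though. The .* will swallow as much as it possibly can while still leaving a single > character somewhere afterwards to satisfy the pattern. In this case it runs all the way to the end of the document, to the closing </html> tag, and says "look, I found a > right there!"

You should get the right count by making .* non-greedy: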

<img.*?>
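
To see the difference, here is a minimal sketch that runs both patterns over a short, made-up HTML string (the markup is an assumption for illustration, not the page from the question):

```python
import re

# Made-up sample markup, kept on one line so that '.' can run across all of it.
html = '<html><body><img src="a.png"><p>text</p><img src="b.png"></body></html>'

greedy = re.compile('<img.*>', re.I)       # pattern from the question
non_greedy = re.compile('<img.*?>', re.I)  # pattern suggested above

greedy_matches = greedy.findall(html)
non_greedy_matches = non_greedy.findall(html)

# The greedy pattern swallows everything from the first <img to the last '>'.
print(len(greedy_matches))      # 1
# The non-greedy pattern stops at the first '>' after each <img.
print(len(non_greedy_matches))  # 2
print(non_greedy_matches)
```

The single greedy match spans from the first <img all the way to the final > of </html>, which is why the count collapses to one.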

Answer 3

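Your regex is greedy, so it matches far more than you want. I suggest using an HTML parser.

If you must do it the regex way, then

img_pat = re.compile('<img.*?>',re.I) will do the trick. The ? makes it non-greedy.

A good website for checking your regex matches on the fly: http://www.pyregex.com/
Learn more about regular expressions: http://docs.python.org/2/library/re.html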
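
For completeness, here is one way the non-greedy pattern might be dropped into the function from the question. This is a sketch, not a definitive fix. Two things in it are assumptions that go beyond this answer: the response bytes are decoded as UTF-8 instead of being passed through str(), and re.S is added so that a tag broken across several lines still matches once the text contains real newlines. The helper name count_imgs_in_text is made up for the example.

```python
import re
import sys
import urllib.request

# Non-greedy pattern from this answer. re.S is an added assumption so that
# '.' also matches newlines inside a tag that is split over several lines.
img_pat = re.compile('<img.*?>', re.I | re.S)


def count_imgs_in_text(text):
    """Count <img ...> tags in an already-decoded HTML string."""
    return len(img_pat.findall(text))


def get_img_cnt(url):
    """Fetch a page and count its <img> tags; exits if the fetch fails."""
    try:
        with urllib.request.urlopen(url) as w:
            raw = w.read()
    except IOError:
        sys.stderr.write("Couldn't connect to %s\n" % url)
        sys.exit(1)
    # Assumption: the page is UTF-8; undecodable bytes are replaced.
    contents = raw.decode('utf-8', errors='replace')
    return count_imgs_in_text(contents)


# Usage (needs network access):
# print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
```

Even with this change, a regex can still miscount, for example by matching <img text inside comments or scripts, which is why a parser remains the safer choice.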

Counting the number of images on a web page with urllib