Question

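I am trying to count how many hyperlinks an HTML file contains. To do this, I want to read the HTML file in Python and search for all </a> anchor end tags. However, when I pass an HTML file to my Python code, I get this error: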

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1819: ordinal not in range(128)"

However, if I copy and paste the same text into a .txt file, my code works. My code is as follows:

def links(filename):
    infile = open(filename)
    content = infile.read()
    infile.close()
    anchorTagEnd = content.count("</a>")
    return anchorTagEnd

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Answer 1

Why not use an HTML parser to count the links in the HTML file?

Using BeautifulSoup:

from bs4 import BeautifulSoup

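def links(filename):
    # Open in binary mode so BeautifulSoup detects the encoding itself,
    # and name the parser explicitly to avoid the "no parser specified" warning.
    with open(filename, "rb") as infile:
        soup = BeautifulSoup(infile, "html.parser")
    return len(soup.find_all('a'))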

print(links("DePaul CDM - College of Computing and Digital Media.html"))

Using lxml.html:

import lxml.html

def links(filename):
    tree = lxml.html.parse(filename)
    # XPath count() returns a single float, not a list, so convert it directly
    return int(tree.xpath('count(//a)'))

print(links("DePaul CDM - College of Computing and Digital Media.html"))
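If you would rather keep the original string-counting approach, the error itself can probably be avoided by telling Python which encoding to use when reading the file. The sketch below assumes the page is saved as UTF-8 (byte 0xe2 is a common lead byte for UTF-8 punctuation such as curly quotes, but that is a guess about this particular file). The function name count_anchor_end_tags and the encoding parameter are mine, not from the question.

```python
import io


def count_anchor_end_tags(filename, encoding="utf-8"):
    # Assumption: the file is UTF-8; pass another encoding if it is not.
    # io.open accepts an encoding argument on both Python 2 and Python 3.
    # errors="replace" substitutes undecodable bytes instead of raising.
    with io.open(filename, encoding=encoding, errors="replace") as infile:
        content = infile.read()
    return content.count("</a>")
```

Since "</a>" is plain ASCII, replacing a few undecodable bytes elsewhere in the file should not change the count. Note that this counts closing tags only, so a parser is still the more robust choice for messy HTML.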
