Question

Hi, I'm trying to extract text from HTML using BeautifulSoup in Python. The code runs, but I'm not getting what I need. My code is as follows:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = BeautifulSoup(html).get_text()

The Python console reports the following. I don't understand the problem and would appreciate any help.

raw = BeautifulSoup(html).get_text()
C:/Users/muradz14/.spyder-py3/raw.py:1: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 1 of the file C:/Users/muradz14/.spyder-py3/raw.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

Answer 1

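That's just a warning. It's fairly self-explanatory: there is a small chance that the code behaves differently with a different parser, so the warning suggests that you specify which one to use. You can do as it suggests:

raw = BeautifulSoup(html, features="lxml").get_text()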

Note that the available parsers differ between systems. For me it was features="html.parser".
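
To illustrate the point above, here is a minimal sketch that names the parser explicitly so the warning is not raised. It assumes the bs4 package is installed and uses Python's built-in "html.parser", which needs no extra install. The inline HTML string is a made-up stand-in for the downloaded page, so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page content returned by urlopen(url).read().
html = "<html><body><h1>Health</h1><p>Some article text.</p></body></html>"

# Passing the parser name explicitly avoids the "No parser was explicitly
# specified" UserWarning and keeps behaviour consistent across systems.
soup = BeautifulSoup(html, "html.parser")

# separator and strip are optional; they tidy the whitespace between text nodes.
raw = soup.get_text(separator=" ", strip=True)
print(raw)
```

If lxml is installed and preferred, passing "lxml" in place of "html.parser" should work the same way for well-formed input like this.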
