Question

我正在尝试创建一个网页抓取器，我想使用BeautifulSoup这样做。我安装了BeautifulSoup 4.3.2，因为该网站称它与python 3.x兼容。我用了

pip install beautifulsoup4

安装它。但是当我跑步时

from bs4 import BeautifulSoup
import requests

url = input("Enter a URL (start with www): ")

link = "http://" + url

data = requests.get(link).content

soup = BeautifulSoup(data)

for link in soup.find_all('a'):

   print(link.get('href'))

我收到错误消息

Traceback (most recent call last):
File "/Users/user/Desktop/project.py", line 1, in <module>
  from bs4 import BeautifulSoup
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages   /bs4/__init__.py", line 30, in <module>
from .builder import builder_registry, ParserRejectedMarkup
File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/bs4/builder /__init__.py", line 308, in <module>
from .. import _htmlparser
  ImportError: cannot import name _htmlparser

Answer 1

我认为源文件中可能存在错误，特别是在这里：

  File "/Library/Frameworks/Python.framework/Versions/3.1/lib/python3.1/site-packages/bs4/builder /__init__.py", line 308, in <module>
  from .. import _htmlparser

在我的安装中，bs4/builder /__init__.py

的第308行

  from . import _htmlparser

您可以在那里修复它，看看bs4是否会成功导入。不确定您安装了哪个版本的bs4，但我的版本是4.3.2，而_htmlparser.py也是bs4/builder

Answer 2

刚刚安装了Python 3.x并测试了最新的BS4下载。没工作。但是，可以在这里找到修复：https://github.com/il-vladislav/BeautifulSoup4（GitHub用户Il Vladislav，无论你是谁）。

下载zip，覆盖bs4下载中的BeautifulSoup文件夹，然后通过python setup.py install重新安装。现在就可以使用了，正如你在下面的截图中看到的那样，在完全工作之前出现错误。

<强>代码：

from bs4 import BeautifulSoup
import requests

url = input("Enter a URL (start with www): ")
link = "http://" + url
data = requests.get(link).content
soup = BeautifulSoup(data)

for link in soup.find_all('a'):
   print(link.get('href'))

<强>截图：

enter image description here

发现相关SO主题here，表明BS4 完全与Python 3.x完全兼容（即使在2年后）。

Answer 3

我刚编辑了bs4/builder/_htmlparser.py以便

A）未导入HTMLParseError

from html.parser import HTMLParser

B）定义了HTMLParseError类

class HTMLParseError(Exception):
    """Exception raised for all parse errors."""

    def __init__(self, msg, position=(None, None)):
        assert msg
        self.msg = msg
        self.lineno = position[0]
        self.offset = position[1]

    def __str__(self):
        result = self.msg
        if self.lineno is not None:
            result = result + ", at line %d" % self.lineno
        if self.offset is not None:
            result = result + ", column %d" % (self.offset + 1)
        return result

这可能不是最好的，因为不会引发HTMLParserError。但！你的例外将是未被捕获的，无论如何都是未处理的。

BeautifulSoup4在Python 3.x中抛出一个错误

3 个答案: