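I just installed python, mplayer, beautifulsoup and sipie on my Ubuntu 10.04 machine to run Sirius. I followed some documentation that looked straightforward, but I ran into a problem. I'm not familiar with Python, so this may be out of my league.
I was able to get everything installed, but running sipie then gives this: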
/usr/bin/Sipie/Sipie/Config.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Traceback (most recent call last):
  File "/usr/bin/Sipie/sipie.py", line 22, in <module>
    Sipie.cliPlayer()
  File "/usr/bin/Sipie/Sipie/cliPlayer.py", line 74, in cliPlayer
    completer = Completer(sipie.getStreams())
  File "/usr/bin/Sipie/Sipie/Factory.py", line 374, in getStreams
    streams = self.tryGetStreams()
  File "/usr/bin/Sipie/Sipie/Factory.py", line 298, in tryGetStreams
    soup = BeautifulSoup(data)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 100, column 3
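I looked at those files and line numbers, but since I'm not familiar with Python they don't mean much to me. Any suggestions on what to do next?
Answer 0 (score: 15)
Assuming you are using BeautifulSoup4, I found something about this in the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib; Python's built-in HTML parser is just not very good in older versions.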
I tried this and it worked well for me, just as it did for @Joshua:
soup = BeautifulSoup(r.text, 'html5lib')
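As a rough illustration of the documentation's advice, here is a minimal Python 3 sketch (it will not run on the Python 2.6 from the question, since it relies on importlib.util). The helper names builtin_parser_is_weak, pick_parser and make_soup are made up for this example and are not part of bs4. Preferring html5lib over lxml is my own assumption, not something the documentation prescribes.

```python
import importlib.util
import sys


def builtin_parser_is_weak(version=sys.version_info):
    """True for the interpreter versions the bs4 docs warn about."""
    v = tuple(version[:3])
    if v[0] == 2:
        return v < (2, 7, 3)
    return v < (3, 2, 2)


def pick_parser(is_installed=None):
    """Return the name of a third-party parser if one is installed,
    otherwise fall back to the parser bundled with Python."""
    if is_installed is None:
        is_installed = lambda name: importlib.util.find_spec(name) is not None
    for name in ("html5lib", "lxml"):  # preference order is an assumption
        if is_installed(name):
            return name
    return "html.parser"


def make_soup(markup):
    """Parse markup with the chosen parser (requires bs4 to be installed)."""
    # Imported lazily so the two helpers above work even without bs4.
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, pick_parser())
```

With html5lib installed, make_soup(r.text) should behave like the one-liner above. builtin_parser_is_weak((2, 6, 5)) returns True, which matches the Python 2.6 setup in the question.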
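Answer 1 (score: 8)
The issues you are encountering are pretty common, and they have to do specifically with malformed HTML. In my case, there was an HTML element with a badly quoted attribute value. I actually ran into this issue today, which is how I came across your post. I was finally able to resolve it by parsing the HTML with html5lib before handing it off to BeautifulSoup 4.
First, you need to run: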
sudo easy_install bs4
sudo apt-get install python-html5lib
Then, run this sample code:
from bs4 import BeautifulSoup
import html5lib
from html5lib import sanitizer
from html5lib import treebuilders
import urllib
url = 'http://the-url-to-scrape'
fp = urllib.urlopen(url)
# Create an html5lib parser. Not sure if the sanitizer is required.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"), tokenizer=sanitizer.HTMLSanitizer)
# Load the source file's HTML into html5lib
html5lib_object = parser.parse(fp)
# In theory we shouldn't need to convert this to a string before passing to BS. Didn't work passing directly to BS for me however.
html_string = str(html5lib_object)
# Load the string into BeautifulSoup for parsing.
soup = BeautifulSoup(html_string)
for content in soup.findAll('div'):
    print content
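If you have any questions about this code or need more specific guidance, let me know. :)
Answer 2 (score: 2)
Newer versions of BeautifulSoup use HTMLParser rather than SGMLParser (because SGMLParser was removed from the Python 3.0 standard library). As a result, BeautifulSoup can no longer properly handle many malformed HTML documents, which is what I believe you are running into here.
The solution to your problem will most likely be to uninstall BeautifulSoup and install an older version (which still works with Python 2.6 on Ubuntu 10.04 LTS):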
sudo apt-get remove python-beautifulsoup
sudo easy_install -U "BeautifulSoup==3.0.7a"
Note that this temporary workaround will not work with Python 3.0 (which may become the default in future Ubuntu releases).
Answer 3 (score: 2)
Command line:
$ pip install beautifulsoup4
$ pip install html5lib
Python 3:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'http://www.example.com'
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html5lib')
links = soup.findAll('a')
for link in links:
    # .get() returns None instead of raising KeyError when an <a> tag has no href
    print(link.string, link.get('href'))
Answer 4 (score: -2)
Look at line 100, column 3 of the "data" referred to on line 298 of the file "/usr/bin/Sipie/Sipie/Factory.py".
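One way to do that inspection is sketched below. show_position is a hypothetical helper written for this answer, not part of Sipie or BeautifulSoup. It assumes the position in the error message follows HTMLParser.getpos(), where the line number is 1-based and the column offset is 0-based.

```python
def show_position(data, line, column, context=2):
    """Return the lines around (line, column) with a caret under the column.

    `line` is 1-based and `column` is 0-based, as in HTMLParser.getpos().
    """
    lines = data.splitlines()
    first = max(line - context, 1)
    last = min(line + context, len(lines))
    out = []
    for number in range(first, last + 1):
        # "%4d: " is a fixed 6-character prefix for line numbers below 10000.
        out.append("%4d: %s" % (number, lines[number - 1]))
        if number == line:
            out.append(" " * (6 + column) + "^")
    return "\n".join(out)
```

For the traceback in the question, you could temporarily add print(show_position(data, 100, 3)) just before the soup = BeautifulSoup(data) line in Factory.py to see which tag the parser rejects. The helper avoids f-strings and other newer syntax, so it should also run on Python 2.6.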