I have this function to read a saved HTML file stored on my computer:
```python
def get_doc_ondrive(self, mypath):
    the_file = open(mypath, "r")
    line = the_file.readline()
    if (line != "") and (line != None):
        self.soup = BeautifulSoup(line)
    else:
        print "Something is wrong with line:\n\n%r\n\n" % line
        quit()
    print "\t\t------------ line: %r ---------------\n" % line
    while line != "":
        line = the_file.readline()
        print "\t\t------------ line: %r ---------------\n" % line
        if (line != "") and (line != None):
            print "\t\t\tinner if executes: line: %r\n" % line
            self.soup.feed(line)
    self.get_word_vector()
    self.has_doc = True
```
Doing `self.soup = BeautifulSoup(open(mypath, "r"))` returns None, whereas feeding it line by line at least crashes and gives me something to go on. I edited the functions listed in the traceback in BeautifulSoup.py and sgmllib.py (adding debug prints), and when I try to run it I get:
```
me@GIGABYTE-SERVER:code$ python test_docs.py
in sgml.finish_endtag
in _feed: inDocumentEncoding: None, fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: '<!DOCTYPE html>\n'
in UnicodeDammit._detectEncoding: xml_data: '<!DOCTYPE html>\n'
in sgmlparser.feed: rawdata: '', data: u'<!DOCTYPE html>\n' self.goahead(0)
------------ line: '<!DOCTYPE html>\n' ---------------
------------ line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' ---------------
inner if executes: line: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.feed: rawdata: u'', data: '<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 61,literal:0
in sgmlparser.parse_starttag: i: 61, __starttag_text: None, start_pos: 61, rawdata: u'<html dir="ltr" class="client-js ve-not-available" lang="en"><head>\n'
------------ line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' ---------------
inner if executes: line: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgmlparser.feed: rawdata: u'', data: '<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n' self.goahead(0)
in sgmlparser.goahead: end: 0,rawdata[i]: u'<', i: 0,literal:0
in sgmlparser.parse_starttag: i: 0, __starttag_text: None, start_pos: 0, rawdata: u'<meta http-equiv="content-type" content="text/html; charset=UTF-8">\n'
in sgml.finish_starttag: tag: u'meta', attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')]
in start_meta: attrs: [(u'http-equiv', u'content-type'), (u'content', u'text/html; charset=UTF-8')] declaredHTMLEncoding: u'UTF-8'
in _feed: inDocumentEncoding: u'UTF-8', fromEncoding: None, smartQuotesTo: 'html'
in UnicodeDammit.__init__: markup: None
in UnicodeDammit._detectEncoding: xml_data: None
```
and the traceback:
```
Traceback (most recent call last):
  File "test_docs.py", line 28, in <module>
    newdoc.get_doc_ondrive(testeee)
  File "/home/jddancks/Capstone/Python/code/pkg/vectors/DOCUMENT.py", line 117, in get_doc_ondrive
    self.soup.feed(line)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 139, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 298, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.7/sgmllib.py", line 348, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.7/sgmllib.py", line 385, in handle_starttag
    method(attrs)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1618, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1172, in _feed
    smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1776, in __init__
    self._detectEncoding(markup, isHTML)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1922, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer
```
So this line:

```html
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
```

is somehow causing `None` to be parsed inside UnicodeDammit. Why is this happening?
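The final TypeError can be reproduced in isolation: `_detectEncoding` runs the regex shown in the traceback over `xml_data`, and matching a compiled pattern against `None` raises exactly this error. A minimal stdlib-only sketch (the pattern is copied from the traceback; nothing from BeautifulSoup itself is needed):

```python
import re

# The pattern from BeautifulSoup's _detectEncoding, copied from the traceback.
xml_encoding_re = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')

try:
    xml_encoding_re.match(None)  # xml_data is None here, not a string
except TypeError as e:
    print("TypeError:", e)
```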
Answer (score: 1)
I just finished reading the source, and I think I understand the problem. Essentially, here's how BeautifulSoup thinks things are supposed to go:

1. You call the `BeautifulSoup` constructor with your markup. It sets `self.markup` to that markup.
2. It calls `_feed`, which resets the document and parses it in the initially detected encoding.
3. While feeding itself, it finds a `meta` tag that declares a different encoding.
4. To use the new encoding, it calls `_feed` again, which re-parses `self.markup`.
5. Once the initial `_feed`, and the `_feed` it recursed into, are finished, it sets `self.markup` to `None`. (After all, everything has been parsed now; <sarcasm>who could ever need the original markup again?</sarcasm>)

But the way you're using it:

1. You construct a `BeautifulSoup` from the first line of the markup. It sets `self.markup` to that first line and calls `_feed`.
2. `_feed` sees no interesting `meta` tag in the first line, so it finishes successfully.
3. The constructor sets `self.markup` back to `None` and returns.
4. You call `feed` on your `BeautifulSoup` object, which goes straight to the `SGMLParser.feed` implementation, which `BeautifulSoup` does not override.
5. It sees an interesting `meta` tag, and calls `_feed` to parse the document in the newly declared encoding.
6. `_feed` tries to build a `UnicodeDammit` object from `self.markup`.
7. But `self.markup` is `None`, because `_feed` assumed it would only ever be called during that small window inside `BeautifulSoup`'s constructor.

The moral of the story is that `feed` is an unsupported way of sending input to `BeautifulSoup`. You have to pass it all the input at once.
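The broken sequence above can be sketched with a stdlib-only toy class (`ToySoup` and `ENCODING_RE` are hypothetical names; this mimics the constructor/feed lifecycle described here, not BeautifulSoup's real code):

```python
import re

# Stands in for the encoding-detection regex in _detectEncoding.
ENCODING_RE = re.compile(r'charset=([\w-]+)')

class ToySoup:
    """Stdlib-only toy mimicking BeautifulSoup 3's constructor/feed lifecycle."""

    def __init__(self, markup):
        self.markup = markup
        self._feed()
        self.markup = None  # original markup discarded once parsing finishes

    def _feed(self, re_detect=True):
        # Encoding detection runs a regex over self.markup; when self.markup
        # is None, this is the line that raises TypeError.
        match = ENCODING_RE.search(self.markup)
        if match and re_detect:
            # "meta" tag found: re-parse self.markup in the declared encoding.
            self._feed(re_detect=False)

    def feed(self, data):
        # Analogue of the inherited SGMLParser.feed: it parses new data but
        # never restores self.markup.
        if ENCODING_RE.search(data):
            self._feed()  # tries to re-detect the encoding -> markup is None

# One-shot construction works, even when the markup declares a charset:
ToySoup('<meta charset=UTF-8>')

# Incremental feeding hits the bug:
soup = ToySoup('<html>')
try:
    soup.feed('<meta charset=UTF-8>')
except TypeError:
    print("TypeError, just like the real traceback")
```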
As for why `BeautifulSoup(open(mypath, "r"))` returns `None`, I have no idea; I don't see a `__new__` on `BeautifulSoup`, so it seems like it would have to return a `BeautifulSoup` object.
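That reasoning can be checked against any ordinary class (`Plain` here is a hypothetical stand-in): without a custom `__new__`, calling a class can only produce an instance of that class, or raise; it can never evaluate to `None`.

```python
# Like BeautifulSoup 3, this class defines no custom __new__, so the
# constructor expression must yield a Plain instance.
class Plain(object):
    def __init__(self, markup):
        self.markup = markup

obj = Plain("<html>")
assert obj is not None
assert isinstance(obj, Plain)
```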
All this said, you probably want to use BeautifulSoup 4 instead of 3. Here's the porting guide. To support Python 3, it had to drop its dependency on SGMLParser (sgmllib was removed from Python 3's standard library), and I wouldn't be surprised if whatever bug you're hitting here was fixed during that part of the rewrite.