Question

I am writing a web crawler in Python, and it sometimes raises an HTMLParseError:

junk characters in start tag: u'\u201dTPL_password_1\u201d\r\n\t\t', at line 21285, column 6

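It says the error was found at line 21285. Does that mean line 21285 of the HTML source? If not, how can I find out which piece of HTML caused the error, and which URL was being parsed at the time?

My parser class can be simplified to the following: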

from HTMLParser import HTMLParser, HTMLParseError  # Python 2 module

class ParsePage(HTMLParser):

    """Parse the given page content using HTMLParser"""

    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):

        # I tried adding `try...except` here to inspect the current tag and
        # attrs, but Python never enters the except block. Why? The error
        # message says the error was found in a start tag.

        try:
            pass  # ... code that processes the start tag ...
        except HTMLParseError, e:
            print "e: ", e, '\n'
            print 'tag: ', tag, '\n'
            print 'attrs: ', attrs, '\n'
            exit(1)

    def handle_endtag(self, tag):
        pass  # ... code that processes end tags ...



geturl = ParsePage()

# I can catch the HTMLParseError if I wrap the following line in
# `try...except`, but I don't know how to get useful information out of
# the exception once I have caught it.
geturl.feed(cur_page)

Thanks for your help.

Answer 1

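Well, it tells you the line on which the error was found. What more do you need?

Besides, what does the URL have to do with this? You pass the HTML page to feed as a string, so HTMLParser has no idea where it came from.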
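Because the parser only ever sees a string, the code that calls feed is the place to remember the URL. The following is a minimal sketch of that idea, not the asker's actual crawler: `PageParseError` and `feed_page` are hypothetical names, and the broad `except Exception` is an assumption made so the sketch does not depend on a particular Python version. It relies on HTMLParseError carrying `lineno` and `offset` attributes, and falls back to `None` for exceptions that lack them.

```python
class PageParseError(Exception):
    """Hypothetical wrapper that records which URL failed to parse."""

    def __init__(self, url, lineno, offset, original):
        Exception.__init__(
            self,
            "%s: line %s, offset %s: %s" % (url, lineno, offset, original))
        self.url = url
        self.lineno = lineno
        self.offset = offset
        self.original = original


def feed_page(parser, url, page):
    """Feed `page` to `parser`; on failure, re-raise with the URL attached."""
    try:
        parser.feed(page)
    except Exception as exc:
        # HTMLParseError exposes lineno/offset; other exceptions may not,
        # so read them defensively.
        raise PageParseError(url,
                             getattr(exc, "lineno", None),
                             getattr(exc, "offset", None),
                             exc)
```

The crawl loop could then call something like `feed_page(geturl, url, cur_page)`, where `url` is whatever variable holds the address `cur_page` was downloaded from, and log the `url` and `lineno` attributes of the wrapper when it is raised.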

Answer 2

How can I find out which piece of HTML caused the error?

junk characters in start tag: u'\u201dTPL_password_1\u201d\r\n\t\t', at line 21285, column 6

It is line 21285 of the current HTML page.
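To actually look at that line, the page string can be split and indexed by the reported line number. The helper below is a rough sketch under a few assumptions: `show_error_context` is a made-up name, line numbers are treated as 1-based, and the page is split on "\n" only, since that is how HTMLParser counts lines. Each line is shown with `%r` so that invisible characters such as `\r` or `\t` stand out.

```python
def show_error_context(page, lineno, context=2):
    """Return the lines around `lineno` (1-based) of `page` as one string."""
    # Split on "\n" only, so the numbering matches the parser's own count.
    lines = page.split("\n")
    first = max(lineno - context, 1)
    last = min(lineno + context, len(lines))
    out = []
    for n in range(first, last + 1):
        marker = ">>" if n == lineno else "  "
        # %r makes stray control characters in the line visible.
        out.append("%s %6d: %r" % (marker, n, lines[n - 1]))
    return "\n".join(out)
```

In the `except` block around `geturl.feed(cur_page)`, printing `show_error_context(cur_page, e.lineno)` should display the offending start tag. Note that the column in the message is one more than the exception's `offset` attribute.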

And what is the URL currently being parsed?

Which link are you parsing?

geturl.feed(cur_page)

cur_page is your current page.
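As for why a `try...except` inside handle_starttag never fires: as far as I can tell, the parser raises the error itself while scanning the malformed tag, before it ever calls the handler. A possible workaround is to record the position of the last start tag that did parse, using the parser's own `getpos()` and `get_starttag_text()` methods. The class below is only a sketch; `TracingParser` and `last_tag` are illustrative names, and the import fallback is there so it runs on either Python 2 or Python 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class TracingParser(HTMLParser):
    """Remember where the most recent well-formed start tag was seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.last_tag = None

    def handle_starttag(self, tag, attrs):
        # getpos() returns (line number, offset) of the tag being handled;
        # get_starttag_text() returns its raw source text.
        self.last_tag = (self.getpos(), self.get_starttag_text())
```

If feed raises, `last_tag` then holds the last tag that was handled successfully, which gives a second reference point besides the line number in the error message.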

How to find the offending line in the HTML when an HTMLParseError occurs
