Question

I am trying to read HTML content and extract only the data (for example, the lines of text in a Wikipedia article). Here is my Python code:

import urllib.request
from html.parser import HTMLParser

# Collects every text node seen by the parser
urlText = []


# Define the HTML parser
class parseText(HTMLParser):
    def handle_data(self, data):
        # Called for every run of text between tags
        print(data)
        if data != '\n':
            urlText.append(data)


def main():
    thisurl = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    # Create an instance of the HTML parser (the class above)
    lParser = parseText()
    # Fetch the page; read() returns the raw HTML as bytes
    with urllib.request.urlopen(thisurl) as url:
        htmlAsBytes = url.read()
    # print(htmlAsBytes)
    htmlAsString = htmlAsBytes.decode(encoding="utf-8")
    # print(htmlAsString)
    # Feed the HTML into the parser; handle_data is called implicitly
    lParser.feed(htmlAsString)
    lParser.close()
    # for item in urlText:
    #     print(item)


if __name__ == "__main__":
    main()

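I fetch the HTML from the web page, and if I print the bytes object returned by the read() method, it looks like I received all of the page's HTML. However, when I try to parse that content to strip the tags and store only the readable data, I don't get the result I expect.

The problem is that, in order to use the parser's feed() method, the bytes object has to be converted to a string. To do that you use the decode() method, which takes the encoding to use for the conversion. If I print the decoded string, the output does not contain the data itself (the useful, readable data I am trying to extract). Why does this happen, and how can I fix it?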
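
A quick way to check the decoding step in isolation is to decode a small, known HTML fragment and compare it with the original bytes. The sketch below uses a made-up fragment (html_as_bytes is a hypothetical stand-in for what url.read() returns, not the real Wikipedia page) and assumes the content is UTF-8:

```python
# Hypothetical sample standing in for the bytes returned by url.read().
html_as_bytes = (
    "<html><body><p>Python is a programming language.</p>"
    "<p>Caf\u00e9</p></body></html>"
).encode("utf-8")

# decode() turns the bytes into a str using the given encoding.
html_as_string = html_as_bytes.decode("utf-8")

# Encoding the string again should give back exactly the same bytes,
# so nothing was dropped by the conversion.
round_trip_ok = html_as_string.encode("utf-8") == html_as_bytes

# The readable text is still inside the decoded string.
text_present = "Python is a programming language." in html_as_string

print(round_trip_ok, text_present)
```

If both checks pass on the real page as well, the missing text is more likely a matter of how the output is printed or parsed than of decode() itself.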

Note: I am using Python 3.

Thanks for your help.

Answer 1

Well, I ended up using BeautifulSoup to do the job, as Alden suggested, but I still don't know why the decoding step mysteriously got rid of the data.
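
For comparison, here is a minimal standard-library sketch of the same kind of text extraction. It is not the BeautifulSoup code mentioned above; VisibleTextParser and extract_text are hypothetical names introduced only for illustration. It assumes the page is UTF-8 and that the unwanted output comes from script and style elements, which is one plausible reason the printed result looked like it contained no readable data (the thread does not confirm this):

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []        # text kept per parser instance, not in a global
        self._skip_depth = 0    # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Drop whitespace-only nodes and anything inside script/style.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html_bytes, encoding="utf-8"):
    """Decode the bytes, parse them, and return the visible text chunks."""
    parser = VisibleTextParser()
    parser.feed(html_bytes.decode(encoding))
    parser.close()
    return parser.chunks


# Made-up fragment used only to exercise the sketch.
sample = (
    b"<html><head><style>p {color: red}</style>"
    b"<script>var x = 1;</script></head>"
    b"<body><h1>Python</h1><p>A <b>programming</b> language.</p></body></html>"
)
print(extract_text(sample))
```

Keeping the collected text on the parser instance avoids the module-level list, so several pages can be parsed independently.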

Converting a Python bytes object to a string makes the data in the HTML disappear
