Question

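I'm a recent graduate who has just started teaching myself web scraping in Python. For fun, I'm trying to build a script that stores the names, episodes, and episode descriptions of anime shows from a particular website, using Python's requests, re, and other related modules.

I've managed to get the web-scraping side of the script working, that is, opening the necessary URLs and retrieving the relevant data. However, one major problem I haven't been able to get past is the differing encodings and the decoding of special HTML characters contained in some of the show names.

After going through several Stack Overflow posts, I came up with the following solution to decode the HTML characters and fix the encoding: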

try:
    # Python 2.6-2.7 
    from HTMLParser import HTMLParser

except ImportError:
    # Python 3
    from html.parser import HTMLParser

decodeHTMLSpecialChar   = HTMLParser()

def whatisthis(s):     
    # This function checks to see if a given string is an ordinary string, unicode encoded string or not a string at all

    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"

def DecodeHTMLAndFixEncoding(string_data):
    string_data = decodeHTMLSpecialChar.unescape(string_data)
    encoding_check = whatisthis(string_data)

    if encoding_check != "ordinary string": 
        string_data = string_data.encode("utf-8")

    return string_data

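I got all of the code above from various Stack Overflow answers.

Although this fixed most of the encoding problems I was having, today I found further issues that I can't figure out how to solve.

Below are two different strings that either cause a Python string encoding error or whose special HTML characters are not converted correctly.

Problem case 1: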

string1 = "Musekinin Galaxy☆Tylor" 
print(DecodeHTMLAndFixEncoding(string1))
# ...Results in "Musekinin Galaxy☆Tylor". However, because I store the name as a key in a dictionary to check whether it has already been stored, referencing the key gives the following error:

Error Type: <type 'exceptions.KeyError'>
Error Contents: ('Musekinin Galaxy\xe2\x98\x86Tylor',)

The dictionary in which I store the data has the following format:

data = {show name (Key):
           {
            description (Key2) : "Overall Description for the show"
            show episode name (Key) : "Description for episode"
           }
       }
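
The bytes \xe2\x98\x86 in the KeyError above are the UTF-8 encoding of ☆, which suggests the key may be stored in one string form and looked up in another. The following is a minimal sketch of one way to keep keys consistent, assuming Python 3; the helper name to_text and the sample dictionary contents are hypothetical and not part of the original script.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text (str), decoding it first if it is bytes.

    Hypothetical helper: the idea is to normalize every show name the
    same way before it is stored and before it is looked up.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


data = {}

# Store the key as text.
data[to_text("Musekinin Galaxy\u2606Tylor")] = {
    "description": "Overall description for the show",
}

# Look the key up after the same normalization, even if the name
# arrives as UTF-8 encoded bytes.
raw_name = b"Musekinin Galaxy\xe2\x98\x86Tylor"
print(to_text(raw_name) in data)  # prints True
```

Under this assumption, the encode("utf-8") step in DecodeHTMLAndFixEncoding would be dropped, and encoding would happen only when the text is written to a file or sent over the network.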

Problem case 2:

string2 = "Knight&#039;s &amp;#038; Magic"
print(DecodeHTMLAndFixEncoding(string2))
# Results in: "Knight's &#038; Magic"

# Although this partly works, it should have resulted in "Knight's & Magic".
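
The string in problem case 2 appears to be escaped twice (&amp;#038; decodes to &#038;, which in turn decodes to &), so a single unescape pass leaves one layer behind. A minimal sketch, assuming Python 3.4 or later where html.unescape is in the standard library; the function name unescape_fully and the max_rounds limit are hypothetical choices for illustration.

```python
import html


def unescape_fully(text, max_rounds=5):
    """Apply html.unescape repeatedly until the text stops changing.

    max_rounds is a safety limit so that unusual input cannot keep the
    loop running indefinitely.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


print(unescape_fully("Knight&#039;s &amp;#038; Magic"))  # prints Knight's & Magic
```

One caveat: repeated unescaping will also decode text that was meant to contain a literal sequence such as &amp;, so it is only appropriate when the source is known to double-escape its output.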

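I've tried my best to explain the problems I'm facing. My main question is basically whether there is a simple solution that:

First, lets me remove symbols, emoji, and the like from a string, so that it can be used as a dictionary key and referenced later without any problems,
Second, decodes special HTML character encodings better than the HTML parser does, as in the issue shown in problem case 2.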
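
For the first point, one standard-library option is to filter characters by their Unicode category. This is a minimal sketch, assuming Python 3; the function name strip_symbols is hypothetical, and the choice of which categories to drop is an assumption that may need adjusting.

```python
import unicodedata


def strip_symbols(text):
    """Drop characters whose Unicode general category starts with S or C.

    S* covers symbols (including ☆ and most emoji); C* covers control,
    format, and unassigned characters. Letters, digits, spaces, and
    punctuation such as ' and & are kept.
    """
    kept = [ch for ch in text if unicodedata.category(ch)[0] not in ("S", "C")]
    return "".join(kept)


print(strip_symbols("Musekinin Galaxy\u2606Tylor"))  # prints Musekinin GalaxyTylor
```

Note that this also removes characters such as + and $, which are classified as symbols. Python dictionary keys can contain any characters, so stripping is optional as long as the key is stored and looked up in the same form.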

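My last request is that I'd prefer to use the default libraries or modules that ship with stock Python rather than external libraries such as BeautifulSoup. However, if you think there are useful external modules that could help me, please feel free to show me those.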

Python web scraping character encoding problem
