Question

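I am trying to fetch text containing special characters from a website, so the string Python returns is full of "\x" escapes. The encoding seems to be wrong, though. For example, when fetching: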

import urllib2
th = urllib2.urlopen('http://norse.ulver.com/dct/zoega/th.html')

The <h1> line of the page should contain the letter "Þ", which has the UTF-8 byte sequence C3 9E and the Unicode code point DE (see http://www.fileformat.info/info/charset/UTF-8/list.htm).

Instead, I get

'<h1>\xc3\x9e</h1>'

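with the byte value split into two parts, so that when I write the line to a file and then open it with a Unicode encoding I get "Ã" instead of "Þ".

How can I force Python to encode such a character as \uC39E or \xde rather than \xc3\x9e?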

Answer 1

This is the correct UTF-8 byte encoding of U+00DE, which takes two bytes to represent (\xc3 and \x9e), but you need to decode it to Unicode to see the Unicode code points:

>>> '<h1>\xc3\x9e</h1>'.decode('utf8')
u'<h1>\xde</h1>'

The above is a Unicode string showing the correct Unicode code point. Printing it on a UTF-8 console:

>>> print '<h1>\xc3\x9e</h1>'.decode('utf8')
<h1>Þ</h1>

If you decode with the wrong encoding, you get different Unicode code points, in this case U+00C3 and U+017E. \xc3 is the escape code in a Unicode string for code points < U+0100, whereas \u017e is the escape code for code points < U+10000:

>>> '<h1>\xc3\x9e</h1>'.decode('cp1252')
u'<h1>\xc3\u017e</h1>'
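
For readers on Python 3, the same behaviour can be sketched as below. This is only an illustration under a few assumptions: the bytes are hard-coded instead of fetched from the site, the byte-string literal uses the Python 3 b'...' form, and the file name th.html and the temporary directory are hypothetical. It decodes the bytes correctly and incorrectly, then writes the decoded text to a file with an explicit encoding so that it reads back as "Þ".

```python
import io
import os
import tempfile

# Bytes as the server would send them: UTF-8 encoding of U+00DE inside <h1>.
raw = b'<h1>\xc3\x9e</h1>'

text = raw.decode('utf-8')    # correct: two bytes become one code point
wrong = raw.decode('cp1252')  # wrong codec: each byte becomes its own character

print(hex(ord(text[4])))                  # 0xde
print([hex(ord(c)) for c in wrong[4:6]])  # ['0xc3', '0x17e']

# Write and re-read with an explicit encoding (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), 'th.html')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

Naming the encoding both when writing and when reading is what keeps the two bytes from being reinterpreted as two separate characters.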

Python getting the wrong encoding for UTF-8 characters?
