Question

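I want to read an .html file as raw text and replace instances of a substring that contains Unicode characters with another substring. Suppose the file mm03.html contains just one line of text: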

<span style='font-size:14.0pt'>«test»</span>

I want to read mm03.html, get its raw text as a string, and call replace so that the output looks like this:

<span style='font-size:14.0pt'>TEST</span>

On my first attempt, I wrote the following code...

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

...expecting it to print the original line shown above first, followed by the second line. Instead, it printed the first line twice.

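Okay. So it's probably a Unicode decoding problem, right? Maybe, but when I modify the code according to the Unicode-related advice spread throughout this site, problems of various shades persist. Additionally, I can get the desired behavior by explicitly defining htmlBase as...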

htmlBase = """<span style='font-size:14.0pt'>«test»</span>"""

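...which leads me to believe there is something about reading HTML files in Python that I don't know. I tried opening mm03.html in 'w' mode, but that doesn't seem to work and tends to clobber the original file. Perhaps a string read from a read-only file is itself read-only, but I may be wrong, and that wouldn't make much sense.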
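One possible explanation, assuming mm03.html was saved in a single-byte encoding such as Windows-1252 (the byte 0xab in the tracebacks further down is consistent with that) while the script itself is saved as UTF-8: the guillemets are stored as different bytes in the two places, so a byte-for-byte replace never finds a match. A minimal sketch, with the file contents simulated in memory rather than read from disk:

```python
# -*- coding: utf-8 -*-
# Assumption: the file on disk is Windows-1252 and the script source is UTF-8.
text = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>"

file_bytes = text.encode("cp1252")                    # what read() would return
literal_bytes = u"\u00abtest\u00bb".encode("utf-8")   # bytes of "«test»" in a UTF-8 script

# « is one byte (0xAB) in cp1252 but two bytes (0xC2 0xAB) in UTF-8,
# so the literal never occurs in the file's bytes and replace() changes nothing.
unchanged = file_bytes.replace(literal_bytes, b"TEST")
print(unchanged == file_bytes)

# A string literal typed into the script is already UTF-8, so there the bytes match.
inline_bytes = text.encode("utf-8")
replaced = inline_bytes.replace(literal_bytes, b"TEST")
print(replaced == b"<span style='font-size:14.0pt'>TEST</span>")
```

Under these assumptions both comparisons hold, which would account for the replace working on the inline string but not on the file contents.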

Here are a few script/output pairs I have already chewed through.

Adding the character 'u', outside the quotes, before the string I want to replace

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

<span style='font-size:14.0pt'>½test╗</span>
Traceback (most recent call last):
  File "test2.py", line 6, in <module>
    htmlFill = htmlFill.replace(u"┬½test┬╗","TEST")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
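A likely reading of this traceback: in Python 2, calling replace() on a byte string with a unicode argument makes the interpreter decode the byte string implicitly with the ascii codec, and that fails at the first byte above 0x7F. The sketch below performs the same decode explicitly. Here raw is an in-memory stand-in for the file contents, assumed to be Windows-1252, and the console is assumed to use code page 437, which would explain the ½ and ╗ in the printed line.

```python
# -*- coding: utf-8 -*-
# Assumption: raw stands in for the bytes read from mm03.html (Windows-1252).
raw = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>".encode("cp1252")

try:
    raw.decode("ascii")   # what Python 2 attempts implicitly when str meets unicode
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error.start)   # index of the first byte the ascii codec rejects

# Assumption: a cp437 console. The same bytes rendered there show up as ½test╗.
console_view = raw[31:37].decode("cp437")
print(console_view == u"\u00bdtest\u2557")
```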

Applying .decode('utf-8') to the string returned by .read()

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
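The "invalid start byte" message suggests the file is not UTF-8 at all: 0xAB can only appear in UTF-8 as a continuation byte, whereas in Latin-1 and Windows-1252 it is « on its own. A minimal sketch of decoding with a fallback follows. The helper decode_html and its list of candidate encodings are hypothetical, and raw is again an in-memory stand-in for the file contents.

```python
# -*- coding: utf-8 -*-
# Assumption: raw stands in for the bytes read from mm03.html (Windows-1252).
raw = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>".encode("cp1252")


def decode_html(data, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each candidate encoding in order."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


text, used = decode_html(raw)

# Both operands are now text (unicode), so replace() compares characters, not bytes.
result = text.replace(u"\u00abtest\u00bb", u"TEST")
print(used)
print(result)
```

Guessing encodings this way is only a heuristic; if the file's real encoding is known, passing it directly is more reliable.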

Applying .encode('utf-8') to the string returned by .read()

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)

Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte

Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)

Answer 1

You need to decode the string you want to replace before passing it to str.replace(). This worked for me:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
htmlFill = codecs.decode(htmlFill,'utf-8')
substr = codecs.decode("«test»",'utf-8')
htmlFill = htmlFill.replace(substr,"TEST")
print htmlFill
htmlBase.close()
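The code above decodes the file contents as UTF-8, so it presumably relies on mm03.html being saved as UTF-8. The tracebacks in the question (byte 0xab rejected as an invalid start byte) suggest the asker's file is in a single-byte encoding instead. A minimal sketch of the same idea using io.open, which decodes while reading, with Windows-1252 as the assumed encoding and a temporary file standing in for mm03.html:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Assumption: the real file is Windows-1252; a temp file stands in for mm03.html.
path = os.path.join(tempfile.mkdtemp(), "mm03.html")
with io.open(path, "w", encoding="cp1252") as f:
    f.write(u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>")

# io.open decodes on the fly, so read() returns text rather than raw bytes.
with io.open(path, "r", encoding="cp1252") as f:
    html = f.read()

# Text is replaced with text; no implicit ascii decoding is involved.
html = html.replace(u"\u00abtest\u00bb", u"TEST")
print(html)
```

If the file really is UTF-8, changing the encoding argument to "utf-8" should be the only edit needed.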

Using the .replace() method on a string read from an HTML file containing Unicode
