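I want to read an .html file as raw text and replace instances of a substring containing Unicode characters with another substring. Suppose the file mm03.html contains only one line of text: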
<span style='font-size:14.0pt'>«test»</span>
I want to read mm03.html, parse its raw text as a string, and call replace so that the output looks like this:
<span style='font-size:14.0pt'>TEST</span>
On my first attempt, I wrote the following code...
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
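...expecting it to print the original line listed above first, followed by the second line. Instead, it printed the first line twice.
OK. So it's probably a Unicode decoding problem, right? Maybe, but when I modified the code according to the Unicode-related advice found all over this site, the problem persisted in different shades. Moreover, the desired behavior can be achieved by explicitly defining htmlBase as...
htmlBase = """<span style='font-size:14.0pt'>«test»</span>"""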
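...which leads me to believe there is something I don't know about reading html files in Python. I tried opening mm03.html in 'w' mode, but that doesn't seem to work and tends to destroy the original file. My guess was that a string read from a file opened read-only might itself be read-only, but I could be wrong, and that doesn't make much sense.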
Here are several script/output pairs I have already chewed through.
Adding the character 'u' (without the quotes) before the string I want to replace
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:
<span style='font-size:14.0pt'>½test╗</span>
Traceback (most recent call last):
  File "test2.py", line 6, in <module>
    htmlFill = htmlFill.replace(u"«test»","TEST")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
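The byte 0xab at position 31 and the ½test╗ in the printed output both hint at what is on disk. The following is a minimal sketch, assuming mm03.html was saved in a single-byte Windows encoding such as cp1252 and that the console uses code page 437; neither is confirmed by the question. It shows why a UTF-8 encoded "«test»" literal would never match the file's bytes, which would also explain the first script printing the same line twice.

```python
# -*- coding: utf-8 -*-
# Assumption: the file holds this line encoded as cp1252, not UTF-8.
line = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>"

raw_cp1252 = line.encode("cp1252")  # bytes as they would sit on disk

# In cp1252 the opening guillemet is the single byte 0xAB; its index is
# the position named in the traceback.
offset = raw_cp1252.index(b"\xab")

# Decoding the same bytes with code page 437 (assumed console encoding)
# turns 0xAB and 0xBB into different glyphs.
as_console = raw_cp1252.decode("cp437")

# With a utf-8 coding line, a plain "«test»" literal in Python 2 is the
# UTF-8 byte sequence below, which does not occur in the cp1252 bytes.
target_utf8 = u"\u00abtest\u00bb".encode("utf-8")
found = target_utf8 in raw_cp1252
```

Here offset comes out as 31, as_console contains ½test╗, and found is False, so a byte-level replace has nothing to change.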
Applying .decode('utf-8') to the string returned by .read()
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:
Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
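"Invalid start byte" means the data is not valid UTF-8 at that point: 0xAB has the bit pattern of a UTF-8 continuation byte, so it cannot begin a character. A small sketch of that, where the byte string is an assumed reconstruction of the file's contents and cp1252 is only a guess at the real encoding (latin-1 would decode these particular bytes the same way):

```python
# Assumed contents of mm03.html as raw bytes (0xAB and 0xBB are the guillemets).
raw = b"<span style='font-size:14.0pt'>\xabtest\xbb</span>"

# Decoding as UTF-8 fails at the first non-ASCII byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# UTF-8 continuation bytes look like 10xxxxxx; 0xAB is 10101011.
is_continuation = (0xAB & 0xC0) == 0x80

# Decoding with the encoding the file was (assumed to be) saved in works.
text = raw.decode("cp1252")
```

After this runs, error.start is 31 and text holds the line with real « and » characters.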
Applying .encode('utf-8') to the string returned by .read()
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:
Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
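A decode error from a call to .encode() looks odd, but in Python 2 calling .encode() on a byte string first decodes it implicitly with the default codec (ascii) and only then encodes the result. The sketch below spells out that hidden step; py2_style_encode is a hypothetical helper written for illustration, and the byte string is an assumed copy of the file's contents.

```python
# Assumed contents of mm03.html as raw bytes.
raw = b"<span style='font-size:14.0pt'>\xabtest\xbb</span>"

def py2_style_encode(byte_string, encoding):
    """Hypothetical helper: mimic Python 2's str.encode on a byte string."""
    # The implicit step: decode with the default 'ascii' codec first.
    return byte_string.decode("ascii").encode(encoding)

try:
    py2_style_encode(raw, "utf-8")
    failure = None
except UnicodeDecodeError as exc:
    failure = (exc.encoding, exc.start)

# Naming the source encoding explicitly (cp1252 is an assumption) avoids
# the implicit ascii step.
converted = raw.decode("cp1252").encode("utf-8")
```

The same implicit ascii decode is probably what raised the earlier error from replace(u"«test»", "TEST"): mixing a byte string with a unicode argument makes Python 2 try to promote the byte string to unicode.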
Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
Output:
Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
Output:
Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
Answer 0 (score: 0)
You need to decode the string you want to replace before passing it to str.replace(). This worked for me:
# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
htmlFill = codecs.decode(htmlFill,'utf-8')
substr = codecs.decode("«test»",'utf-8')
htmlFill = htmlFill.replace(substr,"TEST")
print htmlFill
htmlBase.close()
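The snippet above decodes the file as UTF-8, so it only works when mm03.html is actually saved as UTF-8. The tracebacks in the question (byte 0xab at position 31) suggest the asker's file may be in a single-byte encoding instead. Here is a hedged sketch of the same decode-then-replace idea using io.open with an explicit encoding; replace_in_file is a hypothetical helper, cp1252 is an assumption about the file, and the demo writes its own temporary copy of mm03.html rather than touching a real one.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

def replace_in_file(path, old, new, encoding):
    """Hypothetical helper: decode the file, replace text, write it back."""
    with io.open(path, "r", encoding=encoding) as f:
        text = f.read()              # already decoded to unicode text
    text = text.replace(old, new)    # text-to-text replace, no implicit codecs
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)
    return text

# Demo on a temporary file that stands in for mm03.html (assumed cp1252).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "mm03.html")
with io.open(path, "w", encoding="cp1252") as f:
    f.write(u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>")

result = replace_in_file(path, u"\u00abtest\u00bb", u"TEST", "cp1252")
```

With the right encoding, result is the line with TEST in place of «test». If the encoding is unknown, checking how the editor saved the file is more reliable than guessing, since several single-byte encodings will decode these bytes without error but may map them to different characters.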