I am trying to replace the hex escapes (#..) in a PDF file with their ASCII representation:
import re
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","rb") as file1:
    stuff = file1.read()
stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
with open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf","wb") as file1:
    file1.write(stuff)
file1 = open("C:\\Users\\Suleiman JK\\Desktop\\test\\hello-world-malformed.pdf")
print file1.read()
When I run it with "Geany", it gives the following error:
Traceback (most recent call last):
  File "testing.py", line 41, in <module>
    main()
  File "testing.py", line 31, in main
    stuff = re.sub("#([0-9A-Fa-f]{2})",lambda m:unichr(int(m.groups()[0],16)),stuff)
  File "C:\Python27\lib\re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x84 in position 239: ordinal not in range(128)
Answer 0: (score: 0)
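Don't use unichr(); it produces a unicode string containing a single character. Don't mix Unicode strings and byte strings (binary data), because doing so triggers an implicit encode or decode. Here the file contents are a byte string, so joining them with the unicode replacement triggers an implicit ASCII decode, which fails on the first non-ASCII byte (0x84).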
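Your code points are limited to 0-255, so a plain chr() will do. Note that the two hex digits are in capture group 1; group 0 is the whole match including the leading #, which int() cannot parse:
stuff = re.sub("#([0-9A-Fa-f]{2})", lambda m: chr(int(m.group(1), 16)), stuff)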
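For what it's worth, here is a minimal sketch of the same substitution written so that everything stays a byte string from start to finish. It is not part of the original code: the function name decode_hex_escapes is made up for illustration, and binascii.unhexlify is swapped in for chr() because it returns bytes on both Python 2.7 and Python 3.

```python
import binascii
import re

# Bytes pattern: matches '#' followed by exactly two hex digits.
HEX_ESCAPE = re.compile(br"#([0-9A-Fa-f]{2})")

def decode_hex_escapes(data):
    """Replace every #XX escape in a byte string with the single byte it encodes.

    Hypothetical helper for illustration. Both the pattern and the
    replacement are bytes, so no implicit decoding can happen.
    """
    # m.group(1) holds only the two hex digits, without the leading '#'.
    return HEX_ESCAPE.sub(lambda m: binascii.unhexlify(m.group(1)), data)

if __name__ == "__main__":
    sample = b"\x84 /H#65llo"
    print(repr(decode_hex_escapes(sample)))
```

The non-ASCII byte 0x84 passes through untouched, since nothing is ever decoded to text. To use it on the file, read with "rb", pass the contents through the function, and write the result back with "wb".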