How do I store random bytes in a file, given character encoding?

Time: 2018-12-21 18:27:36

Tags: python random encoding

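I am trying to run someone else's Python 2 program on Python 3 (on Windows 7). Its purpose is to generate large factorials and then use them as a stream of random numbers. The program converts the decimal factorial into byte values from 0 to 255 and writes chr(byte value) to a file. It computes each byte by stepping through the factorial in sections of 8 decimal digits. However, the encoding changed from Python 2 to 3 (I am not sure exactly how, or why it matters), and chr() no longer works for any value from 128 to 159 (values from 160 to 255 do work). The program raises "UnicodeEncodeError: 'charmap' codec can't encode character '(the character point)' in position 0: character maps to <undefined>".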
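The failure for 128 to 159 can be reproduced without the original program. The sketch below assumes the default text encoding on that Windows machine is cp1252 (the question does not say so; the 'charmap' wording in the error only suggests a code-page codec). The helper name `encodable` is made up for illustration.

```python
# Assumption: the default text encoding on the Windows machine is cp1252.
ENCODING = "cp1252"


def encodable(code_point, encoding=ENCODING):
    """Return True if chr(code_point) can be encoded with `encoding`."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Collect every value in 0..255 whose character cannot be written
# to a text file that uses this encoding.
failing = [i for i in range(256) if not encodable(i)]
print(failing[0], failing[-1], len(failing))  # 128 159 32
```

In Python 3, chr() returns a Unicode character rather than a byte, so a text-mode file has to encode it first, and cp1252 has no byte for the code points U+0080 to U+009F.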

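I tried changing the file encoding with open(filename, "w", encoding="utf-8"), and this wrote all the bytes successfully. However, when I tested the resulting files for randomness, they scored noticeably worse than the results the author got.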
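A likely reason for the worse scores is that UTF-8 does not store the values 128 to 255 as single bytes. The sketch below is not the original program; it simply encodes one character for each possible value and counts what would land in the file.

```python
from collections import Counter

values = list(range(256))                # one of each possible byte value
text = "".join(chr(v) for v in values)   # what chr() produces in Python 3
encoded = text.encode("utf-8")           # what a utf-8 text file would store

# Values 0..127 stay one byte each; values 128..255 become two bytes each.
print(len(values), len(encoded))         # 256 384

counts = Counter(encoded)
# The two-byte sequences all start with 0xC2 or 0xC3, so those two byte
# values are heavily over-represented and many byte values never occur.
print(counts[0xC2], counts[0xC3], len(counts))  # 64 64 194
```

For uniformly distributed input this makes the file roughly 1.5 times larger, which is consistent with the two file sizes in the ent reports in this question (471812 versus 313417 bytes). Text mode on Windows also writes "\n" as "\r\n" by default, which would skew the data slightly further.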

What should I change so that the character bytes are stored without affecting the randomness of the data?

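The test program is called "ent". From the command prompt, it takes a file as an argument and outputs some randomness statistics. For more information, see its website: http://www.fourmilab.ch/random/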

  • My ent results for the !500,000 file, written with open(filename, "w", encoding="utf-8"):

    Entropy = 6.251272 bits per byte.
    
    Optimum compression would reduce the size of this 471812 byte file by 21 percent.
    
    Chi square distribution for 471812 samples is 6545600.65, and randomly
    would exceed this value less than 0.01 percent of the times.
    
    Arithmetic mean value of data bytes is 138.9331 (127.5 = random).
    Monte Carlo value for Pi is 3.173294335 (error 1.01 percent).
    Serial correlation coefficient is 0.162915 (totally uncorrelated = 0.0).
    
  • The author's results for the !500,000 file:

    Entropy = 7.999373 bits per byte.
    
    Optimum compression would reduce the size of this 313417 byte file by 0 percent.
    
    Chi square distribution for 31347 samples is 272.63, and randomly would
    exceed this value 25.00 percent of the times.
    
    Arithmetic mean value of data bytes is 127.6336 (127.5 = random).
    Monte Carlo value for Pi is 3.149475458 (error 0.25 percent).
    Serial correlation coefficient is -0.001209 (totally uncorrelated = 0.0).
    

2 answers:

Answer 0 (score: 1)

It seems timakro has the answer (thanks):

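"To write a binary file, you should open it in binary mode, open(filename, "wb"), and write bytes-like objects to it. For example, to write a single byte with the value 123: file.write(bytes([123]))." -timakro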

When I write bytes([byte value from 0-255]) to the file, it gets the randomness scores expected from the ent program. So I changed Python 2's chr() to bytes() so that the program can store bytes under Python 3.
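A minimal sketch of that change, assuming the program produces integers in the range 0 to 255. The function name `write_byte_values` and the file name `random.bin` are made up for illustration and are not taken from the original program.

```python
import os
import tempfile


def write_byte_values(path, values):
    """Write each integer in `values` (0..255) to `path` as one raw byte."""
    with open(path, "wb") as f:       # binary mode: no encoding, no newline translation
        for v in values:
            f.write(bytes([v]))       # note the list: one byte with value v


# Write one of each possible value to a temporary file and read it back.
path = os.path.join(tempfile.mkdtemp(), "random.bin")
write_byte_values(path, range(256))

with open(path, "rb") as f:
    data = f.read()

print(len(data), data == bytes(range(256)))  # 256 True

# Pitfall: bytes(n) with a bare integer creates n zero bytes, not one byte.
print(bytes(3))  # b'\x00\x00\x00'
```

The list brackets matter: bytes([123]) is a single byte with the value 123, whereas bytes(123) is 123 zero bytes.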

Answer 1 (score: 0)

Here is an example (in Python 3):

# check if the characters are matching Unicode
l1 = [chr(i) for i in range(128, 160)]
print("{}\n".format(l1))

s1 = " ".join(l1)

# display these characters for visual comparison
# before writing them to file
print("INITIAL:")
print(s1)

# write the UTF-8 encoded characters in binary mode
with open("somefile", "wb") as pf:
    pf.write(s1.encode("utf-8"))

# read the raw bytes back
with open("somefile", "rb") as po:
    out = po.read()

s2 = out.decode('utf-8')

# display these characters for visual comparison    
# after writing them to file and reading them from it
print("AFTER:")
print(s2)  

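Here we test two theories:

  • Whether the characters (128 to 159) can be encoded
  • Whether we can write all the data to a file as binary

In the first demonstration, we can clearly see that the data does match the Unicode character map.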

As for the second theory, it is clear that we can write and retrieve the binary data in its original form, as the output shows:

[image: output]
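One caveat is worth checking when the file is meant for a byte-level tool such as ent. The round trip above recovers the same characters, but the bytes stored in the file are the UTF-8 encoding of those characters, not the values 128 to 159 themselves. A small sketch of the difference, using in-memory data only and joining the characters without spaces for simplicity:

```python
# The same 32 values, stored two different ways.
text = "".join(chr(i) for i in range(128, 160))

encoded = text.encode("utf-8")   # what the example above writes to the file
raw = bytes(range(128, 160))     # one raw byte per value

# Decoding recovers the original characters, so the text round trip works...
print(encoded.decode("utf-8") == text)  # True

# ...but the stored bytes differ from the raw values and take twice the space.
print(len(raw), len(encoded), encoded == raw)  # 32 64 False
```

So this approach preserves text, while writing bytes objects in binary mode preserves the byte values themselves.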