Question

I want to convert a file to binary, preferably with Python since that is the language I know best, but I could use another language if need be.

Basically, I need this for a project I am working on in which we want to store data using DNA strands, so I need to store files as binary ('A's and 'T's = 0, 'G's and 'C's = 1).

Any idea how I could do that? I did find that you can encode the file in base64 and then decode it, but that seems a bit inefficient, and the code I have doesn't seem to work...

import base64
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
    print(encoded)

Also, I already have a program that does this for plain text. Any tips on how to improve it would be appreciated as well!

import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16)).replace('b','') 
#removing the b that appears in the end for some reason
g = b.replace('1','G').replace('0','A')
print(g)

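For example, for text to DNA: I enter 'test' and expect a DNA sequence built from its binary form, which is 01110100011001010111001101110100 (I also print each conversion step in the example to make it easier to follow):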

>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G

Answer 1

Of course it is inefficient! base64 is designed to store binary data as text, so the converted data is larger than the original.
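To make the size overhead concrete, here is a minimal sketch. The helper name `b64_overhead` and the 300 random bytes standing in for file contents are my own assumptions, not part of the question. It also shows the call shape `b64encode` expects: a single bytes object (for a file, what `open(path, 'rb')` and `f.read()` return), not the list of lines that `readlines()` gives.

```python
import base64
import os

def b64_overhead(data: bytes) -> float:
    """Return encoded size divided by raw size (hypothetical helper for illustration)."""
    return len(base64.b64encode(data)) / len(data)

raw = os.urandom(300)            # arbitrary bytes standing in for a file's contents
encoded = base64.b64encode(raw)  # b64encode takes bytes, not a list of lines
print(len(raw), len(encoded), b64_overhead(raw))
```

Every 3 input bytes become 4 output characters, so the encoded form is about a third larger than the raw data.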

btw: what kind of efficiency do you want? Compactness?

If so: your second sample is closer to what you want.

btw: with your mapping you lose information! Did you know that?
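A quick way to see the loss, as a sketch (the helper `to_bits` is a name I made up for illustration): with A and T both mapped to 0 and G and C both mapped to 1, different strands collapse to the same bit string, so the original strand cannot be recovered from the bits alone.

```python
def to_bits(strand: str) -> str:
    """Map A/T to 0 and G/C to 1, as in the question (hypothetical helper)."""
    return strand.translate(str.maketrans("ATGC", "0011"))

print(to_bits("ATGC"))  # 0011
print(to_bits("TAGG"))  # 0011 as well, so the two strands are indistinguishable
```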

Here is an example of how to store and restore the data.

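It stores the data in an easy-to-read hex-in-text format, purely for demonstration. If you want compactness, you can easily modify the code to store the data in a binary file; if you want a 00011001-style view, that modification is easy too.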

import math
import numpy as np

# Make a long test string of random bases.
s = ''.join(str(x) for x in np.random.randint(4, size=33)) \
    .replace('0', 'A').replace('1', 'T').replace('2', 'G').replace('3', 'C')

def store_(s):
    # The bit string gets padded to a multiple of 8, so remember the true size and store it with the data.
    size = len(s)
    s2 = s.replace('A', '0').replace('T', '0').replace('G', '1').replace('C', '1') \
        .ljust(int(math.ceil(size / 8.) * 8), '0')  # pad with '0' on the right to a multiple of 8
    a = (hex(int(s2[i*8:i*8+8], 2))[2:].rjust(2, '0') for i in range(len(s2) // 8))
    return ''.join(a), size

yourDataAsHexInText, sizeToStore = store_(s)
print(yourDataAsHexInText, sizeToStore)


def restore_(s, size=None):
    if size is None: size = len(s) * 4  # default: every decoded bit (4 per hex digit)
    a = (bin(int(s[i*2:i*2+2], 16))[2:].rjust(8, '0') for i in range(len(s) // 2))
    # You lose information, remember? So the result contains only A or G.
    return ''.join(a).replace('1', 'G').replace('0', 'A')[:size]

restore_(yourDataAsHexInText, sizeToStore)


print("so check it")
print(s, "(input)")
print(store_(s))
print(s.replace('C', 'G').replace('T', 'A'), "to compare with information loss")
print(restore_(*store_(s)), "restored")
print(s.replace('C', 'G').replace('T', 'A') == restore_(*store_(s)))

Result of my test:

63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
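If compactness matters more than readability, the same bits can be packed into raw bytes instead of hex text. This is only a sketch of that modification: `pack_bits` and `unpack_bits` are names I invented for illustration, and the right-side zero padding mirrors what the code above does.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into bytes, zero-padded on the right."""
    padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    return int(padded, 2).to_bytes(len(padded) // 8, "big") if padded else b""

def unpack_bits(data: bytes, size: int) -> str:
    """Recover the first `size` bits from bytes produced by pack_bits."""
    return "".join(format(byte, "08b") for byte in data)[:size]

strand = "AGCAATGCCGATGTTCATCGTATACTTTGACTA"  # the 33-base input from the test run above
bits = strand.translate(str.maketrans("ATGC", "0011"))
packed = pack_bits(bits)
print(packed.hex(), len(bits))  # 63c9308a00 33
```

The packed form takes 5 bytes here, half the 10 characters of the hex text. As before, the true bit count has to be stored alongside the bytes so the padding can be cut off when restoring, and `packed` can be written out with a file opened in `'wb'` mode.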

Answer 2

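So, thanks to @jonrshape and Sergey Vturin, I was finally able to achieve what I wanted! My program asks for a file, turns it into binary, and then gives me an equivalent "DNA code" using pairs of binary digits (00 = A, 01 = T, 10 = G, 11 = C).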

import binascii
from tkinter import filedialog

file_path = filedialog.askopenfilename()

x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
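        # Decode the hex digits as text; stripping "b" from str(bytes) would also delete the hex digit b.
        x += binascii.hexlify(chunk).decode('ascii')
# Slice off the "0b" prefix and zero-pad to 4 bits per hex digit so leading zero bits are kept.
b = bin(int(x, 16))[2:].zfill(len(x) * 4) if x else ""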
g = [b[i:i+2] for i in range(0, len(b), 2)]
dna = ""
for i in g:
    if i == "00":
        dna += "A"
    elif i == "01":
        dna += "T"
    elif i == "10":
        dna += "G"
    elif i == "11":
        dna += "C"
print(x) #hexdump
print(b) #converted to binary
print(dna) #converted to "DNA"
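Unlike the one-bit mapping in the question, this two-bits-per-base mapping is reversible, so the file can be rebuilt from the DNA string. Below is a minimal sketch of the round trip; `BASES`, `bytes_to_dna` and `dna_to_bytes` are hypothetical names of mine, and it works on bytes directly rather than going through a hex string.

```python
BASES = "ATGC"  # index is the 2-bit value: 00=A, 01=T, 10=G, 11=C

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(BASES[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0))

def dna_to_bytes(dna: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    if len(dna) % 4:
        raise ValueError("DNA length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # append two bits per base
        out.append(byte)
    return bytes(out)

encoded_dna = bytes_to_dna(b"test")
print(encoded_dna)                # TCTATGTTTCACTCTA
print(dna_to_bytes(encoded_dna))  # b'test'
```

For a real file, the input would be the bytes read with `open(file_path, 'rb')`, and the decoded bytes would be written back with a file opened in `'wb'` mode.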
