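I want to convert a file to binary, preferably in Python since that is the language I know best, but I can use another language if needed.
Basically, I need this for a project I am working on in which we want to store data using DNA strands, so the files need to be stored as binary ('A's and 'T's = 0, 'G's and 'C's = 1).
Any idea how I could do this? I did find that I could encode the file in base64 and then decode it, but that seems a bit inefficient, and the code I have doesn't seem to work...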
import base64
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
root.withdraw()
file_path = filedialog.askopenfilename()
print(file_path)
with open(file_path) as f:
    encoded = base64.b64encode(f.readlines())
    print(encoded)
Also, I already have a program that does this for plain text. Any tips on how to improve it would also be appreciated!
import binascii
t = bytearray(str(input("Texte?")), 'utf8')
h = binascii.hexlify(t)
b = bin(int(h, 16)).replace('b','')
# removing the 'b' of the '0b' prefix that bin() puts at the start
g = b.replace('1','G').replace('0','A')
print(g)
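For example, for text to DNA: I enter 'test' and expect the DNA sequence derived from its binary form. The binary is 01110100011001010111001101110100 (I also had the example print every conversion step so that it is easier to follow).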
>>>Texte?test #Asks the text
>>>b'74657374' #converts to hex
>>>01110100011001010111001101110100 #converts to binary
>>>AGGGAGAAAGGAAGAGAGGGAAGGAGGGAGAA #converts 0 to A and 1 to G
Answer 0 (score: 0)
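Of course it is inefficient!
base64 is designed to store binary data as text, so the encoded output is larger than the input (about a third larger).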
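To make the size increase concrete, here is a minimal sketch that measures it on in-memory sample data. The `raw` payload is made up for illustration; a real file would be read with `open(path, 'rb')` and `f.read()`, since `base64.b64encode` needs a bytes-like object rather than a list of text lines.

```python
import base64

raw = bytes(range(256)) * 3          # 768 bytes of arbitrary sample data
encoded = base64.b64encode(raw)      # bytes in, ASCII bytes out

# base64 emits 4 output characters for every 3 input bytes.
print(len(raw), len(encoded))        # 768 1024

# The round trip is lossless; only the size changes.
assert base64.b64decode(encoded) == raw
```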
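By the way, what kind of efficiency are you after? Compactness?
If so, your second sample is closer to what you want.
Also note that your scheme loses information: A and T both become 0, and G and C both become 1. Are you aware of that?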
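The information loss is easy to see with a small sketch. The helper name `to_bits` is hypothetical; it only applies the mapping from the question (A/T to 0, G/C to 1) and shows two different strands collapsing to the same bit string, so the original strand cannot be recovered from the bits alone.

```python
def to_bits(strand):
    """Collapse a strand to bits with the question's mapping: A/T -> 0, G/C -> 1."""
    table = str.maketrans("ATGC", "0011")
    return strand.translate(table)

first = "ATGC"
second = "TACG"

# Two different strands produce identical bits.
print(to_bits(first), to_bits(second))  # 0011 0011
```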
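Here is an example of how to store and restore the data.
It stores the data in an easy-to-read hex-in-text format, purely for demonstration. If you want compactness, you can easily modify the code to store the data in a binary file; if you want a 00011001-style view, that modification is easy too.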
import math
import numpy as np

# make a long test string
s = ''.join(str(x) for x in np.random.randint(4, size=33)) \
    .replace('0', 'A').replace('1', 'T').replace('2', 'G').replace('3', 'C')

def store_(s):
    # the bit string gets padded to a multiple of 8, so remember the true length and store it with the data
    size = len(s)
    s2 = s.replace('A', '0').replace('T', '0').replace('G', '1').replace('C', '1') \
        .ljust(int(math.ceil(size / 8.) * 8), '0')  # right-pad with '0' to a multiple of 8
    # int(..., 2) parses each 8-bit group; each byte becomes two hex digits
    a = (hex(int(s2[i*8:i*8+8], 2))[2:].rjust(2, '0') for i in range(len(s2) // 8))
    return ''.join(a), size

yourDataAsHexInText, sizeToStore = store_(s)
print(yourDataAsHexInText, sizeToStore)

def restore_(s, size=None):
    if size is None:
        size = len(s) * 4  # without a stored length, keep every bit (4 per hex digit)
    a = (bin(int(s[i*2:i*2+2], 16))[2:].rjust(8, '0') for i in range(len(s) // 2))
    # you lose information, remember? so the result is only A or G
    return ''.join(a).replace('1', 'G').replace('0', 'A')[:size]

restore_(yourDataAsHexInText, sizeToStore)

print("so check it")
print(s, "(input)")
print(store_(s))
print(s.replace('C', 'G').replace('T', 'A'), "to compare with information loss")
print(restore_(*store_(s)), "restored")
print(s.replace('C', 'G').replace('T', 'A') == restore_(*store_(s)))
Result of my test:
63c9308a00 33
so check it
AGCAATGCCGATGTTCATCGTATACTTTGACTA (input)
('63c9308a00', 33)
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA to compare with information loss
AGGAAAGGGGAAGAAGAAGGAAAAGAAAGAGAA restored
True
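The answer notes that the hex text can be swapped for binary storage if compactness matters. A rough sketch of that variant follows; the names `pack_strand` and `unpack_strand` are made up for illustration and are not part of the answer's code. It keeps the same lossy mapping and the same idea of storing the true length next to the padded data.

```python
def pack_strand(strand):
    """Return (packed_bytes, symbol_count) using A/T -> 0 and G/C -> 1."""
    bits = strand.translate(str.maketrans("ATGC", "0011"))
    size = len(bits)
    bits = bits.ljust(-(-size // 8) * 8, "0")   # right-pad to a whole number of bytes
    packed = int(bits, 2).to_bytes(len(bits) // 8, "big") if bits else b""
    return packed, size

def unpack_strand(packed, size):
    """Rebuild the (lossy) A/G strand from packed bytes and the stored length."""
    bits = "".join(format(byte, "08b") for byte in packed)
    return bits[:size].replace("1", "G").replace("0", "A")

strand = "AGCAATGCCGATGTTCATCGTATACTTTGACTA"   # the test input shown above
packed, size = pack_strand(strand)
print(packed.hex(), size)                      # 63c9308a00 33
print(unpack_strand(packed, size))
```

The packed value is 5 bytes instead of 10 hex characters; `packed` could be written to disk with `open(path, 'wb')`, with `size` stored alongside it.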
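Answer 1 (score: 0)
So, thanks to @jonrshape and Sergey Vturin, I was finally able to achieve what I wanted! My program asks for a file, converts it to binary, and then gives me the equivalent "DNA code" using pairs of binary digits (00 = A, 01 = T, 10 = G, 11 = C).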
import binascii
from tkinter import filedialog

file_path = filedialog.askopenfilename()

x = ""
with open(file_path, 'rb') as f:
    for chunk in iter(lambda: f.read(32), b''):
        # decode() instead of str(...).replace("b", ""), which would also delete the hex digit 'b'
        x += binascii.hexlify(chunk).decode('ascii')

# 4 bits per hex digit keeps leading zeros, which bin(int(x, 16)) would drop
b = ''.join(format(int(d, 16), '04b') for d in x)
g = [b[i:i+2] for i in range(0, len(b), 2)]

dna = ""
for i in g:
    if i == "00":
        dna += "A"
    elif i == "01":
        dna += "T"
    elif i == "10":
        dna += "G"
    elif i == "11":
        dna += "C"

print(x)    # hexdump
print(b)    # converted to binary
print(dna)  # converted to "DNA"
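The program above only encodes. Because the two-bit mapping assigns a distinct pair to each base, it can be reversed, and a round trip is a quick way to check an encoder. A minimal sketch, assuming hypothetical helper names `bytes_to_dna` and `dna_to_bytes` (not from the answer) and formatting each byte as exactly eight bits so that leading zero bits survive:

```python
PAIR_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

def bytes_to_dna(data):
    """Encode bytes as bases, two bits per base (four bases per byte)."""
    bits = "".join(format(byte, "08b") for byte in data)
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(dna):
    """Decode a strand produced by bytes_to_dna back into the original bytes."""
    bits = "".join(BASE_TO_PAIR[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sample = b"test"
strand = bytes_to_dna(sample)
print(strand)                          # TCTATGTTTCACTCTA
assert dna_to_bytes(strand) == sample  # lossless round trip
```

For a real file, `data` would come from `open(file_path, 'rb').read()`; very large files may call for chunked processing instead of one big string.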