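So I'm reading in a .txt file similar to this: TTACGATATACGA etc., but containing thousands of characters. Right now I can read a file and output it as a CSV based on user input, which determines the number of characters per column and the number of columns, but a new file is written each time.
Ideally, I'd like each file to have this format:
User inputs 4 and 3.
Output: TCAG,TGCT,TACG,
What I get is:
TCAGTGCTTACG
I've tried looking at string splitting, but I can't seem to get it to work.
Here's what I have written so far; apologies if it's bad: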
import os

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))
#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count =1
#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
    line = line.strip()
    if not line:
        continue
    lines.append(',')
#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = read_file.__len__()
print((file_size))
i = 1
index = 0
#use loop to make more than one file output
while(index < 50):
    #print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ',index)
    #intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)
    index_start += characters_to_read
    index_finish +=characters_to_read
    #output a txt file with the 4 letters from intake as a individually numbered txt file
    text_file_output = open("Output%i.csv"%i,'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
    #define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)
test_file.close()
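Answer 0 (score: 0)
Here's a simple way to split DNA data into rows made up of chunks of a specified size, with a specified number of columns per row. It assumes the DNA data is a single string with no whitespace characters (spaces, tabs, newlines, etc.).
To test this code, I created some fake data using the random module.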
from random import seed, choice
seed(42)
# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')
# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))
Output
AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
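On an old 2GHz 32-bit machine running Python 3.6.0, this code can process and save to disk about 100000 characters per second (including the time needed to generate the random data).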
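The chunking loop above can also be packaged as a small reusable function. The following is a minimal sketch, not part of the original code; the name `split_into_rows` is hypothetical. It slices the string into fixed-size chunks and then groups those chunks into rows, and it is shown here on the example from the question (4 characters per column, 3 columns).

```python
def split_into_rows(data, chunksize, cols):
    """Split data into chunks of chunksize characters, grouped cols per row.

    Hypothetical helper for illustration; assumes data has no whitespace.
    """
    # Slice the string into fixed-size chunks (the last one may be shorter)
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # Group the chunks into rows of at most cols chunks each
    return [chunks[i:i + cols] for i in range(0, len(chunks), cols)]


# The example from the question: 4 characters per column, 3 columns
rows = split_into_rows('TCAGTGCTTACG', 4, 3)
for row in rows:
    print(','.join(row))  # prints TCAG,TGCT,TACG
```

Returning a list of rows keeps the splitting separate from the output step, so the same rows can be printed, joined with a different separator, or written to a file.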
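Here is a version of the chunking code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.
First, here is the code I used to create some fake test data, which I saved to "dnatest.txt".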
from random import seed, choice, randrange
seed(123)
# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()
Here is the file it creates:
AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
Here is the code that processes that data:
# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'
# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())
# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')
Here is the file it creates:
AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T
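If the spaces after the commas are not wanted (the question asks for TCAG,TGCT,TACG), one option is to let Python's standard csv module write the rows instead of joining the strings by hand. This is only a sketch under the same assumptions as the code above; the function name `write_dna_csv` and the file name 'dnatest_csv_module.csv' are made up for illustration.

```python
import csv


def write_dna_csv(data, chunksize, cols, oname):
    """Write data to oname as CSV rows of cols chunks, chunksize characters each.

    Hypothetical helper for illustration only.
    """
    # Eliminate all whitespace, as in the code above
    data = ''.join(data.split())
    chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
    # newline='' lets the csv module control the line endings itself
    with open(oname, 'w', newline='') as f:
        writer = csv.writer(f)
        for i in range(0, len(chunks), cols):
            writer.writerow(chunks[i:i + cols])


# Small demonstration with embedded whitespace and a short final row
write_dna_csv('TCAG TGCT\nTACG TT', 4, 3, 'dnatest_csv_module.csv')
```

The demonstration writes two rows: the three chunks TCAG, TGCT, TACG separated by plain commas, followed by a final row holding the leftover TT.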