I have a tab-delimited text file, such as the following example:
infile:
chr1 + 1071396 1271396 LOC
chr12 + 1101483 1121483 MIR200B
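For each row of infile, I want to divide the difference between columns 3 and 4 by 100, produce 100 rows from that row, and write the result to a new file named newfile.
The final tab-separated file should have 6 columns. The first 5 columns are similar to those in infile, and the 6th column is (column 5)_part followed by a number from 1 to 100.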
Here is the expected output file:
expected output:
chr1 + 1071396 1073396 LOC LOC_part1
chr1 + 1073396 1075396 LOC LOC_part2
.
.
.
chr1 + 1269396 1271396 LOC LOC_part100
chr12 + 1101483 1101683 MIR200B MIR200B_part1
chr12 + 1101683 1101883 MIR200B MIR200B_part2
.
.
.
chr12 + 1121283 1121483 MIR200B MIR200B_part100
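I wrote the following code to get the expected output, but it does not return the expected result. Specifically, columns 3 and 4 of the output are incorrect. The problem is in the 2nd part of the code.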
file = open('infile.txt', 'rb')
cont = []
for line in file:
    cont.append(list(filter(lambda x: not x.isspace(), line.split('\t'))))
new = []
for i in cont:
    new.append([s.replace('\n', '') for s in i])
newfile = []
for i in new:
    diff= (int(i[3])-int(i[2]))/100
    left = int(i[2])
    right = int(i[2]) + diff
    for j in range(100):
        add = [i[0], i[1], left, right, i[4],str(i[4])+'_part' + str(j)]
        newfile.append(add)
with open('output.txt', 'w') as f:
    for i in newfile:
        for j in i:
            f.write(i + '\n')
Do you know how to fix this?
Answer 0 (score: 0)
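First, you don't need to compute the value of diff on every iteration, because it is always the same for a given row. Compute it once and reuse it.
Also, since only two columns are of interest, you can read them easily with string.split().
Here is a general example: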
x = 'chr1 + 1071396 1271396 LOC'  # assuming we are reading this line from the file
x = x.split()  # split on whitespace; gives a list of 5 fields
left_num = int(x[2])  # convert the coordinates to int
right_num = int(x[3])
diff = (right_num - left_num) // 100  # width of each part, computed only once
last_column = x[4] + "_part"  # prefix for the 6th column
with open("output.txt", "w") as op_file:  # open the output file for writing
    for num in range(1, 101):
        start = left_num + (num - 1) * diff  # left edge of this part
        end = start + diff  # right edge; equals right_num for part 100 when the range divides evenly by 100
        op_file.write('{}\t{}\t{}\t{}\t{}\t{}\n'.format(x[0], x[1], start, end, x[4], last_column + str(num)))
This gives you (rows for parts 4 to 98 omitted here):
chr1 + 1071396 1073396 LOC LOC_part1
chr1 + 1073396 1075396 LOC LOC_part2
chr1 + 1075396 1077396 LOC LOC_part3
.
.
.
chr1 + 1267396 1269396 LOC LOC_part99
chr1 + 1269396 1271396 LOC LOC_part100
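The example above works on a single hard-coded line. To cover every row of infile, the same idea can be wrapped in a loop over the file. The following is only a minimal sketch, not a definitive implementation: the function name split_regions and its parts parameter are hypothetical names introduced for illustration, and it assumes the input really is tab-separated with at least 5 columns per row. It also assumes that, when the range does not divide evenly, the last part should end exactly at the original column 4 value.

```python
import csv


def split_regions(in_path, out_path, parts=100):
    """Split the range in columns 3-4 of each row into `parts` consecutive pieces.

    Hypothetical helper for illustration; assumes a tab-separated input
    with at least 5 columns per row.
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row in reader:
            if len(row) < 5:
                # Skip blank or malformed lines instead of failing on them.
                continue
            chrom, strand, start, end, name = row[:5]
            start, end = int(start), int(end)
            step = (end - start) // parts  # integer width of each piece, computed once per row
            for k in range(parts):
                left = start + k * step
                # Assumption: the last piece ends at `end`, absorbing any remainder.
                right = end if k == parts - 1 else left + step
                writer.writerow([chrom, strand, left, right, name, f"{name}_part{k + 1}"])
```

Calling split_regions("infile.txt", "newfile.txt") would write 100 rows for each input row. Each left edge is computed from the original start rather than accumulated from the previous row, so the parts cannot drift away from the input range.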