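I have two data files (datafile1 and datafile2). I want to add some information from datafile2 to datafile1, but only when certain requirements are met, and then write everything to a new file.
Here is a sample of datafile1 (I replaced the tabs with spaces to make it easier to read):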
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt
Here is a sample of datafile2:
#GInumber OTU Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
1366104624 OTU49 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta Hymenoptera Braconidae Leiophron NA
342734543 OTU171 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera Limacodidae Euphobetron Euphobetron cupreitincta
290756623 OTU803 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae Apocheima Apocheima pilosaria
296792336 OTU2519 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
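For each line of datafile1, I want to find the line in datafile2 with the same "OTU" and always add the GInumber, Accssn, Ident, Len, M, Gap, Qs, Qe, Ss, Se, evalue, bit, phylum, and class from datafile2. Depending on the value of Ident, I also want to add the order, family, genus, and species according to the following conditions: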
Case #1: Ident > 98.0, add order, family, genus, and species
Case #2: Ident between 96.5 and 98.0, add order, family, "NA", "NA"
Case #3: Ident between 95.0 and 96.5, add order, "NA", "NA", "NA"
Case #4: Ident < 95.0 add "NA", "NA", "NA", "NA"
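The four cases can be sketched as a small helper. This is only an illustration: the function name `trim_ranks` and the constant names are hypothetical, and it assumes that a value lying exactly on a cut-off belongs to the higher case (the list above says "> 98.0", while the script below uses ">=").

```python
# Hypothetical helper; cut-offs taken from the four cases listed above.
SPECIES_LEVEL = 98.0
FAMILY_LEVEL = 96.5
ORDER_LEVEL = 95.0


def trim_ranks(ident, ranks):
    """Return [order, family, genus, species] with unsupported ranks as "NA".

    Assumption: a value exactly on a cut-off counts toward the higher case.
    """
    if ident >= SPECIES_LEVEL:
        keep = 4  # Case #1: order, family, genus, species
    elif ident >= FAMILY_LEVEL:
        keep = 2  # Case #2: order, family
    elif ident >= ORDER_LEVEL:
        keep = 1  # Case #3: order only
    else:
        keep = 0  # Case #4: nothing below class
    # Slicing keeps the result a list even when only one rank is kept.
    return list(ranks[:keep]) + ["NA"] * (4 - keep)
```

For the OTU171 row (Ident 95.513), `trim_ranks(95.513, ["Lepidoptera", "Limacodidae", "Euphobetron", "Euphobetron cupreitincta"])` gives `["Lepidoptera", "NA", "NA", "NA"]`.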
The desired output is:
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta Lepidoptera NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
I wrote this script:
import csv

#Files
besthit_taxonomy_unique_file = "datafile2.txt"
OTUtablefile = "datafile1.txt"
outputfile = "outputfile.txt"

#Settings
OrderLevel = float(95.0)
FamilyLevel = float(96.5)
SpeciesLevel = float(98.0)

#Importing the OTU table, which is tab delimited
OTUtable = list(csv.reader(open(OTUtablefile, 'rU'), delimiter='\t'))
headerOTUs = OTUtable.pop(0)

#Importing the best hit taxonomy table, which is tab delimited
taxonomytable = list(csv.reader(open(besthit_taxonomy_unique_file, 'rU'), delimiter='\t'))
headertax = taxonomytable.pop(0)
headertax.pop(1)

#Getting the header info
totalheader = headerOTUs + headertax

#Merging and assigning the taxonomy at the appropriate level
outputtable = []
NAs = 4 * ["NA"] #This is a list of NAs so that I can add the appropriate number, depending on the Identity
for item in OTUtable:
    OTU = item #Just to prevent issues with the list of lists
    OTUIDtable = OTU[0]
    print OTUIDtable
    for thing in taxonomytable:
        row = thing #Just to prevent issues with the list of lists
        OTUIDtax = row[1]
        if OTUIDtable == OTUIDtax:
            OTU.append(row[0])
            OTU += row[2:15]
            PercentID = float(row[3])
            if PercentID >= SpeciesLevel:
                OTU += row[15:]
            elif FamilyLevel <= PercentID < SpeciesLevel:
                OTU += row[15:17]
                OTU += NAs[:2]
            elif OrderLevel <= PercentID < FamilyLevel:
                print row[15]
                OTU += row[15]
                OTU += NAs[:3]
            else:
                OTU += NAs
    outputtable.append(OTU)

#Writing the output file
f1 = open(outputfile, 'w')
for item in totalheader[0:-1]:
    f1.write(str(item) + '\t')
f1.write(str(totalheader[-1]) + '\n')
for row in outputtable:
    currentrow = row
    for item in currentrow[0:-1]:
        f1.write(str(item) + '\t')
    f1.write(str(currentrow[-1]) + '\n')
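For the most part the output is correct, except in case #3 (Ident between 95.0 and 96.5), where the script writes the order entry with a tab between every letter.
Here is an example of the output: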
#OTU S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 Seq GInumber Accssn Ident Len M Gap Qs Qe Ss Se evalue bit phylum class order family genus species
OTU49 0 0 0 0 0 16 0 0 0 0 0 0 1 0 0 0 0 0 catat 1366104624 MG926900 82.911 158 23 4 2 157 18 173 2.17e-29 139 Arthropoda Insecta NA NA NA NA
OTU171 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 gattt 342734543 JN305047 95.513 156 7 0 2 157 23 178 9.63e-63 250 Arthropoda Insecta L e p i d o p t e r a NA NA NA
OTU803 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 aactt 290756623 GU580785 96.753 154 5 0 4 157 10 163 5.75e-65 257 Arthropoda Insecta Lepidoptera Geometridae NA NA
OTU2519 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 aattt 296792336 GU688553 98.039 153 3 0 1 153 18 170 9.56e-68 267 Arthropoda Insecta Lepidoptera Geometridae Operophtera Operophtera brumata
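I just can't figure out what is going wrong. In the other cases, order seems to contain the right information, but in this case the information in order appears to be stored as a list of lists. However, the output on the screen looks like this: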
OTU171
Lepidoptera
which doesn't seem to indicate a list of lists...
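A likely cause, sketched in isolation (Python 3 syntax here, but the behavior is the same in Python 2): `+=` on a list extends it with each element of the right-hand side, and iterating over a string yields its characters. `row[15]` is a single string while `row[15:17]` is a list, which would explain why only case #3 is affected and why `print row[15]` still shows an ordinary string. The values below are copied from the OTU171 row; the variable names are made up for illustration.

```python
# Shortened stand-in for one taxonomy row (illustrative, not the full row).
row = ["Arthropoda", "Insecta", "Lepidoptera", "Limacodidae"]

as_string = ["OTU171"]
as_string += row[2]        # row[2] is a str: every character becomes an item

as_slice = ["OTU171"]
as_slice += row[2:3]       # row[2:3] is a one-element list: one item is added

as_append = ["OTU171"]
as_append.append(row[2])   # append adds the whole string as a single item

print(as_string)
print(as_slice)
print(as_append)
```

In the script above, that would mean replacing `OTU += row[15]` with either `OTU += row[15:16]` or `OTU.append(row[15])`.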
I would be grateful for any insight. If anyone has ideas for making my code more pythonic, I would appreciate that as well.
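One possible direction for a more pythonic version, written for Python 3 and offered as a sketch rather than a drop-in replacement: index the taxonomy rows by OTU in a dict so each OTU is a single lookup, and let `csv.writer` handle the tab separators. The names `merge_rows`, `write_table`, and `CUTOFFS` are hypothetical. The sketch assumes the 19-column layout of datafile2 shown above, treats a value exactly on a cut-off as belonging to the higher case, and skips OTUs that have no taxonomy hit.

```python
import csv

# (minimum Ident, number of ranks kept), checked from highest to lowest.
CUTOFFS = [(98.0, 4), (96.5, 2), (95.0, 1)]


def merge_rows(otu_rows, tax_rows):
    """Join OTU rows with taxonomy rows on the OTU id.

    Assumes taxonomy columns in the order: GInumber, OTU, Accssn, Ident,
    ..., phylum, class, order, family, genus, species (19 columns).
    """
    # Build the index once instead of scanning the table for every OTU.
    tax_by_otu = {row[1]: row for row in tax_rows}
    merged = []
    for otu in otu_rows:
        tax = tax_by_otu.get(otu[0])
        if tax is None:
            continue  # assumption: OTUs without a hit are left out
        ident = float(tax[3])
        # First cut-off that Ident reaches decides how many ranks to keep.
        keep = next((n for cutoff, n in CUTOFFS if ident >= cutoff), 0)
        ranks = tax[15:15 + keep] + ["NA"] * (4 - keep)
        # GInumber, then Accssn..class, then the trimmed ranks.
        merged.append(otu + tax[:1] + tax[2:15] + ranks)
    return merged


def write_table(path, header, rows):
    """Write a tab-delimited file with one header line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
```

Because the ranks are always taken with a slice, the single-string problem from case #3 cannot occur here, and the `with` block closes the output file automatically.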
Andreanna