I am having some trouble processing files.
I am trying to compare these two files:
File 1 (p123):
Number,Position,Peptide
1,62,dgwgkvttfpsva
2,189,vtikndteddsin
3,157,qqgkastppvkks
3,165,pvkksftpskspa
3,181,kkepvktpspapa
3,293,tkppsmtesslkn
3,30,rgsdpdttwliis
3,31,gsdpdttwliisp
3,526,ppprratpekkpk
4,150,etegiatpkqken
4,194,qsngketenaena
4,312,egrgdntgdqnav
4,328,dfeksdtegsrig
4,347,fgkrnlteesdvw
4,84,nlpkpetneedee
5,25,qnpllytdflssn
File 2 (n123):
Number,Position,Peptide
1,4,None
1,7,mlrtrltncslwr
1,17,lwrpyytsslsrv
1,50,vnkidltvgiykd
1,62,dgwgkvttfpsva
1,63,gwgkvttfpsvak
1,90,lsylpitgskefq
1,126,risfvqtlsgtga
1,130,vqtlsgtgalava
1,192,wieqlktfaynnq
1,218,acchnptgldptk
1,223,ptgldptkeqwek
1,233,wekiidtiyelkm
1,302,gslsvitpatann
1,305,svitpatanngkf
1,400,hgmfyytrfspkq
1,419,nyfvyltgdgrls
2,32,ggkkfptlgawyd
2,47,neyefqtrcpiil
2,63,hrnkhftfachlk
2,88,naassetsspsan
2,97,psannntnppgtp
2,102,ntnppgtpdhihh
....
5,356,pfssmhttatfqi
5,357,fssmhttatfqik
5,359,smhttatfqikqe
5,375,qkienntaglkdg
5,424,qiskentmmkkki
5,452,lhmqectinggnn
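As you can see, file 1 is a subset of file 2. My goal is to find the non-overlapping rows and output the third column, i.e. the peptide.
Here is my code: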
import csv
n = open(r'C:\Users\dengziqi\Desktop\n123.csv','r')
p = open(r'C:\Users\dengziqi\Desktop\p123.csv','r')
n1 = csv.reader(n)
p1 = csv.reader(p)
for p2 in p1:
    for n2 in n1:
        if n2[1]!= p2[1]:
            print n2[2]
I have run my code, but it does not filter anything; it just prints the original column.
Expected result:
Number,Position,Peptide
1,4,None
1,7,mlrtrltncslwr
1,17,lwrpyytsslsrv
1,50,vnkidltvgiykd
1,63,gwgkvttfpsvak
1,90,lsylpitgskefq
1,126,risfvqtlsgtga
....
2,32,ggkkfptlgawyd
2,47,neyefqtrcpiil
2,63,hrnkhftfachlk
2,88,naassetsspsan
2,97,psannntnppgtp
2,102,ntnppgtpdhihh
2,138,skldfvtddleyh
2,148,eyhlanthpddtn
2,153,nthpddtndkves
2,184,fkqqgvtikndte
2,189,vtikndteddsin
2,210,ddesgpthgndsg
2,228,eeddvhtqmtkny
2,231,dvhtqmtknysdv
.....
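New requirement (20.03.2017): under the same Number, if a position in file 2 (n123) falls within ±50 of a position in file 1 (p123), discard that row and output the remaining peptides.
For example:
File 1 (p123):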
Number,Position,Peptide
1,62,dgwgkvttfpsva
....
File 2 (n123):
Number,Position,Peptide
1,4,None
1,7,mlrtrltncslwr
1,17,lwrpyytsslsrv
1,50,vnkidltvgiykd
1,62,dgwgkvttfpsva
1,63,gwgkvttfpsvak
1,90,lsylpitgskefq
1,126,risfvqtlsgtga
1,130,vqtlsgtgalava
1,192,wieqlktfaynnq
1,218,acchnptgldptk
1,223,ptgldptkeqwek
1,233,wekiidtiyelkm
1,302,gslsvitpatann
1,305,svitpatanngkf
1,400,hgmfyytrfspkq
1,419,nyfvyltgdgrls
....
So after the comparison, I need to get:
Number,Position,Peptide
1,4,None
1,7,mlrtrltncslwr
1,126,risfvqtlsgtga
1,130,vqtlsgtgalava
1,192,wieqlktfaynnq
1,218,acchnptgldptk
1,223,ptgldptkeqwek
1,233,wekiidtiyelkm
1,302,gslsvitpatann
1,305,svitpatanngkf
1,400,hgmfyytrfspkq
1,419,nyfvyltgdgrls
I have written some code that works, but the script takes too long to run. I would like to know how to improve it.
My code:
import pandas as pd
import numpy as np
import re
dfn=pd.read_csv('n123.csv')
dfp=pd.read_csv('p123.csv')
collection1 = []
for index, row in dfp.iterrows():
    for index2, row2 in dfn.iterrows(): # iterate over both dataframes
        if row2[0]== row[0]: # check whether both rows are under the same Number
            if int(row2[1])in range(int(row[1])-50, int(row[1])+50) : # check whether the position in n123 ∈ [position in p123 ± 50]
                collection1.append(row2[2]) # collect the corresponding peptide
collection2 = list(set(collection1)) # remove duplicate peptides
collection3 = dfn.iloc[:,2].tolist() # all peptides from n123 as a list
collection4 = list(set(collection3) - set(collection2)) # n123 peptides minus collection2 = peptides whose position is not in range (position in p123 ± 50)
ng = open("purenegativecollectiondata1.txt", "wb")
for ip in collection4:
    ng.write(ip)
    ng.write('\r\n')
ng.close()
I realize my code is very roundabout, so I would appreciate some help improving it.
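The nested iterrows() loops above compare every p123 row with every n123 row. One possible way to cut that down, sketched here rather than benchmarked, is to sort the p123 positions once per Number and run a binary search for each n123 row. The function name drop_near_positions, the window parameter, and the (number, position, peptide) tuple layout are assumptions made for illustration. The window is treated as inclusive on both ends (the range(...) call in the script above stops one short of +50), and rows are filtered one by one rather than by peptide string.

```python
import bisect
from collections import defaultdict


def drop_near_positions(n_rows, p_rows, window=50):
    """Keep the rows of n_rows whose position is not within `window`
    of any p_rows position under the same number.

    Rows are (number, position, peptide) tuples with integer positions.
    """
    # Group the p123 positions by Number and sort each group once.
    p_positions = defaultdict(list)
    for number, position, _peptide in p_rows:
        p_positions[number].append(position)
    for positions in p_positions.values():
        positions.sort()

    kept = []
    for number, position, peptide in n_rows:
        positions = p_positions.get(number, [])
        # Index of the first p123 position >= position - window.
        i = bisect.bisect_left(positions, position - window)
        # The row is "near" if that position is also <= position + window.
        near = i < len(positions) and positions[i] <= position + window
        if not near:
            kept.append((number, position, peptide))
    return kept


# Small demo with rows taken from the example in the question.
p_rows = [(1, 62, "dgwgkvttfpsva")]
n_rows = [
    (1, 4, "None"),
    (1, 7, "mlrtrltncslwr"),
    (1, 17, "lwrpyytsslsrv"),
    (1, 62, "dgwgkvttfpsva"),
    (1, 90, "lsylpitgskefq"),
    (1, 126, "risfvqtlsgtga"),
]
for row in drop_near_positions(n_rows, p_rows):
    print("%s,%s,%s" % row)
```

Each n123 row then costs one binary search instead of a full pass over p123. With pandas, the tuples could come from something like dfn.itertuples(index=False, name=None), assuming the columns are in the order Number, Position, Peptide.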
Answer 0 (score: 0)
Use a set as an intermediate structure that stores the Number/Position pairs, and query it. Example (not complete, just a starting point):
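...
seen = set()
for p2 in p1:  # p123 is the subset, so index its rows
    seen.add((p2[0], p2[1]))  # key on Number and Position; ignore peptide column
for n2 in n1:
    if (n2[0], n2[1]) not in seen:  # n123 row that does not appear in p123
        print("%s,%s,%s" % tuple(n2))  # csv rows are lists, so convert for %-formatting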
Side note: I did not handle the header row; you should take care of that yourself. Using DictReader and DictWriter may be a better idea.
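To make the DictReader/DictWriter suggestion concrete, here is a minimal sketch of the same set lookup with the header handled by the csv module. The function name rows_not_in_p is hypothetical, and the in-memory io.StringIO files only stand in for the real n123.csv and p123.csv; the column names are taken from the header shown in the question.

```python
import csv
import io


def rows_not_in_p(n_file, p_file, out_file):
    """Write the rows of n_file whose (Number, Position) is absent from p_file."""
    # Index the smaller file by (Number, Position); values stay as strings.
    seen = {(row["Number"], row["Position"]) for row in csv.DictReader(p_file)}

    reader = csv.DictReader(n_file)
    # Reuse the input header so the output has the same columns.
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if (row["Number"], row["Position"]) not in seen:
            writer.writerow(row)


# Demo on a few rows from the question, using in-memory files.
p_csv = io.StringIO("Number,Position,Peptide\n1,62,dgwgkvttfpsva\n")
n_csv = io.StringIO(
    "Number,Position,Peptide\n"
    "1,50,vnkidltvgiykd\n"
    "1,62,dgwgkvttfpsva\n"
    "1,63,gwgkvttfpsvak\n"
)
out = io.StringIO()
rows_not_in_p(n_csv, p_csv, out)
print(out.getvalue())
```

With real files on Python 3, the csv module documentation recommends opening them with newline="", for example open("n123.csv", newline="").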