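I have a large file containing sequences; I only want to analyze the last group of characters on each line, which happens to be of variable length. For every line in the text file, I want to take the first and the last character of that group and count the total occurrences of those characters.
Here is a sample of the data in the file: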
-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV
-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV
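I want to take the first and the last character of the sequence after "H3" (for example, P and V in the first line). The output for these two lines should be:
first Counter({'N': 1, 'P': 1})
last Counter({'V': 2})
This is what I have done so far: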
f = open("C:/CDRH3.txt", "r")
from collections import Counter
grab = 1
for line in f:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        AminoAcidsFirst = right[:grab]
        AminoAcidsLast = right[-grab:]
print ("first ",Counter(line[:] for line in AminoAcidsFirst))
print ("last ",Counter(line[:] for line in AminoAcidsLast))
f.close()
This prints the counts for the last line of the data only, like this:
first Counter({'N': 1})
last Counter({'V': 1})
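How can I count all of these characters across every line in the file? Note: printing AminoAcidsFirst or AminoAcidsLast gives the desired list for all lines, vertically (one per line), but I cannot count them or write them to a file. Writing to a new file only writes the characters from the last line of the original file. Thanks!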
Answer 0 (score: 2)
There is no need for Counter: just grab the last token after split and count its first and last characters:
first_counter = {}
last_counter = {}
with open("C:/CDRH3.txt") as f:
    for line in f:
        tokens = line.split()
        if not tokens:  # skip blank lines, which would make [-1] fail
            continue
        seq = tokens[-1]  # grab the last token
        first_counter[seq[0]] = first_counter.get(seq[0], 0) + 1
        last_counter[seq[-1]] = last_counter.get(seq[-1], 0) + 1
print("first ", first_counter)
print("last ", last_counter)
Output
first {'P': 1, 'N': 1}
last {'V': 2}
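For comparison, collections.Counter supports the same single-pass pattern, because a missing key counts as zero. A minimal sketch, assuming the question's two sample lines stand in for the file (the name `sample_lines` is made up for illustration):

```python
from collections import Counter

# Hypothetical stand-in for the file: the two sample lines from the question.
sample_lines = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV",
    "-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV",
]

first_counter = Counter()
last_counter = Counter()
for line in sample_lines:
    tokens = line.split()
    if not tokens:  # skip blank lines
        continue
    seq = tokens[-1]             # last whitespace-separated token
    first_counter[seq[0]] += 1   # a missing key starts at 0 in a Counter
    last_counter[seq[-1]] += 1

print("first", dict(first_counter))
print("last", dict(last_counter))
```

Whether a plain dict or a Counter is used is mostly a matter of taste here; the Counter version just removes the `.get(key, 0)` bookkeeping.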
Answer 1 (score: 0)
Create two empty lists and append to them on every iteration, like this:
from collections import Counter

f = open("C:/CDRH3.txt", "r")
grab = 1
AminoAcidsFirst = []
AminoAcidsLast = []
for line in f:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        AminoAcidsFirst.append(right[:grab])
        AminoAcidsLast.append(right[-grab:])
# Counter accepts the lists directly; count once, after the loop
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
f.close()
In detail:
Create the empty lists:
AminoAcidsFirst = []
AminoAcidsLast = []
Append on every iteration:
AminoAcidsFirst.append(right[:grab])
AminoAcidsLast.append(right[-grab:])
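Answer 2 (score: 0)
Two important things I want to point out:
Never expose real file paths from your computer; this applies especially if you come from the scientific community.
Use the with ... as approach.
Now the program: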
from collections import Counter

filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], []  # important! these should be lists
with open(filePath, 'rt') as f:  # rt not r. Explicit is better than implicit
    for line in f:
        line = line.rstrip()
        left, sep, right = line.partition(" H3 ")
        if sep:
            AminoAcidsFirst.append( right[0] )  # really no need of extra grab=1 variable
            AminoAcidsLast.append( right[-1] )  # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
Do not do line.strip()[-1], because the sep check is very important.
Output
first Counter({'P': 1, 'N': 1})
last Counter({'V': 2})
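To illustrate why the sep check matters, here is a small sketch comparing the two approaches on input that contains a line without " H3 ". The second line is invented for illustration and is not from the question's file:

```python
from collections import Counter

# Hypothetical input: the second line is a made-up H2 record that should be ignored.
lines = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV",
    "-1iqd_BA_0_CDRH2.pdb kabat H2 ABCDE",
]

checked = Counter()    # counts only lines that contain " H3 "
unchecked = Counter()  # counts the last character of every non-empty line

for line in lines:
    line = line.rstrip()
    if not line:
        continue
    unchecked[line[-1]] += 1
    left, sep, right = line.partition(" H3 ")
    if sep:  # sep is empty when " H3 " was not found
        checked[right[-1]] += 1

print("with sep check:   ", dict(checked))
print("without sep check:", dict(unchecked))
```

Without the check, the made-up H2 line contributes an extra 'E' to the tally; with it, only the H3 line is counted.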
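Note: the data file may be very large, and you may run into memory problems or a hung computer. So may I suggest lazy reading in chunks? The following is a more robust program: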
from collections import Counter

filePath = "C:/CDRH3.txt"
AminoAcidsFirst, AminoAcidsLast = [], []  # important! these should be lists

def chunk_read(fileObj, sizeHint = 100):
    # readlines(hint) returns whole lines totalling roughly `hint` bytes/characters
    # (not a line count), so keep calling it until the file is exhausted
    while True:
        lines = fileObj.readlines(sizeHint)
        if not lines:
            break
        yield lines

with open(filePath, 'rt') as f:  # rt not r. Explicit is better than implicit
    for aChunk in chunk_read(f):
        for line in aChunk:
            line = line.rstrip()
            left, sep, right = line.partition(" H3 ")
            if sep:
                AminoAcidsFirst.append( right[0] )  # really no need of extra grab=1 variable
                AminoAcidsLast.append( right[-1] )  # better than right[-grab:]
print ("first ",Counter(AminoAcidsFirst))
print ("last ",Counter(AminoAcidsLast))
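If memory is the main concern, another option is to skip the lists and update the counters line by line, so that only the counts are kept. A sketch under that assumption: `count_ends` is a hypothetical helper name, and it accepts any iterable of lines so it can be tried without the real file.

```python
from collections import Counter

def count_ends(lines, marker=" H3 "):
    """Return (first, last) Counters for the text that follows `marker`."""
    first, last = Counter(), Counter()
    for line in lines:
        left, sep, right = line.rstrip().partition(marker)
        if sep and right:  # marker found and something follows it
            first[right[0]] += 1
            last[right[-1]] += 1
    return first, last

# Demo on the question's two sample lines. An open file object can be passed
# the same way, since iterating over a file yields one line at a time.
sample = [
    "-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV\n",
    "-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV\n",
]
first, last = count_ends(sample)
print("first", first)
print("last", last)
```

With a real file this would be called as `with open(filePath) as f: first, last = count_ends(f)`.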
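Answer 3 (score: 0)
If you put a statement at the bottom of the for loop, or after it, to print AminoAcidsFirst and AminoAcidsLast, you will see that on each iteration you are just assigning a new value. Your intent should be to collect, contain, or accumulate those values before feeding them to collections.Counter.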
s = ['-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV', '-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV']
An immediate fix to your code is to accumulate the characters:
import collections

grab = 1
AminoAcidsFirst = ''
AminoAcidsLast = ''
for line in s:
    line=line.rstrip()
    left,sep,right=line.partition(" H3 ")
    if sep:
        # += appends to the string instead of overwriting it
        AminoAcidsFirst += right[:grab]
        AminoAcidsLast += right[-grab:]
print ("first ",collections.Counter(AminoAcidsFirst))
print ("last ",collections.Counter(AminoAcidsLast))
Another approach is to produce the characters on demand. Define a generator function that yields the things you want to count:
def f(iterable):
    for thing in iterable:
        left, sep, right = thing.partition(' H3 ')
        if sep:
            yield right[0]
            yield right[-1]
Then feed it to collections.Counter:
z = collections.Counter(f(s))
Or use a file as the data source:
with open('myfile.txt') as f1:
    # lines is a generator expression
    # that produces stripped lines
    lines = (line.strip() for line in f1)
    z = collections.Counter(f(lines))
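Note that the generator f above feeds first and last characters into a single Counter. If separate tallies are wanted, as in the question, one variation is to yield (first, last) pairs and split them with zip. A sketch, where `ends` and `pairs` are illustrative names and the list s stands in for the file:

```python
import collections

s = ['-1iqd_BA_0_CDRH3.pdb kabat H3 PDPDAFDV',
     '-1iqw_HL_0_CDRH3.pdb kabat H3 NRDYSNNWYFDV']

def ends(iterable):
    """Yield a (first, last) character pair for each line containing ' H3 '."""
    for thing in iterable:
        left, sep, right = thing.rstrip().partition(' H3 ')
        if sep and right:
            yield right[0], right[-1]

pairs = list(ends(s))
# zip(*pairs) transposes the pairs into a tuple of firsts and a tuple of lasts;
# the guard avoids unpacking an empty zip when no line matched.
firsts, lasts = zip(*pairs) if pairs else ((), ())
first_counts = collections.Counter(firsts)
last_counts = collections.Counter(lasts)
print("first", first_counts)
print("last", last_counts)
```

Collecting the pairs in a list keeps the example short; for a very large file, updating two counters inside a loop over `ends(...)` would avoid holding all pairs in memory.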