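Solved
I need to be able to count the number of 'N' characters across all strings in an output file. But when I print the result, I always get 0 or None. Does anyone see the error in my code?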
def ncount(filename):
    count = 0
    with open(filename, 'r') as file:
        for words in file:
            if words in file == "N":
                count = count + 1
    return count
count = ncount("output_seq.txt")
print(count)
The output file looks like this:
["GATTTTCTATGACATCTAGAAGAAAAAGAAAGACTATAAGATGTATAAAAACAAGAGGNNCNGAGAAAATCGAGACAGGTGGTGAGAATCTGCCGAATTAN",
"AACATTGCTGAGAGGTTCGATCGTGATCCCTGCAAGAAAAAATAAAGGTGGAGATGATNNCNCAATGTATGTTGTCTCGTCACACTGGTTTAATGATTTTN",
"CTTTTTTTTAAATATTTCGGGCGGTAATTTTTTCTGCCATCTTTTTCACTAAGAAAACTTTCAGGCGTTGTTAAGCGGTGGAATCTATAGAGCTGTCTCTT",
"ATGTATCTAACGAGACAGCAATGGGAATTTTGTATTAAAAAAAAGAAGAAATACATATTTTGAAACAGGAATGTTGTTTGATTTTTAAAGAAAAAAGGAAA",
"TCCAGACGCAAAANNNNNNNNTTTTTGTCTCAAGACTACAGTACCCTGGGTCTCGCCACGAAAATTGTTTGTTAAATGAGAAAATGTGTGCGCCTTTAAAG",
""]
This is a dummy file containing only 5 sequences. The real file contains thousands of these strings.
The output I keep getting is:
0
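Answer 0 (score: 2)
Use file.readlines() to iterate over the file line by line, with each line as a string. Then you can simply use the count() method to count the occurrences of a term in each string:
def ncount(filename):
    count = 0
    with open(filename, 'r') as myfile:
        for line in myfile.readlines():
            # str.count returns the number of non-overlapping occurrences
            count += line.count('N')
    return count
count = ncount("somefile.txt")
print(count)
For your "output_seq.txt" file, this prints the total number of 'N' characters across all lines.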
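Since the real file holds thousands of strings, the same total can also be computed without loading every line into a list first, because a file object yields its lines one at a time. A minimal sketch, assuming a plain-text file; the name count_char and its char parameter are illustrative and not part of the answer above:

```python
def count_char(filename, char="N"):
    """Return the total number of occurrences of `char` in the file."""
    with open(filename, "r") as handle:
        # Iterating over the handle reads one line at a time,
        # so memory use stays small even for large files.
        return sum(line.count(char) for line in handle)
```

Calling count_char("output_seq.txt") should give the same total as ncount, and the optional second argument allows counting a different character.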
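Answer 1 (score: 0)
If you don't necessarily need a Python function/module and are just looking for an ad hoc solution to get the number of 'N' occurrences per line, you can use awk directly in a Unix terminal:
cat your_file_name | awk '{print gsub(/N/,"")}'
This prints one number per line of the file: the count of 'N' occurrences on line 1, 2, 3, and so on.
Edit: To run this bash command from Python, you can use the subprocess module:
import subprocess
input_file = 'my-input-file'
# build the shell pipeline: cat <file> | awk '{print gsub(/N/,"")}'
cmd = "cat " + input_file + " | awk '{print gsub(/N/,\"\")}'"
print(cmd)
# Unix cmd call
p = subprocess.Popen(cmd, shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
# read STDOUT and STDERR
stdout, _ = p.communicate()
stdout = str(stdout, "utf-8")
# convert stdout string to a list of integers (with num of 'N' occurrences per line)
n_count = [int(i) for i in stdout.split('\n')[:-1]]
print(n_count)
You don't even need to store the output of the bash command in an output file. You can read it as a string (stdout), which can then be split into a list of integers (n_count).
However, since you want to implement this in Python, I would recommend using native Python functions rather than embedding this ad hoc solution via subprocess.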
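A native-Python counterpart of the awk one-liner might look like the following sketch. The function name n_count_per_line is hypothetical; it returns one count per line, in file order, without starting a shell:

```python
def n_count_per_line(filename, char="N"):
    """Return a list with the number of `char` occurrences on each line."""
    with open(filename, "r") as handle:
        # One list entry per line, mirroring awk's one number per line.
        return [line.count(char) for line in handle]
```

Summing the returned list gives the overall total, so the same helper covers both the per-line and the whole-file count.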
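Answer 2 (score: 0)
This code is less efficient than glhr's, but it may help you understand what is happening. It prints every character (even quotes or whitespace) and appends "found an N" whenever it finds one: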
def ncount(filename):
    count = 0
    with open(filename, 'r') as input_file:
        for words in input_file:
            for letter in words:
                print(letter)
                if letter == "N":
                    print('%s found an N' % letter)
                    count = count + 1
                else:
                    print(letter)
    return count
count = ncount("output_seq.txt")
print(count)
Partial output:
A
G
G
A
A
G
G
G
G
N
N found an N
N
N found an N
C
C
N
N found an N
G
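Answer 3 (score: 0)
I was finally able to get the script working. Thanks to everyone who gave me hints and helped with my question.
This code was only a small part of a larger script, and I thought it worked fine. But a line of code at the end was somehow interfering with this piece. I used @glhr's code and changed the rest of the script to make it work.
Here is the rest of my script.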
# import the Biopython package
from Bio import SeqIO
# parse the FASTQ file record by record
with open("output_rec_qual.txt", "w") as f:
    for record in SeqIO.parse("dummy.txt", "fastq"):
        # write the id and the corresponding quality scores to a separate file
        f.write(record.format("qual"))
# read the file again, this time writing the sequences to another output file for the N count
def readfastq(filename):
    # open the output file once; the with block closes both files
    with open(filename) as file, open("output_seq.txt", 'a') as out:
        while True:
            file.readline()                 # header line
            seq = file.readline().rstrip()  # sequence line
            file.readline()                 # separator line
            file.readline()                 # quality line
            if len(seq) == 0:
                break
            out.write(seq)
    return seq
seq = readfastq("dummy.txt")
# n-count
def ncount(filename):
    count = 0
    with open(filename, 'r') as file:
        for line in file.readlines():
            count += line.count('N')
    return count
count = ncount("output_seq.txt")
print(count)
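The intermediate output_seq.txt file is not strictly required for the count. As a rough sketch, assuming every FASTQ record occupies exactly four lines (the same assumption readfastq makes), the sequence lines can be counted directly from the input. The name ncount_fastq is hypothetical:

```python
from itertools import islice


def ncount_fastq(filename, char="N"):
    """Count `char` in the sequence lines of a 4-line-per-record FASTQ file."""
    with open(filename) as handle:
        # The sequence is the second line of each record, so start at
        # line index 1 and take every fourth line from there.
        sequence_lines = islice(handle, 1, None, 4)
        return sum(line.count(char) for line in sequence_lines)
```

Because only the sequence lines are inspected, an 'N' that appears in a header or quality line is not counted.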