How do I count the letter N in an output file that is a list of strings written to a txt file?

Time: 2019-04-20 17:52:12

Tags: python python-3.x

Solved

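I need to count the number of 'N' characters across all the strings in an output file. However, when I print the result, I always get 0 or None. Can anyone see the error in my code?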

def ncount(filename):
    count = 0
    with open(filename, 'r') as file:
        for words in file:
            if words in file == "N":
                count = count + 1
                return count
count = ncount("output_seq.txt")
print(count)

The output file looks like this:

["GATTTTCTATGACATCTAGAAGAAAAAGAAAGACTATAAGATGTATAAAAACAAGAGGNNCNGAGAAAATCGAGACAGGTGGTGAGAATCTGCCGAATTAN", 
"AACATTGCTGAGAGGTTCGATCGTGATCCCTGCAAGAAAAAATAAAGGTGGAGATGATNNCNCAATGTATGTTGTCTCGTCACACTGGTTTAATGATTTTN", 
"CTTTTTTTTAAATATTTCGGGCGGTAATTTTTTCTGCCATCTTTTTCACTAAGAAAACTTTCAGGCGTTGTTAAGCGGTGGAATCTATAGAGCTGTCTCTT", 
"ATGTATCTAACGAGACAGCAATGGGAATTTTGTATTAAAAAAAAGAAGAAATACATATTTTGAAACAGGAATGTTGTTTGATTTTTAAAGAAAAAAGGAAA", 
"TCCAGACGCAAAANNNNNNNNTTTTTGTCTCAAGACTACAGTACCCTGGGTCTCGCCACGAAAATTGTTTGTTAAATGAGAAAATGTGTGCGCCTTTAAAG", 
""]

This is a dummy file containing only 5 sequences. The real file contains thousands of such strings.

The output I keep getting is:

0

4 answers:

Answer 0 (score: 2)

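Use file.readlines() to iterate over the file line by line, with each line as a string. You can then simply use the count() method to count the occurrences of a term in each string:

def ncount(filename):
    count = 0
    with open(filename, 'r') as myfile:
        for line in myfile.readlines():
            count += line.count('N')
    return count

count = ncount("somefile.txt")
print(count)

Called on your "output_seq.txt" file, this prints the total number of 'N' characters in the file.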
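To see why the version in the question never counts anything: the expression words in file == "N" is a chained comparison, so Python evaluates it as (words in file) and (file == "N"), which cannot be true because a file object never equals a string. On top of that, the only return sits inside the if, so the function falls off the end and returns None. The sketch below contrasts the two behaviours on a small temporary file; the names ncount_broken and ncount_fixed and the sample data are illustrative assumptions, not part of the original post.

```python
import os
import tempfile


def ncount_broken(filename):
    # Same logic as the code in the question.
    count = 0
    with open(filename, 'r') as file:
        for words in file:
            # Chained comparison: (words in file) and (file == "N").
            if words in file == "N":
                count = count + 1
                return count
    # No return is reached here, so the function returns None.


def ncount_fixed(filename):
    # Count 'N' inside every line and return after the loop has finished.
    count = 0
    with open(filename, 'r') as file:
        for line in file:
            count += line.count('N')
    return count


# Hypothetical sample data written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('"GANNC",\n"TTNA",\n"ACGT"\n')
    path = tmp.name

broken_result = ncount_broken(path)
fixed_result = ncount_fixed(path)
os.remove(path)

print(broken_result)
print(fixed_result)
```

Because str.count works on the raw text of each line, the surrounding quotes, commas, and brackets in the file do not affect the result.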

答案 1 :(得分:0)

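If you don't necessarily need a Python function or module and are just looking for an ad hoc way to get the number of 'N' occurrences on each line, you can use awk directly in a Unix terminal:

cat your_file_name | awk '{print gsub(/N/,"")}'

This prints one number per input line (for line 1, 2, 3, and so on): the number of 'N' occurrences on that line.

Edit: to run this bash command from Python, you can use the subprocess module:

import subprocess

input_file = 'my-input-file'
cmd = "cat " + input_file + " | awk '{print gsub(/N/," + '"")}' + "'"
print(cmd)
# Unix cmd call
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# read STDOUT and STDERR
stdout, _ = p.communicate()
stdout = str(stdout, "utf-8")
# convert stdout string to a list of integers (with num of 'N' occurrences per line)
n_count = [int(i) for i in stdout.split('\n')[:-1]]
print(n_count)

You don't even need to store the output of the bash command in an output file. You can read it as a string (stdout), which can then be split into a list of integers (n_count).

However, since you want to implement this in Python, I would recommend using native Python functions rather than embedding this ad hoc shell solution.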
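As a rough native-Python counterpart to the awk one-liner, a list comprehension over the file gives the same per-line counts without spawning a shell. This is only a sketch: the helper name n_per_line and the sample contents are assumptions made for illustration.

```python
import os
import tempfile


def n_per_line(filename):
    """Return a list with the number of 'N' characters on each line."""
    with open(filename, 'r') as handle:
        return [line.count('N') for line in handle]


# Hypothetical sample data written to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('GANNC\nTTNA\nACGT\n')
    path = tmp.name

n_count = n_per_line(path)
os.remove(path)

print(n_count)       # per-line counts
print(sum(n_count))  # total across the whole file
```

Summing the list gives the file-wide total, so one function covers both the per-line and the overall count.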

答案 2 :(得分:0)

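This code is less efficient than glhr's, but it may help you understand what is going on. It prints every character (even quotes or spaces) and appends "found an N" whenever it finds one: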

def ncount(filename):
    count = 0
    with open(filename, 'r') as input_file:
        for words in input_file:
            for letter in words:
                print(letter)
                if letter == "N":
                    print('%s found an N' % letter)
                    count = count + 1
                else:
                    print(letter)

        return count

count = ncount("output_seq.txt")
print(count)

Partial output:

A
G
G
A
A
G
G
G
G
N
N found an N
N
N found an N
C
C
N
N found an N
G
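Once the character-by-character logic is clear, the same tally can be made without printing each letter. A minimal sketch using collections.Counter from the standard library, which counts every distinct character in one pass; the function name letter_counts and the sample string are hypothetical.

```python
from collections import Counter


def letter_counts(text):
    """Count how often each character occurs in text."""
    return Counter(text)


sample = "AGGNNCNG"  # hypothetical fragment of a sequence
counts = letter_counts(sample)

print(counts['N'])  # occurrences of 'N'
print(counts['G'])  # occurrences of 'G'
```

A Counter returns 0 for characters that never occur, so looking up a missing letter does not raise an error.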

答案 3 :(得分:0)

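I eventually got the script working. Thanks to everyone who gave me tips and helped with my question.

This code is only a small part of a larger script, and I thought it worked fine. But a line of code at the end was somehow interfering with it. I used @glhr's code and changed the rest of the script to make it work.

Here is the rest of my script.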

# import the biopython packages
from Bio import SeqIO

# parse the fastq file into a list of records
with open("output_rec_qual.txt", "w") as f:
    for record in SeqIO.parse("dummy.txt", "fastq"):
        # write the id and the corresponding quality scores to a separate file
        f.write(record.format("qual"))
# no explicit f.close() is needed: the with block closes the file


# read the file again, this time parse the sequences into another output file. In order to perform the n-count
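def readfastq(filename):
    # open the output file once instead of re-opening it for every record
    with open(filename) as file, open("output_seq.txt", 'a') as out:
        while True:
            file.readline()                 # record header line
            seq = file.readline().rstrip()  # sequence line
            file.readline()                 # separator line
            file.readline()                 # quality line
            if len(seq) == 0:
                # an empty sequence means the end of the file was reached
                break
            out.write(seq)
    return seq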


seq = readfastq("dummy.txt")



# n-count
def ncount(filename):
    count = 0
    with open(filename, 'r') as file:
        for line in file.readlines():
            count += line.count('N')
    return count


count = ncount("output_seq.txt")
print(count)
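If the intermediate output_seq.txt file is not needed for anything else, the count can also be taken straight from the FASTQ file. The sketch below assumes strict 4-line records (header, sequence, separator, quality) with no wrapped sequences, which matches how readfastq reads the file; the function name count_n_in_fastq and the sample reads are made up for illustration.

```python
import os
import tempfile


def count_n_in_fastq(filename):
    """Count 'N' in the sequence line of each 4-line FASTQ record."""
    total = 0
    with open(filename) as handle:
        for index, line in enumerate(handle):
            # The second line of every 4-line record holds the sequence.
            if index % 4 == 1:
                total += line.count('N')
    return total


# Hypothetical two-record FASTQ file; the 'N' in the first quality
# string must not be counted.
fastq_text = (
    "@read1\nGANNC\n+\nIINII\n"
    "@read2\nTTNAN\n+\nIIIII\n"
)

with tempfile.NamedTemporaryFile('w', suffix='.fastq', delete=False) as tmp:
    tmp.write(fastq_text)
    path = tmp.name

total_n = count_n_in_fastq(path)
os.remove(path)

print(total_n)
```

Restricting the count to sequence lines matters because 'N' is also a valid quality character, so counting over the whole FASTQ file could overstate the result.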