Question

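Suppose I have a file with n DNA sequences, one per line. I need to turn them into a list, then calculate the length of each sequence, and then the total length of all of them together. I am not sure how to do that before they are in a list.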

# open file and writing each sequences' length
f= open('seq.txt' , 'r')
for line in f:
    line= line.strip()
    print (line)
    print ('this is the length of the given sequence', len(line))

# turning into a list:  
lines = [line.strip() for line in open('seq.txt')]
print (lines)

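How can I do math calculations on the list? For example, the total length of all the sequences together, the standard deviation of the different lengths, etc.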

Answer 1

Try this to print the individual lengths and calculate the total length:

    lines = [line.strip() for line in open('seq.txt')]
    total = 0
    for line in lines:
        print('this is the length of the given sequence: {}'.format(len(line)))
        total += len(line)
    print('this is the total length: {}'.format(total))

Answer 2

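Take a look at the statistics module. It provides all kinds of measures of averages and spread.

You get the length of any sequence with len.

In your case, you want to map the sequences to their lengths: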

from statistics import stdev

with open("seq.txt") as f:
    lengths = [len(line.strip()) for line in f]

print("Number of sequences:", len(lengths))
print("Standard deviation:", stdev(lengths))
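One detail worth guarding against: stdev() raises StatisticsError when it receives fewer than two values, and a blank line in the file would count as a zero-length sequence. The following is only a minimal sketch of one way to handle both cases; the helper name length_stdev and the sample data are made up for illustration.

```python
from statistics import StatisticsError, stdev


def length_stdev(lines):
    """Return the sample standard deviation of the line lengths, or None."""
    # Skip blank lines so they do not count as zero-length sequences.
    lengths = [len(line.strip()) for line in lines if line.strip()]
    try:
        return stdev(lengths)
    except StatisticsError:
        # stdev() needs at least two data points.
        return None


# Hypothetical sample data standing in for the lines of seq.txt.
print(length_stdev(["ACGT\n", "AC\n", "\n"]))
print(length_stdev(["ACGT\n"]))
```

Returning None is just one possible choice; raising a clearer error would work as well.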

Edit: since this was asked in the comments, here is how to cluster the sequences into different files depending on their actual length:

from statistics import stdev, mean
with open("seq.txt") as f:
    sequences = [line.strip() for line in f]
lengths = [len(sequence) for sequence in sequences]

mean_ = mean(lengths)
stdev_ = stdev(lengths)

with open("below.txt", "w") as below, open("above.txt", "w") as above, open("normal.txt", "w") as normal:
    for sequence in sequences:
        if len(sequence) > mean_ + stdev_:
            above.write(sequence + "\n")
        elif len(sequence) >= mean_ - stdev_:  # within one standard deviation of the mean
            normal.write(sequence + "\n")
        else:
            below.write(sequence + "\n")
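To check the thresholds without writing any files, the same grouping can be sketched as a function that returns the three groups in a dictionary. This is an illustrative variant, not the answer's original code; the name split_by_length, the group labels as dictionary keys, and the sample sequences are assumptions.

```python
from statistics import mean, stdev


def split_by_length(sequences):
    """Group sequences by whether their length lies within one stdev of the mean."""
    lengths = [len(sequence) for sequence in sequences]
    mean_ = mean(lengths)
    stdev_ = stdev(lengths)  # needs at least two sequences

    groups = {"below": [], "normal": [], "above": []}
    for sequence in sequences:
        if len(sequence) > mean_ + stdev_:
            groups["above"].append(sequence)
        elif len(sequence) < mean_ - stdev_:
            groups["below"].append(sequence)
        else:
            groups["normal"].append(sequence)
    return groups


# Hypothetical sample data with lengths 1, 4, 5, 4 and 10.
groups = split_by_length(["A", "AAAA", "AAAAA", "AAAA", "AAAAAAAAAA"])
for name, members in groups.items():
    print(name, members)
```

Each group could then be written to its own file in a separate step.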

Answer 3

The map and reduce functions can be useful for working with collections.

import operator
from functools import reduce  # reduce is not a builtin in Python 3

f= open('seq.txt' , 'r')
for line in f:
  line= line.strip()
  print (line)
  print ('this is the length of the given sequence', len(line))

# turning into a list:
lines = [line.strip() for line in open('seq.txt')]
print (lines)

print('The total length is', reduce(operator.add, map(len, lines)))
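For a plain total, the built-in sum gives the same result as reduce with operator.add. The sketch below compares the two on made-up data; passing 0 as the initial value to reduce is an addition here so that an empty list yields 0 instead of raising TypeError.

```python
import operator
from functools import reduce

# Hypothetical sample data standing in for the stripped lines of seq.txt.
lines = ["ACGT", "AC", "ACGTAC"]

# reduce folds the lengths together pairwise, starting from 0.
total_reduce = reduce(operator.add, map(len, lines), 0)

# sum does the same addition and reads more directly.
total_sum = sum(map(len, lines))

print('The total length is', total_reduce)
print('The total length is', total_sum)
```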

Answer 4

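Just a few remarks. Use with to handle files, so you don't have to worry about closing them after reading or writing, flushing, etc. Also, since you loop over the file once anyway, why not build the list at the same time? Then you don't need to go through it again.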

# open file and writing each sequences' length
with open('seq.txt', 'r') as f:
    sequences = []
    total_len = 0
    for line in f:
        new_seq = line.strip()
        sequences.append(new_seq)
        new_seq_len = len(new_seq)
        total_len += new_seq_len

print('number of sequences: {}'.format(len(sequences)))
print('total length: {}'.format(total_len))
print('biggest sequence: {}'.format(max(sequences, key=len)))
print('\t with length {}'.format(len(max(sequences, key=len))))
print('smallest sequence: {}'.format(min(sequences, key=len)))
print('\t with length {}'.format(len(min(sequences, key=len))))

I have included some post-processing output to give you an idea of how to go about it. If you have any questions, just ask.

Answer 5

You have already seen how to get the list of sequences; a list of lengths can be built with append.

    lines = [line.strip() for line in open('seq.txt')]
    total = 0
    sizes = []
    for line in lines:
       mysize = len(line)
       total += mysize
       sizes.append(mysize)

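Note that you can also use a for loop to read each line and append to both lists, instead of reading every line into a list and then looping over that list. It is a matter of which you prefer.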
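The single-pass variant could look roughly like this. The function name read_sequences is made up, and io.StringIO stands in for the open file so the sketch runs without seq.txt; with a real file you would pass the handle from with open('seq.txt') as f.

```python
import io


def read_sequences(handle):
    """Read one sequence per line, collecting sequences and lengths in one pass."""
    sequences = []
    sizes = []
    for line in handle:
        seq = line.strip()
        sequences.append(seq)
        sizes.append(len(seq))
    return sequences, sizes


# Hypothetical file contents; replace with an open file handle in practice.
sample = io.StringIO("ACGT\nAC\nACGTAC\n")
sequences, sizes = read_sequences(sample)
print(sequences)
print(sizes)
print(sum(sizes))
```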

You can use the statistics library (available since Python 3.4) to get statistics on the list of lengths.

statistics — Mathematical statistics functions

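    mean()            Arithmetic mean ("average") of data.
    median()          Median (middle value) of data.
    median_low()      Low median of data.
    median_high()     High median of data.
    median_grouped()  Median, or 50th percentile, of grouped data.
    mode()            Mode (most common value) of discrete data.
    pstdev()          Population standard deviation of data.
    pvariance()       Population variance of data.
    stdev()           Sample standard deviation of data.
    variance()        Sample variance of data.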
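As a rough illustration of how a few of these functions apply to a list of lengths, here is a small sketch. The helper name summarize and the sample lengths are made up; only functions from the statistics module listed above are used.

```python
import statistics


def summarize(sizes):
    """Collect a few statistics-module measures for a list of lengths."""
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "population stdev": statistics.pstdev(sizes),
        "sample stdev": statistics.stdev(sizes),
    }


sizes = [4, 2, 6, 4]  # made-up sequence lengths
for name, value in summarize(sizes).items():
    print("{}: {}".format(name, value))
```

Use stdev() when the sequences are a sample of a larger set, and pstdev() when they are the whole population.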

You can also use the answers at Standard deviation of a list.

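Note that one of the answers there actually shows the code that was added to Python 3.4 for the statistics module. If you are using an older version, you can use that code or get the statistics module code for your own system.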
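If neither option is convenient, the sample standard deviation is short enough to write by hand. The sketch below is a generic textbook formula, not the code from the linked answer or from the statistics module; the function name sample_stdev is made up.

```python
import math


def sample_stdev(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / float(n)  # float() keeps this correct on Python 2
    squared_diffs = sum((value - mean) ** 2 for value in values)
    return math.sqrt(squared_diffs / (n - 1))


print(sample_stdev([4, 2, 6, 4]))
```

The statistics module takes extra care with numeric precision, so prefer it whenever it is available.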

Answer 6

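This will do what you need. To do additional calculations, you may want to save the results from the text file into a list or set, so that you don't need to read from the file again.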

total_length = 0  # Create a variable that will save our total length of lines read

with open('filename.txt', 'r') as f:
    for line in f:
        line = line.strip()
        total_length += len(line)  # Add the length to our total
        print("Line Length: {}".format(len(line)))

print("Total Length: {}".format(total_length))

How can I calculate the length of each string in a list of strings with Python?
