Parsing Sequencing Output - Python

Posted: 2014-03-14 20:59:04

Tags: python parsing python-2.7 bioinformatics phylogeny

I got this data from sequencing a bacterial community. I know some basic Python and am working through the Codecademy tutorials. For practical purposes, just treat OTU as another word for "species".

Here is a sample of the raw data:

OTU ID   OTU Sum Lineage
591820   1083    k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752   517     k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456   346     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248   330     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284   321     k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__

The data consists of three things: a reference number for the species, the count of that species in the sample, and the taxonomy of that species.

What I'm trying to do is add up all the counts for each taxonomic family (designated in the data as f__x).

Here is an example of the desired output:

f__Fusobacteriaceae 1600
f__Alcaligenaceae  676
f__Comamonadaceae  321

This isn't for a class. I started learning Python a few months ago, so I'm at least able to look up whatever you suggest. I know how to do this the slow way (copying and pasting in Excel), so this is for future reference.

4 answers:

Answer 0 (score: 1):

If the lines in the file really look like that, you could do:

from collections import defaultdict
import re
nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items =  line.split(None, 2)  # Split twice on any whitespace
        if items[0].isdigit():  # skip the header line, whose first field is not a number
            key = re.search(r"f__\w+", items[2]).group(0)  # extract the f__Family token
            nums[key] += int(items[1])  # add this OTU's count to the family total

Result:

>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600, 
'f__Alcaligenaceae': 676})
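
If you then want the totals in the format shown in the question, a short follow-up loop over the resulting dictionary would do it (a minimal sketch, assuming nums was built as above):

for family, total in nums.items():
    print("%s %d" % (family, total))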

Answer 1 (score: 1):

Another solution, using collections.Counter:

from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()

        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)

            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]

            # Extract family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]

            # Increase count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print "%s %s" % (family, count)

For details on the None argument (which makes split treat runs of consecutive whitespace as a single separator), see the documentation for str.split().
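
As a quick interpreter illustration of that behaviour, using a shortened line from the sample data:

>>> "591820   1083    k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae".split(None, 2)
['591820', '1083', 'k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae']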

Answer 2 (score: 0):

Take all the raw data and process it first; by that I mean build a structure from it, then use the structured data to do whatever you want. If you have gigabytes of data, you could use Elasticsearch: feed in your raw data, query for the string f__*, fetch all the matching entries, and add them up.
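
For files the size shown in the question, a plain-Python sketch of that structure-first idea might look like the following (the filename data.txt and the f__ regular expression are assumptions based on the sample data):

import re

# Step 1: structure the raw rows as a list of dictionaries
records = []
with open('data.txt') as f:
    next(f)  # skip the header line
    for line in f:
        if not line.strip():
            continue
        otu_id, otu_sum, lineage = line.split(None, 2)
        records.append({'id': otu_id, 'sum': int(otu_sum), 'lineage': lineage})

# Step 2: "query" the structured records for family names and add up the sums
totals = {}
for rec in records:
    match = re.search(r'f__\w+', rec['lineage'])
    if match:
        family = match.group(0)
        totals[family] = totals.get(family, 0) + rec['sum']

print(totals)

Elasticsearch would serve the same purpose at scale: index the structured records, then aggregate on the family field.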

Answer 3 (score: 0):

This is very doable with basic Python. Keep the Library Reference under your pillow, because you'll be referring to it often.

You'll probably end up doing something like this (I'll write it the longer, more readable way; there are ways to compress the code and make it faster):

# Open up a file handle
file_handle = open('myfile.txt')
# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()

    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()

    # Get the first column
    reference_number = line_parts[0]

    # Get the second column, convert it to an integer
    sum = int(line_parts[1])

    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue
        # remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')
        if taxonomy in sums:
            sums[taxonomy] += sum
        else:
            sums[taxonomy] = sum

# All done, do some fancy reporting.  We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))