Question

I have a very large file (1.5 billion lines) in the following format:

1    67108547    67109226    gene1$transcript1    0    +    1    0
1    67108547    67109226    gene1$transcript1    0    +    2    1
1    67108547    67109226    gene1$transcript1    0    +    3    3
1    67108547    67109226    gene1$transcript1    0    +    4    4 

                                 .
                                 .
                                 .

1    33547109    33557650    gene2$transcript1    0    +    239    2
1    33547109    33557650    gene2$transcript1    0    +    240    0

                                 .
                                 .
                                 .

1    69109226    69109999    gene1$transcript1    0    +    351    1
1    69109226    69109999    gene1$transcript1    0    +    352    0

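What I want to do is reorganize/sort this file by the identifier in column 4. The file consists of blocks. Concatenating columns 4, 1, 2 and 3 gives a unique identifier for each block. This identifier is the key of the dictionary all_exons, whose value is a numpy array containing all values of column 8 for that block. I then have a second dictionary, unique_identifiers, whose keys are the attribute from column 4 (the part before the $) and whose values are the corresponding block identifiers. As output, I write a file in the following form: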

>gene1
0
1
3
4
1
0
>gene2
2
0

I have already written some code (see below), but my implementation is very slow. It takes around 18 hours to run.

import os
import sys
import time
from contextlib import contextmanager
import pandas as pd
import numpy as np


def parse_blocks(bedtools_file):
    unique_identifiers = {} # Dictionary with key: gene, value: set of exon identifiers
    all_exons = {} # Dictionary containing all exons

    # Parse file and ...
    with open(bedtools_file) as fp:
        for line in fp:
            sp_line = line.strip().split("\t")
            current_id = sp_line[3].split("$")[0]

            # Block identifier: columns 4, 1, 2 and 3 joined by "$"
            identifier = "$".join([sp_line[3], sp_line[0], sp_line[1], sp_line[2]])
            if identifier in all_exons:
                item = float(sp_line[7])
                all_exons[identifier] = np.append(all_exons[identifier], item)
            else:
                all_exons[identifier] = np.array([sp_line[7]], float)

            if current_id in unique_identifiers:
                unique_identifiers[current_id].add(identifier)
            else:
                unique_identifiers[current_id] = set([identifier])
    return unique_identifiers, all_exons

identifiers, introns = parse_blocks(options.bed)

w = open(options.out, 'w')
for gene in sorted(list(identifiers)):
    w.write(">"+str(gene)+"\n")
    for intron in sorted(list(identifiers[gene])):
        for base in introns[intron]:
            w.write(str(base)+"\n")
w.close()

How can I improve the above code so that it runs faster?

Answer 1

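You also import pandas, so I am providing a pandas solution that needs basically only two lines of code. However, I do not know how it performs on large data sets or whether it is faster than your approach (though I am fairly sure it is).

In the example below, the data you provided is stored in table.txt. I then use groupby to collect all values in the 8th column, store them in a list for the corresponding identifier in the 4th column (note that my indices start at 0), and convert this data structure into a dictionary that can then be printed easily.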

import pandas as pd

df=pd.read_csv("table.txt", header=None, sep = r"\s+") # replace the separator by e.g. '\t'

op = dict(df.groupby(3)[7].apply(lambda x: x.tolist()))

So in this case op looks like this:

{'gene1$transcript1': [0, 1, 3, 4, 1, 0], 'gene2$transcript1': [2, 0]}

Now you can print the output like this and redirect it into a file:

for k, v in op.items():
    print(k.split('$')[0])
    for val in v:
        print(val)

This gives you the desired output:

gene1
0
1
3
4
1
0
gene2
2
0

Maybe you can give it a try and let me know how it compares to your solution!?
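For a file with 1.5 billion lines, reading everything into one DataFrame may not fit into memory. A minimal sketch of a chunked variant follows, assuming the table is whitespace-separated as in the example. The function names collect_by_gene and write_genes and the chunksize value are only illustrative and not part of the original answer. It groups by the gene name (the part of column 4 before the $) and writes the >gene format from the question.

```python
from collections import defaultdict

import pandas as pd


def collect_by_gene(source, chunksize=1_000_000):
    """Collect the column-8 values per gene, reading the table in chunks."""
    values = defaultdict(list)
    # Only columns 4 and 8 (labels 3 and 7) are parsed
    reader = pd.read_csv(source, header=None, sep=r"\s+",
                         usecols=[3, 7], chunksize=chunksize)
    for chunk in reader:
        # Gene name is the part of column 4 before the first '$'
        genes = chunk[3].map(lambda s: s.split("$", 1)[0])
        for gene, group in chunk[7].groupby(genes, sort=False):
            values[gene].extend(group.tolist())
    return values


def write_genes(values, out):
    """Write one '>gene' header per gene, followed by one value per line."""
    for gene in sorted(values):
        out.write(">%s\n" % gene)
        out.writelines("%s\n" % v for v in values[gene])
```

Note that this keeps the values of each gene in file order, whereas the code in the question sorts the blocks of a gene by their identifier string. The per-gene lists still hold every value in memory, so this only limits the parsing overhead, not the total size of the result.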

EDIT2:

In the comments you mentioned that you would like to print the genes in the correct order. You can do that as follows:

# add some fake genes to op from above
op['gene0$stuff'] = [7,9]       
op['gene4$stuff'] = [5,9]

# print using 'sorted'
for k, v in sorted(op.items()):
    print(k.split('$')[0])
    for val in v:
        print(val)

This gives you:

gene0
7
9
gene1
0
1
3
4
1
0
gene2
2
0
gene4
5
9

EDIT1:

I am not sure whether the duplicates are intended, but you can easily get rid of them by doing the following:

op2 = dict(df.groupby(3)[7].apply(lambda x: set(x)))

Now op2 looks like this:

{'gene1$transcript1': {0, 1, 3, 4}, 'gene2$transcript1': {0, 2}}

You can print the output as before:

for k, v in op2.items():
    print(k.split('$')[0])
    for val in v:
        print(val)

This gives you:

gene1
0
1
3
4
gene2
0
2
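A set does not keep the order in which the values appeared (gene2 above prints 0 before 2, although the file has 2 first). If the original order matters, a possible sketch is to use the unique() method of the grouped column, which returns the distinct values of each group in order of first appearance. The helper name unique_values_in_order is only illustrative.

```python
import pandas as pd


def unique_values_in_order(df):
    """Distinct column-8 values per column-4 identifier, in first-seen order."""
    # SeriesGroupBy.unique() yields one array of distinct values per group
    per_group = df.groupby(3)[7].unique()
    return {key: vals.tolist() for key, vals in per_group.items()}
```

For the example table this should give [0, 1, 3, 4] for gene1$transcript1 and [2, 0] for gene2$transcript1.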

Answer 2

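I will try to simplify your problem. My solution works like this:

First, scan the big file. For every distinct current_id, open a temporary file and append the values of column 8 to that file.
After the full scan, concatenate all chunks into a result file.

Here is the code: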

# -*- coding: utf-8 -*-
import os
import tempfile
import subprocess


class ChunkBoss(object):
    """Boss for file chunks"""

    def __init__(self):
        self.opened_files = {}

    def write_chunk(self, current_id, value):
        if current_id not in self.opened_files:
            # NamedTemporaryFile replaces the deprecated, race-prone mktemp();
            # text mode so that str can be written under Python 3
            self.opened_files[current_id] = tempfile.NamedTemporaryFile(
                mode='w', delete=False)
            self.opened_files[current_id].write('>%s\n' % current_id)

        self.opened_files[current_id].write('%s\n' % value)

    def cat_result(self, filename):
        """Catenate chunks to one big file
        """
        # Flush chunks so that cat sees everything written so far
        for chunk in self.opened_files.values():
            chunk.flush()

        # Sort the chunks
        chunk_file_list = [self.opened_files[current_id].name
                           for current_id in sorted(self.opened_files)]

        # By calling cat command
        with open(filename, 'wb') as fp:
            subprocess.call(['cat'] + chunk_file_list, stdout=fp, stderr=fp)

    def clean_up(self):
        # Close every chunk before deleting it
        for chunk in self.opened_files.values():
            chunk.close()
            os.unlink(chunk.name)


def main():
    boss = ChunkBoss()
    with open('bigfile.data') as fp:
        for line in fp:
            data = line.strip().split()
            current_id = data[3].split("$")[0]
            value = data[7]

            # Write value to temp chunk
            boss.write_chunk(current_id, value)

    boss.cat_result('result.txt')
    boss.clean_up()

if __name__ == '__main__':
    main()

I tested the performance of my script with a bigfile.data containing about 150k lines. It took about 0.5 seconds on my laptop. Maybe you can give it a try.

Improving Python code with respect to speed
