Question

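I have code that performs operations on text files. However, the text files are very large, and by my current estimate the code will need 30 days to finish.

If multiprocessing is the only way, I have a server with 40 cores.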

Cell_line_final2.bed:

chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12333   132
chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12331   132
chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12332   132
chr1    869773  870132  MSPC_Peak_37508  74.0   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12333   132
...
...

tf_TPM2.bed:

CNhs12333   2228319     4.41    CTCF
CNhs12331   6419919     0.0     HES2
CNhs12332   6579994     0.78    ZBTB48
CNhs12333   8817465     0.0     RERE
...
...

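The desired output adds a column to "Cell_line_final2.bed" for the rows where the first and fourth columns of "tf_TPM2.bed" simultaneously match the tenth and eighth columns of "Cell_line_final2.bed". The added value is the third column of "tf_TPM2.bed" (j[2] in the code below).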

chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12333   132   4.41
chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  HES2   ENCSR000AKB CNhs12331   132   0.0
chr1    778704  778912  MSPC_Peak_37509  8.43   cell_line   GM12878  CTCF   ENCSR000AKB CNhs12332   132   0.78
chr1    869773  870132  MSPC_Peak_37508  74.0   cell_line   GM12878  RERE   ENCSR000AKB CNhs12333   132   0.0
...
...

My code so far:

def read_file(file):
    with open(file) as f:
        current = []
        for line in f: # read rest of lines
            current.append([x for x in line.split()])
    return(current)


inputfile = "/home/lside/Desktop/database_files/Cell_line_final2.bed" # 2.7GB text file
outpufile = "/home/lside/Desktop/database_files/Cell_line_final3.bed"

file_in = read_file("/home/lside/Desktop/tf_TPM2.csv") # 22.5MB text file
new_line = ""
with open(inputfile, 'r') as infile:
    with open(outpufile, 'w') as outfile:
        for line in infile:
            line = line.split("\t")
            for j in file_in:
                if j[0] == line[9] and j[3] == line[7]:
                    new_line = new_line + '{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}\t{9}\t{10}\t{11}\n'.format(line[0], line[1], line[2],line[3], line[4], line[5],line[6], line[7], line[8], line[9], line[10].rstrip(), j[2])
                    continue
        outfile.write(new_line)

Answer 1

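I agree with the comments saying that this should not take 30 days to run, so the bottleneck must be somewhere else. Probably the biggest offender is the huge string you are building, instead of just dumping each line to the file at every iteration (^).

Note

(^) The biggest offender is more likely the continue statement in the inner loop, because it always forces the code to compare the current line against every element of the lookup file instead of stopping at the first match. Replacing it with break should be the way to go.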
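To make the effect of continue versus break concrete, here is a small sketch on made-up data. The function name count_comparisons and the toy lists are hypothetical and are not part of the question's code; the sketch only counts how many comparisons the inner loop performs with and without stopping at the first match.

```python
def count_comparisons(rows, lookup, stop_at_first_match):
    """Count inner-loop comparisons for a nested-loop match."""
    comparisons = 0
    for sample, tf in rows:
        for l_sample, _tpm, l_tf in lookup:
            comparisons += 1
            if l_sample == sample and l_tf == tf:
                if stop_at_first_match:
                    break  # this row is done, move on to the next one
                # otherwise (the `continue` case) the scan simply carries on
    return comparisons


# Toy data (assumption): 1000 lookup entries, and every row matches the first one.
lookup = [("S%d" % i, "0.0", "TF%d" % i) for i in range(1000)]
rows = [("S0", "TF0")] * 50

full_scan = count_comparisons(rows, lookup, stop_at_first_match=False)
early_exit = count_comparisons(rows, lookup, stop_at_first_match=True)
print(full_scan, early_exit)
```

This toy case is the best case for break, because every match sits at the start of the list. Rows whose match sits later in the list save less, and rows without any match still scan the whole list.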

Here is what I would do; see how fast it performs:

def read_file(filename):
    with open(filename) as f:
        current = []
        for line in f:
            fields = line.split()  # split each line only once
            current.append((fields[0], fields[2], fields[3]))  # you only use these three elements
    return current


inputfile = "/home/lside/Desktop/database_files/Cell_line_final2.bed" # 2.7GB text file
outpufile = "/home/lside/Desktop/database_files/Cell_line_final3.bed"

file_in = read_file("/home/lside/Desktop/tf_TPM2.csv") # 22.5MB text file

with open(inputfile, 'r') as infile:
    with open(outpufile, 'w') as outfile:
        for line in infile:
            fields = line.split("\t")  # keep `line` as a string so it can be written back out
            for e0, e2, e3 in file_in:
                if e0 == fields[9] and e3 == fields[7]:
                    new_line = '{0}\t{1}\n'.format(line.rstrip(), e2)  # just append the column to the entire line
                    outfile.write(new_line)  # dump to file, don't linger around with an ever-growing string
                    break

Lookup table

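If we want to go even further, we can build a lookup table from file_in. The idea is that, instead of looping over every element extracted from file_in, we prepare a dictionary whose keys are built from j[0], j[3] (the fields you compare) and whose values are j[2]. That way the lookup is practically instantaneous and no loop is needed anymore.

The modified code using this logic replaces the inner for loop with a single dictionary lookup on the pair (line[9], line[7]).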
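A minimal sketch of such a lookup table follows. It is a reconstruction under stated assumptions and not necessarily the answerer's exact code: the helper names build_lookup and annotate are hypothetical, the BED file is assumed to be tab-delimited as in the question's code, and when a (sample, TF) pair occurs more than once the first occurrence wins, mirroring the break above.

```python
import io


def build_lookup(tpm_lines):
    """Map (sample_id, tf_name) -> TPM string from tf_TPM2-style lines."""
    lookup = {}
    for line in tpm_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # setdefault keeps the first occurrence of a key
        lookup.setdefault((fields[0], fields[3]), fields[2])
    return lookup


def annotate(bed_lines, lookup, out):
    """Write every matching BED line to `out` with the TPM value appended."""
    for line in bed_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        tpm = lookup.get((fields[9], fields[7]))  # one hash lookup, no inner loop
        if tpm is not None:
            out.write("{0}\t{1}\n".format(stripped, tpm))


# Tiny in-memory demo based on the sample rows from the question.
tpm_demo = [
    "CNhs12333\t2228319\t4.41\tCTCF\n",
    "CNhs12331\t6419919\t0.0\tHES2\n",
]
bed_demo = [
    "chr1\t778704\t778912\tMSPC_Peak_37509\t8.43\tcell_line\tGM12878\tCTCF\tENCSR000AKB\tCNhs12333\t132\n",
    "chr1\t778704\t778912\tMSPC_Peak_37509\t8.43\tcell_line\tGM12878\tCTCF\tENCSR000AKB\tCNhs12331\t132\n",
]
out = io.StringIO()
annotate(bed_demo, build_lookup(tpm_demo), out)
result = out.getvalue()
print(result, end="")
```

With the real files, the same two functions can be given open file objects instead of lists, so the 2.7GB input is streamed line by line and only the 22.5MB lookup file is held in memory.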

Answer 2

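I would like to propose an unconventional solution that uses SQL. First, create two tables to store your data together with the line numbers.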

import sqlite3

conn = sqlite3.connect(':memory:')  # you may consider file if short on RAM
c = conn.cursor()
c.execute('CREATE TABLE table1 (line INT, col1, col4);')
c.execute('CREATE TABLE table2 (line INT, col8, col10);')
conn.commit()

Then read the lines from the files and write the rows to the database.

for index, line in enumerate(open('tf_TPM2.csv')):
    tokens = line.split()
    # columns 1 and 4 of the TPM file
    c.execute('INSERT INTO table1 VALUES (?, ?, ?);', (index, tokens[0], tokens[3]))
conn.commit()
for index, line in enumerate(open('Cell_line_final2.bed')):
    tokens = line.split()
    # columns 8 and 10 of the BED file
    c.execute('INSERT INTO table2 VALUES (?, ?, ?);', (index, tokens[7], tokens[9]))
conn.commit()

Finally, issue a query that checks which rows have matching values and returns their line numbers.

query = c.execute(
    'SELECT table2.line, table1.line '
    'FROM table1, table2 '
    'WHERE table1.col1 == table2.col10 AND table1.col4 == table2.col8 '
    'ORDER BY table2.line;'
)
while True:
    result = query.fetchone()
    if result is None: break
    # print result to file

The result will contain line numbers, but you can also store and query other columns.
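As an illustration of storing and querying other columns, the sketch below keeps the TPM value and the raw BED line in the tables, so the join returns the finished output line directly. The table names tpm and peaks, their column names, and the index are assumptions made for this example and do not come from the answer above; the rows are shortened toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file-backed database may be needed for the real 2.7GB input
c = conn.cursor()
c.execute("CREATE TABLE tpm (sample TEXT, tf TEXT, value TEXT)")
c.execute("CREATE TABLE peaks (line INTEGER, tf TEXT, sample TEXT, raw TEXT)")

# Toy rows (assumption): `raw` stands in for the full, unmodified BED line.
tpm_rows = [("CNhs12333", "CTCF", "4.41"), ("CNhs12331", "HES2", "0.0")]
peak_rows = [
    (0, "CTCF", "CNhs12333", "chr1\t778704\t778912"),
    (1, "CTCF", "CNhs12331", "chr1\t778704\t778912"),
]
c.executemany("INSERT INTO tpm VALUES (?, ?, ?)", tpm_rows)
c.executemany("INSERT INTO peaks VALUES (?, ?, ?, ?)", peak_rows)
# An index on the join key lets SQLite avoid scanning `tpm` for every peak row.
c.execute("CREATE INDEX tpm_key ON tpm (sample, tf)")
conn.commit()

query = c.execute(
    "SELECT peaks.raw, tpm.value "
    "FROM peaks JOIN tpm ON tpm.sample = peaks.sample AND tpm.tf = peaks.tf "
    "ORDER BY peaks.line"
)
# Each result row already holds everything needed for one output line.
joined = ["{0}\t{1}".format(raw, value) for raw, value in query]
conn.close()
print(joined)
```

The value is stored as TEXT so that it is written back exactly as it appeared in the input, for example 0.0 rather than a reformatted number.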

Answer 3

Here is another example that uses a set for the lookup:

def main():
    f = Filter(TPM_fn='tf_TPM2.bed', final_fn='Cell_line_final2.bed',
               save_fn='Cell_line_final3.bed')

class Filter:
    def __init__(self, **kwargs):
        self.args = kwargs
        self.read_TPM()
        with open(self.args['save_fn'], 'w') as outfile:
            with open(self.args['final_fn'], 'r') as infile:
                self.read_infile(infile, outfile)


    def read_infile(self, infile, outfile):
        for line in infile:
            fields = line.split()
            key = fields[9]+fields[7]
            if key in self.tpm:
                outfile.write(line)
        return 


    def read_TPM(self):
        fn = self.args['TPM_fn']
        tpm = set()
        with open(fn) as f:
            for line in f:
                fields = line.split()
                if len(fields) != 4:
                    continue 
                key = fields[0]+fields[3]
                tpm.add(key)
        self.tpm = tpm

main()

How to parallelize or make a Python script faster
