Question

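The idea is as follows: I need to read more than 1,000 files (gene expression profiles) and insert each file's data matrix into a common dictionary. Doing this with a for loop is simple but also very slow (the final dictionary could be around 10 GB). So I looked at the multiprocessing module, but what I get back is a list of None values, and the global dictionary is empty. Below is the code I am currently using.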

import os
import GEOparse
import mygene as mg
import pandas as pd
import numpy as np
from multiprocessing import Pool

PATH = '../../data/GEO/'
directory = os.fsencode(PATH)

def get_exprs(filename):
    global dic_df
    global j

    # Import dataset
    gse = GEOparse.get_GEO(filepath=f'{PATH}{filename}', silent=True)

    # Create PD dataframe for each GEO entry (for each use object)
    i = 0
    for name, gsm in gse.gsms.items():
        if (i==0): #We are reading a new file
            dic_df[str(j)] = pd.DataFrame(data=gsm.table.iloc[0:, 0])

        temp = pd.Series(gsm.table.iloc[0:, 1])
        dic_df[str(j)].insert(i+1, str(gsm), temp)
        i += 1 #Update column of matrix to be added to the dataframe
    j += 1 #Update entry dictionary

if __name__ == '__main__':
    pool = Pool(2)
    dic_df = {} #Global dictionary containing matrices
    j = 0 #Initial value for the dictionary
    list_files = [os.fsdecode(file) for file in os.listdir(directory) if file!=b'.DS_Store'] #List of all the files to analyze

    print(pool.map(get_exprs, list_files, chunksize=2))
    pool.close()
    pool.join()

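Before you say anything: I know there is no input data to test this with, but I don't know how to share these files with you. I need to understand why this code does not work properly (in fact, it does not work at all).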
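A likely explanation, offered as a sketch rather than a verified fix: each worker in a `Pool` runs in its own process with its own copy of the module globals, so writes to `dic_df` and `j` inside `get_exprs` never reach the parent process, and `pool.map` collects the function's return values, which are `None` because the function returns nothing. The example below shows the usual alternative of returning the result from the worker and building the dictionary in the parent. `load_matrix` and `collect` are hypothetical helpers, and the file names are placeholders; `load_matrix` stands in for the GEOparse parsing step so that the example runs without any data files.

```python
from multiprocessing import Pool


def load_matrix(filename):
    # Hypothetical stand-in for the GEOparse parsing step: returns a small
    # dict of column name -> values instead of a real expression matrix.
    return {"ID_REF": ["g1", "g2"], f"{filename}_sample": [1.0, 2.0]}


def get_exprs(filename):
    # Return the result instead of writing to a global. Each worker process
    # has its own copy of the globals, so the parent never sees those writes.
    return filename, load_matrix(filename)


def collect(results):
    # Build the dictionary in the parent process, keyed by position,
    # mirroring the str(j) keys used in the question.
    return {str(j): matrix for j, (_, matrix) in enumerate(results)}


if __name__ == "__main__":
    list_files = ["a.soft", "b.soft", "c.soft"]  # placeholder file names
    with Pool(2) as pool:
        # pool.map returns the workers' return values in input order.
        dic_df = collect(pool.map(get_exprs, list_files, chunksize=2))
    print(sorted(dic_df))
```

Note that every returned matrix is pickled and sent back to the parent, so with a final dictionary near 10 GB the transfer and the memory held by the parent may still dominate the run time.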

Python multiprocessing to create dataframes from many files

0 answers: