Using multiprocessing

Time: 2017-06-23 14:57:01

Tags: python bash parallel-processing multiprocessing hdf

I'm really new to Python multiprocessing and am trying to parallelize my code because it takes far too long to run. I have a piece of code that runs over a large amount of data to check whether any files are corrupt. My code so far is:

import os
import subprocess

def check_Corrupt_1(dirPath, logfile):

    fileCheck = open(logfile, "w")

    emptydir = []
    zero_size = {}
    #entering the year to be checked (day number)
    for fname in os.listdir(dirPath):

        if(os.listdir(os.path.join(dirPath, fname)) == []):
            emptydir.append(fname)

        else:

            #this makes sure that we do not enter an empty directory
            if fname not in emptydir:
                inPath = os.path.join(dirPath, fname)

                for filename in os.listdir(inPath):
                    hdfinfo = os.stat(os.path.join(inPath, filename))

                    if(hdfinfo.st_size == 0):
                        zero_size[filename] = True

                    else:
                        #run hdp on the file and wait for it to finish
                        proc = subprocess.Popen(["hdp", "dumpsds", "-h", os.path.join(inPath, filename)],
                                                stdout=subprocess.PIPE)
                        proc.communicate()

                        #a non-zero exit status means hdp could not read the file
                        if(proc.returncode != 0):
                            fileCheck.write(os.path.join(inPath, filename) + '\n')

I have 365 directories per year, and each directory contains many files to check. I'm running a bash command (hdp) to check whether each file is corrupt, but because that command produces very long output, this code takes a long time to run. I was hoping parallelization would speed it up, but I don't understand how to apply it here. Besides multiprocessing, is there any other way to make this faster? I would appreciate any help.
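Since only the exit status of hdp is used to decide whether a file is corrupt, the per-file check can be reduced to a single call that discards the long output instead of piping it back into Python. A minimal sketch of just that step, assuming Python 3 (subprocess.DEVNULL) and an illustrative helper name is_corrupt:

import subprocess

def is_corrupt(hdf_path):
    #run hdp, throw away its long output, and look only at the exit status
    status = subprocess.call(["hdp", "dumpsds", "-h", hdf_path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    return status != 0

Not capturing the output avoids funnelling it back through a pipe, which may already help even before any parallelization.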

1 Answer:

Answer 0 (score: 1)

From a brief look at your post and the code snippet you included, most of the heavy lifting appears to be done by the hdp command, and that is the part you want to parallelize. What you are currently doing is launching it in a subprocess. You could also try using threads. Your code would look something like this:

#!/usr/bin/python
import os
import threading
from subprocess import call

def check_Corrupt_1(dirPath, logfile):

    fileCheck = open(logfile, "w")
    log_lock = threading.Lock()

    def check_file(filepath):
        #run hdp in this thread; a non-zero exit status marks the file as corrupt
        if call(["hdp", "dumpsds", "-h", filepath]) != 0:
            with log_lock:
                fileCheck.write(filepath + '\n')

    emptydir = []
    zero_size = {}
    threads = []
    #entering the year to be checked (day number)
    for fname in os.listdir(dirPath):

        if(os.listdir(os.path.join(dirPath, fname)) == []):
            emptydir.append(fname)

        else:

            #this makes sure that we do not enter an empty directory
            if fname not in emptydir:
                inPath = os.path.join(dirPath, fname)

                for filename in os.listdir(inPath):
                    hdfinfo = os.stat(os.path.join(inPath, filename))

                    if(hdfinfo.st_size == 0):
                        zero_size[filename] = True

                    else:
                        #one thread per file; the interpreter is free while hdp runs
                        t = threading.Thread(target=check_file,
                                             args=(os.path.join(inPath, filename),))
                        t.start()
                        threads.append(t)

    #wait for every check to finish before closing the log
    for t in threads:
        t.join()
    fileCheck.close()
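Since the question asks about multiprocessing specifically, here is a variation of the same idea using a fixed-size multiprocessing.Pool instead of one thread per file. This is a sketch, not part of the original answer: it assumes Python 3, and the names check_file and check_Corrupt_parallel are illustrative. A pool also caps how many hdp processes run at the same time:

#!/usr/bin/python
import os
import subprocess
from multiprocessing import Pool

def check_file(filepath):
    #run hdp on one file; return the path if the exit status signals a problem
    status = subprocess.call(["hdp", "dumpsds", "-h", filepath],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    return filepath if status != 0 else None

def check_Corrupt_parallel(dirPath, logfile, workers=8):
    #collect every non-empty file in every day directory
    paths = []
    for fname in os.listdir(dirPath):
        inPath = os.path.join(dirPath, fname)
        for filename in os.listdir(inPath):
            filepath = os.path.join(inPath, filename)
            if os.stat(filepath).st_size > 0:
                paths.append(filepath)

    #check the files in a pool of worker processes
    with Pool(workers) as pool:
        results = pool.map(check_file, paths)

    #write out the paths of the files that failed the check
    with open(logfile, "w") as fileCheck:
        for path in results:
            if path is not None:
                fileCheck.write(path + '\n')

Because each worker spends nearly all of its time waiting on an external hdp process, a thread pool (multiprocessing.dummy.Pool or concurrent.futures.ThreadPoolExecutor) would serve equally well here; the important part is bounding the number of hdp processes running at once.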