我对python mutli-processing真的很陌生,并试图并行化我的代码,因为它运行时间太长。我有一个代码,它运行大量数据,以查找是否有任何文件损坏。到目前为止,我的代码是:
def check_Corrupt_1(dirPath, logfile):
fileCheck = open(logfile, "w").close()
fileCheck = open(logfile, "w")
emptydir = []
zero_size = {}
#entering the year to be checked (day number)
for fname in os.listdir(dirPath):
if(os.listdir(os.path.join(dirPath, fname)) == []):
emptydir.append(fname)
else:
#this makes sure that we do not enter an empty directory
if fname not in emptydir:
inPath = os.path.join(dirPath, fname)
for filename in os.listdir(inPath):
hdfinfo = os.stat(os.path.join(inPath, filename))
if(hdfinfo.st_size == 0):
zero_size[filename] = True
else:
filepath = "/path/to/file"
strin = subprocess.Popen(["hdp", "dumpsds", "-h", os.path.join(inPath, filename)], stdout=subprocess.PIPE).communicate()[0]
#print(strin)
cmd = 'echo $?'
callno = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
#print(int(callno.stdout.read()[0]))
if(int(callno.stdout.read()[0]) != 0):
fileCheck.write(os.path.join(inPath, filename) + '\n')
我每年有365个目录,每个目录包含很多要检查的文件。我正在运行bash命令来检查文件是否损坏,但是因为我运行的bash命令有很长的输出,这段代码需要花费很多时间才能运行。我希望并行化有助于加快速度,但不了解如何做到这一点。除了多处理之外,还有其他方法可以让它更快吗?我将不胜感激任何帮助。
答案 0 :(得分:1)
从您的文章和您发布的代码段的简要介绍中,似乎大部分繁重的工作似乎是通过hdp
命令完成的。这就是你想要parallelize
的那个。
你似乎在做的是打开一个子流程。
您也可以尝试使用线程。您的代码将是这样的
#!/usr/bin/python
import thread
from subprocess import call
def check_Corrupt_1(dirPath, logfile):
fileCheck = open(logfile, "w").close()
fileCheck = open(logfile, "w")
emptydir = []
zero_size = {}
#entering the year to be checked (day number)
for fname in os.listdir(dirPath):
if(os.listdir(os.path.join(dirPath, fname)) == []):
emptydir.append(fname)
else:
#this makes sure that we do not enter an empty directory
if fname not in emptydir:
inPath = os.path.join(dirPath, fname)
for filename in os.listdir(inPath):
hdfinfo = os.stat(os.path.join(inPath, filename))
if(hdfinfo.st_size == 0):
zero_size[filename] = True
else:
try:
thread.start_new_thread(call(["hdp", "dumpsds", "-h"]))
except:
print "Error generating thread"
if(int(callno.stdout.read()[0]) != 0):
fileCheck.write(os.path.join(inPath, filename) + '\n')