I am trying to improve the performance of my code and cannot figure out how to implement the multiprocessing module in it.
I am using Linux (CentOS 7.2) and Python 2.7.
The code that I need to run in a parallel environment:
import os
import sys

def start_fetching(directory):
    with open("test.txt", "a") as myfile:
        try:
            for dirpath, dirnames, filenames in os.walk(directory):
                for current_file in filenames:
                    current_file = dirpath + "/" + current_file
                    myfile.write(current_file)
            return 0
        except:
            return sys.exc_info()[0]

if __name__ == "__main__":
    cwd = "/home/"
    final_status = start_fetching(cwd)
    exit(final_status)
I need to save the metadata of all the files in a database (only the file name is shown here). For now I am only storing the file names in a text file.
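For illustration only, a minimal sketch of the kind of per-file metadata that could be collected with os.stat; the field names below are assumptions, not the OP's actual schema.

import os

def file_metadata(path):
    # Hypothetical example of per-file metadata; the real database schema
    # is not shown in the question.
    info = os.stat(path)
    return {"name": path, "size": info.st_size, "uid": info.st_uid,
            "gid": info.st_gid, "mtime": info.st_mtime}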
Answer 0 (score: 1)
I assume the task you want to parallelize is fairly large; what you have shown only writes file names to a file. Here I create a separate output file for each worker, and afterwards you can combine all of those files as well (a sketch of that combining step follows the code below). There are other ways to achieve this.
If the main concern is parallelization, the following could be a solution.
Python supports both multithreading and multiprocessing. Multithreading is not truly parallel processing; it only gives parallel execution when the work is blocked on IO. If you want to run your code in parallel, use multiprocessing [https://docs.python.org/2/library/multiprocessing.html]. Your code could look like this:
import os
import sys
from multiprocessing import Process

def task(filename):
    with open(filename + "test.txt", "a") as myfile:
        myfile.write(filename)

def start_fetching(directory):
    try:
        processes = []
        for dirpath, dirnames, filenames in os.walk(directory):
            for current_file in filenames:
                current_file = dirpath + "/" + current_file
                # Create a separate process for each file, because
                # multi-threading won't help with parallelizing this
                p = Process(target=task, args=(current_file,))
                p.start()
                processes.append(p)
        # Let all the child processes finish and do some post-processing if needed
        for process in processes:
            process.join()
        return 0
    except:
        return sys.exc_info()[0]

if __name__ == "__main__":
    cwd = "/home/"
    final_status = start_fetching(cwd)
    exit(final_status)
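As mentioned above, the per-process output files can be combined afterwards. A minimal sketch of that step, assuming the filename + "test.txt" naming used in task() above (combine_outputs and combined.txt are hypothetical names):

import os

def combine_outputs(directory, combined_name="combined.txt"):
    # Walk the tree again and append every per-file "...test.txt" output
    # into a single file.
    with open(combined_name, "w") as combined:
        for dirpath, dirnames, filenames in os.walk(directory):
            for name in filenames:
                if name.endswith("test.txt"):
                    with open(os.path.join(dirpath, name)) as part:
                        combined.write(part.read() + "\n")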
Answer 1 (score: 0)
Thank you all for helping me cut this script's processing time almost in half. (I am adding this as an answer because I can only fit so much in a comment.)
I found two ways to accomplish what I wanted:
Using this link mentioned by @KeerthanaPrabhakaran, which is concerned with multithreading.
import contextlib
import itertools
import multiprocessing
import os
import subprocess

def worker(filename):
    # stat -c prints a ready-made SQL INSERT statement filled with the file's
    # name, type, size, owner, group and timestamps
    subprocess_out = subprocess.Popen(["stat", "-c",
        "INSERT INTO file VALUES (NULL, \"%n\", '%F', %s, %u, %g, datetime(%X, 'unixepoch', 'localtime'), datetime(%Y, 'unixepoch', 'localtime'), datetime(%Z, 'unixepoch', 'localtime'));",
        filename], stdout=subprocess.PIPE)
    return subprocess_out.communicate()[0]

def start_fetching(directory, threads):
    filename = fetch_filename() + ".txt"  # fetch_filename() is defined elsewhere in the script
    with contextlib.closing(multiprocessing.Pool(threads)) as pool:  # pool of worker processes
        with open(filename, "a") as myfile:
            walk = os.walk(directory)
            fn_gen = itertools.chain.from_iterable((os.path.join(root, file) for file in files)
                                                   for root, dirs, files in walk)
            results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing
            print "Concatenating the result into the text file"
            for result in results_of_work:
                myfile.write(str(result))
    return filename
This traversed 15,203 files in 0m15.154s.
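For reference, a hedged sketch of how this version might be invoked; fetch_filename() is the OP's own helper, and the worker count of 8 is an arbitrary assumption:

if __name__ == "__main__":
    output_file = start_fetching("/home/", 8)  # 8 workers is an assumed value
    print "Results written to", output_file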
The second, mentioned by @ArunKumar, is concerned with multiprocessing:
import multiprocessing
import os
import subprocess
import sys

def task(filename, process_no, return_dict):
    subprocess_out = subprocess.Popen(["stat", "-c",
        "INSERT INTO file VALUES (NULL, \"%n\", '%F', %s, %u, %g, datetime(%X, 'unixepoch', 'localtime'), datetime(%Y, 'unixepoch', 'localtime'), datetime(%Z, 'unixepoch', 'localtime'));",
        filename], stdout=subprocess.PIPE)
    return_dict[process_no] = subprocess_out.communicate()[0]

def start_fetching_1(directory):
    try:
        processes = []
        i = 0
        manager = multiprocessing.Manager()
        return_dict = manager.dict()
        for dirpath, dirnames, filenames in os.walk(directory):
            for current_file in filenames:
                current_file = dirpath + "/" + current_file
                # Create a separate process for each file, because
                # multi-threading won't help with parallelizing this
                p = multiprocessing.Process(target=task, args=(current_file, i, return_dict))
                i += 1
                p.start()
                processes.append(p)
        # Let all the child processes finish and do some post-processing if needed
        for process in processes:
            process.join()
        with open("test.txt", "a") as myfile:
            myfile.write("".join(return_dict.values()))
        return 0
    except:
        return sys.exc_info()[0]
This traversed 15,203 files in 1m12.197s.
I don't understand why multiprocessing takes that much time (my initial code took only 0m27.884s), yet it uses almost 100% of the CPU.
The above is the exact code I am running (I am storing this information in a file and then using these test.txt files to create the database entries; a sketch of that loading step follows at the end).
I am trying to optimize the above code further but cannot come up with a better way; as @CongMa mentioned, it may finally have come down to the I/O bottleneck.
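Since the worker output is a series of SQL INSERT statements (the datetime(..., 'unixepoch', 'localtime') calls are SQLite syntax), loading such a file into a database could look like the sketch below; the file table is assumed to already exist, and load_inserts/files.db are hypothetical names.

import sqlite3

def load_inserts(txt_path, db_path="files.db"):
    # Replay the collected INSERT statements against an SQLite database that
    # already contains a matching "file" table.
    conn = sqlite3.connect(db_path)
    with open(txt_path) as statements:
        conn.executescript(statements.read())
    conn.commit()
    conn.close()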