Question

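I am looking for a way to run a Python command over a data set in batches. For example, I want to run the code below for the first 10 rows, print the output, then run the next batch, and so on until the rows are exhausted. The reason is that running all 1000 rows at once currently takes a very long time.

I tried concurrent.futures.ProcessPoolExecutor, but it did not help. Is there a better way?

Here is the code: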

import concurrent.futures
import os, sys
import xlwt
import numpy

import tensorflow as tf  # TensorFlow 1.x API (tf.gfile, tf.GraphDef, tf.Session)
import xlsxwriter
import urllib.request  # Python 3; on Python 2 the call was urllib.urlretrieve

filename = "/home/shri/Desktop/tf_files/test1"

def getimg(count):
    # open file to read
    with open("{0}.csv".format(filename), 'r') as csvfile:
        # iterate on all lines
        i = 0
        for line in csvfile:
            splitted_line = line.split(',')
            # check if we have an image URL
            if splitted_line[1] != '' and splitted_line[1] != "\n":
                urllib.request.urlretrieve(splitted_line[1], '/home/shri/Desktop/tf_files/images/{0}.jpg'.format(splitted_line[0]))
                print("Image saved for {0}".format(splitted_line[0]))
                i += 1
            else:
                print("No result for {0}".format(splitted_line[0]))

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

def run_inference(count):
    # Create a workbook and add a worksheet.
    workbook = xlsxwriter.Workbook('output.xlsx')
    worksheet = workbook.add_worksheet()
    # Start from the first cell. Rows and columns are zero indexed.
    row = 0
    col = 0

    # search for files in 'images' dir
    files_dir = os.getcwd() + '/images'
    files = os.listdir(files_dir)

    # loop over files, print prediction if it is an image
    for f in files:
        if f.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = files_dir + '/' + f

            # Read in the image_data
            image_data = tf.gfile.FastGFile(image_path, 'rb').read()

            # Loads label file, strips off carriage return
            label_lines = [line.rstrip() for line
                           in tf.gfile.GFile("retrained_labels.txt")]

            # Unpersists graph from file (this happens once per image)
            with tf.gfile.FastGFile("retrained_graph.pb", 'rb') as graph_file:
                graph_def = tf.GraphDef()
                graph_def.ParseFromString(graph_file.read())
                tf.import_graph_def(graph_def, name='')

            with tf.Session() as sess:
                # Feed the image_data as input to the graph and get first prediction
                softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')

                predictions = sess.run(softmax_tensor,
                                       {'DecodeJpeg/contents:0': image_data})

                # Sort to show labels of first highest prediction in order of confidence
                top_k = predictions[0].argsort()[-len(predictions):][::-1]

                for node_id in top_k:
                    human_string = label_lines[node_id]
                    score = predictions[0][node_id]

                    worksheet.write_string(row, 1, image_path)
                    worksheet.write(row, 2, human_string)
                    worksheet.write(row, 3, score)
                    print(row)
                    print(node_id)
                    print(image_path)
                    print('%s (score = %.5f)' % (human_string, score))
                    row += 1

    workbook.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as e:
    for i in range(10):
        e.submit(run_inference, i)

This is the data in the Excel sheet (not reproduced here).

Answer 1

I suggest using GNU Parallel. Create a text file (say commands.txt) in which every line is a command you need to run, for example

python mycode.py someargs
python mycode.py someotherargs
...

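Then just run

parallel -j 8 < commands.txt

It will work through the whole list of commands, running 8 (or however many you choose) instances of the script in parallel.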
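
For the batching described in the question, a file like commands.txt could be generated rather than typed by hand. The following is only a sketch: the `--start` and `--count` options are hypothetical, and it assumes `mycode.py` has been changed to accept them and process just that slice of rows.

```python
def batch_commands(total_rows, batch_size, script="mycode.py"):
    """Return one shell command per batch of rows.

    --start/--count are hypothetical options: the script would have to
    be written to accept them and process only that slice.
    """
    commands = []
    for start in range(0, total_rows, batch_size):
        # The last batch may be shorter than batch_size.
        count = min(batch_size, total_rows - start)
        commands.append(
            "python {0} --start {1} --count {2}".format(script, start, count)
        )
    return commands


def write_commands(path, commands):
    """Write the commands to a text file, one per line."""
    with open(path, "w") as handle:
        for command in commands:
            handle.write(command + "\n")


# Example: 1000 rows in batches of 10 gives 100 command lines.
# write_commands("commands.txt", batch_commands(1000, 10))
```

Each line of the resulting file is an independent job, so GNU Parallel can schedule the batches in any order.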

Answer 2

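GNU Parallel cannot make a serial program run faster, nor can it turn a serial program into a parallel one.

What GNU Parallel can do is run a serial program many times in parallel with different arguments. But for that to work, your serial program must be able to run in parallel and the work must be splittable.

So you need to make your serial program able to take one part of the problem and solve just that part. This probably means that at the end you have to collect all the partial solutions into one complete solution.

This technique is today called Map-Reduce. GNU Parallel performs the Map stage.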
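
The split, solve and collect idea can also be sketched inside a single Python program. This is a minimal illustration, not the asker's code: `solve_batch` does placeholder work (it returns the length of each row), and the function names, batch size and worker count are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(rows, batch_size):
    """Split the full problem into independent parts."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def solve_batch(batch):
    """Map step: solve one part on its own.

    Placeholder work (the length of each row); a real program would do
    its slow step here, e.g. fetching or classifying.
    """
    return [len(row) for row in batch]


def combine(partial_results):
    """Reduce step: collect the partial solutions into one result."""
    combined = []
    for part in partial_results:
        combined.extend(part)
    return combined


def run_in_batches(rows, batch_size=10, workers=4):
    batches = split_into_batches(rows, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input batches.
        partial_results = list(pool.map(solve_batch, batches))
    return combine(partial_results)
```

Threads are a reasonable fit when the slow part is I/O such as downloads; for CPU-bound pure-Python work a process pool, or separate processes started by GNU Parallel, is usually the better choice.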

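In your case it is best to identify which part is slow, and then work out how to change that part so that it can be run as a partial solution.

Let us assume it is the URL fetching that is slow. Then you write a program that fetches URL number i, where i can be given on the command line: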

seq 10000 | parallel -j30 python get_url_number.py {}

Here we run 30 jobs in parallel. That will typically not crash a web server, and it will probably fill your bandwidth.
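
A possible shape for `get_url_number.py`, written for Python 3, is sketched below. It assumes the CSV layout from the question (an id in the first column and an image URL in the second). The file name `test1.csv`, the `images` output directory (which must already exist) and the helper names are assumptions made for illustration.

```python
import csv
import sys
import urllib.request


def url_for_row(lines, number):
    """Return (row_id, url) for the 1-based row `number`, or None.

    `lines` is any iterable of CSV lines, e.g. an open file.
    Assumed layout: id in column 0, image URL in column 1.
    """
    for index, row in enumerate(csv.reader(lines), start=1):
        if index == number:
            if len(row) > 1 and row[1].strip():
                return row[0], row[1].strip()
            return None
    return None


def fetch(csv_path, number, out_dir="images"):
    """Download the image for one row; this is the part run in parallel."""
    with open(csv_path, newline="") as handle:
        found = url_for_row(handle, number)
    if found is None:
        print("No result for row {0}".format(number))
        return
    row_id, url = found
    urllib.request.urlretrieve(url, "{0}/{1}.jpg".format(out_dir, row_id))
    print("Image saved for {0}".format(row_id))


if __name__ == "__main__" and len(sys.argv) == 2:
    # seq starts counting at 1, so row numbers are 1-based.
    fetch("test1.csv", int(sys.argv[1]))
```

Each invocation re-reads the CSV to find its row, which is wasteful but keeps every job fully independent of the others.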

How to run a Python script in batches?
