I'm working on a program that processes a huge JSON file and does some analysis before inserting the data into a DB. My first prototype split the JSON file into n parts, which were then run independently as separate script invocations:
python data_import.py --start 1 --cluster 6
python data_import.py --start 2 --cluster 6
python data_import.py --start 3 --cluster 6
...
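(For reference, the tab-per-part launches could be scripted in a loop; this is just a sketch of the launch pattern, shown as a dry run with `echo`:)

```shell
# Dry run: prints one launch command per part.
# Drop the "echo", append "&" to each command, and add a final "wait"
# to actually run all parts in parallel from a single terminal.
CLUSTER=6
for i in $(seq 1 "$CLUSTER"); do
    echo python data_import.py --start "$i" --cluster "$CLUSTER"
done
```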
The performance was very good, but opening that many terminal tabs every time I had to run it was tedious. So I rewrote the program with multiprocessing, like this:
import argparse
import json
import logging
from multiprocessing import Manager, Process

def main():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--cluster', type=int, default=5,
                        help='number of clusters')
    parser.add_argument('--total', type=int, default=5701017,
                        help='total number of data rows')
    parser.add_argument('--json_path', type=str, default="../Data/output.json",
                        help='location of data source')
    args = parser.parse_args()

    # Shared list so every worker can report its progress
    manager = Manager()
    work_count = manager.list()
    for i in range(0, args.cluster):
        work_count.append(0)

    # setup_logger is a small helper defined elsewhere in my project
    p_logger = setup_logger("Workers", 'done' + '_' + str(args.cluster) + '.log',
                            logging.INFO)
    try:
        processes = []
        for c in range(1, args.cluster + 1):
            p = Process(target=update_extractor_result,
                        args=(args, c, work_count, p_logger))
            processes.append(p)
        # Start the processes
        for p in processes:
            p.start()
        # Ensure all processes have finished execution
        for p in processes:
            p.join()
    except Exception as e:
        print("Error: unable to start process:", e)
def update_extractor_result(args, num_start, work_count, p_logger):
    logger = setup_logger(__name__,
                          'error' + str(num_start) + '_' + str(args.cluster) + '.log',
                          logging.ERROR)
    batch = 1
    total_loaded_count = 0
    total = args.total - 1
    total_works = int(total / args.cluster)
    done_count = 0
    # This worker is responsible for lines [startfrom, endfrom]
    startfrom = int(total_works * (num_start - 1))
    endfrom = int(total_works * num_start)
    json_path = args.json_path
    with open(json_path, 'r', encoding="utf8") as f:
        for line in f:
            try:
                data = json.loads(line)
                if total_loaded_count % 100 == 0:
                    p_logger.info("Workers: " + str(work_count))
                total_loaded_count += 1
                work_count[num_start - 1] += 1
                if startfrom <= total_loaded_count <= endfrom:
                    data = data.doAnalysis()  # my analysis helper (omitted here)
                    insertToDB()              # my DB helper (omitted here)
                    print("Done batch " + str(batch) + " - count: " + str(done_count))
                    batch += 1
            except Exception:
                logger.exception("Failed on line " + str(total_loaded_count))
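For reference, the per-worker range arithmetic above boils down to this (a standalone sketch; `worker_range` is just a name I'm using here, not a function from my module):

```python
def worker_range(total, cluster, num_start):
    # Mirrors the startfrom/endfrom arithmetic in update_extractor_result:
    # each worker gets an equal slice of line indices.
    total_works = int((total - 1) / cluster)
    startfrom = int(total_works * (num_start - 1))
    endfrom = int(total_works * num_start)
    return startfrom, endfrom

# With the defaults (total=5701017, cluster=6), worker 1 covers
# lines 0..950169 -- but note that every worker still reads and
# json.loads() every line of the file before this range check.
print(worker_range(5701017, 6, 1))  # → (0, 950169)
```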
Compared with running the same program in multiple tabs: with 6 clusters, the multi-tab version finishes in 6-8 hours, while the multiprocessing version had only processed about 1/5 of the data after 12 hours.
Why is there such a big difference? Or is there some problem with my program?