Python threading or multiprocessing is slower than running the same program in many tabs simultaneously

Date: 2018-03-24 08:33:04

Tags: python multiprocessing python-multithreading

I am working on a program that processes a huge JSON file and does some analysis before inserting into the DB. Initially, my program prototype split the JSON file into n parts, which were then run independently via the script:

python data_import.py --start 1 --cluster 6
python data_import.py --start 2 --cluster 6
python data_import.py --start 3 --cluster 6
...
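As an aside, the per-tab commands above could be launched from a single terminal with `subprocess` instead of separate tabs; a minimal sketch, assuming the `data_import.py` name and flags shown above:

```python
import subprocess
import sys

def launch_workers(cluster=6, script="data_import.py"):
    """Launch one independent worker process per slice, like opening tabs."""
    procs = [
        subprocess.Popen([sys.executable, script,
                          "--start", str(i), "--cluster", str(cluster)])
        for i in range(1, cluster + 1)
    ]
    # Block until every worker has exited
    for p in procs:
        p.wait()
```

Each worker is still a fully independent OS process here, so this keeps the original performance characteristics while avoiding the manual tabs.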

The performance was very good, but it is really annoying to create so many tabs every time I have to run it. So I modified the program to use multiprocessing like this:

import argparse
import json
import logging
from multiprocessing import Manager, Process


def main():
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument('--cluster', type=int, default=5,
                        help='number of clusters')
    parser.add_argument('--total', type=int, default=5701017,
                        help='total number of data')
    parser.add_argument('--json_path', type=str, default="../Data/output.json",
                        help='location of data source')

    args = parser.parse_args()

    # Shared cross-process counter list, one slot per worker
    manager = Manager()
    work_count = manager.list([0] * args.cluster)

    p_logger = setup_logger("Workers", 'done' + '_' + str(args.cluster) + '.log',
                            logging.INFO)
    try:
        processes = []
        for c in range(1, args.cluster + 1):
            p = Process(target=update_extractor_result,
                        args=(args, c, work_count, p_logger))
            processes.append(p)
        # Start the processes
        for p in processes:
            p.start()
        # Ensure all processes have finished execution
        for p in processes:
            p.join()
    except Exception as e:
        print("Error: unable to start process:", e)


def update_extractor_result(args, num_start, work_count, p_logger):
    logger = setup_logger(__name__, 'error' + str(num_start) + '_' + str(args.cluster) + '.log', logging.ERROR)

    batch = 1
    total_loaded_count = 0
    total = args.total - 1
    total_works = int(total / args.cluster)
    done_count = 0
    startfrom = int(total_works * (num_start - 1))
    endfrom = int(total_works * num_start)

    json_path = args.json_path

    with open(json_path, 'r', encoding="utf8") as f:
        for line in f:
            try:
                # Every worker parses every line, even those outside its slice
                data = json.loads(line)

                if total_loaded_count % 100 == 0:
                    p_logger.info("Workers: " + str(work_count))
                total_loaded_count += 1
                # Each increment of the shared list goes through the manager process
                work_count[num_start - 1] += 1

                if startfrom <= total_loaded_count <= endfrom:
                    data = data.doAnalysis()
                    insertToDB()
                    done_count += 1
                    print("Done batch " + str(batch) + " - count: " + str(done_count))
                    batch += 1
            except Exception as e:
                logger.error(e)
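One source of overhead worth measuring in the code above is the shared `work_count` list: every `work_count[num_start - 1] += 1` is a read plus a write round-trip to the manager process. A minimal timing sketch of proxied versus local increments (the iteration count is arbitrary):

```python
import time
from multiprocessing import Manager

def time_increments(counter_list, n=5000):
    """Time n increments of the first slot of a list-like counter."""
    start = time.perf_counter()
    for _ in range(n):
        counter_list[0] += 1  # on a ListProxy, this is two IPC calls
    return time.perf_counter() - start

if __name__ == "__main__":
    manager = Manager()
    proxied = manager.list([0])  # lives in the separate manager process
    local = [0]                  # plain in-process list

    print("proxy:", time_increments(proxied))
    print("local:", time_increments(local))
```

If the proxied counter dominates, updating a plain local counter and reporting it only periodically (e.g. every 100 lines, where the logging already happens) would cut most of that cost.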

Comparing running the same program in multiple tabs simultaneously against the multiprocessing version: with 6 clusters, the former takes 6-8 hours, while the latter has completed only about 1/5 of the processing after 12 hours.
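One thing worth checking: in the worker above, every process reads and `json.loads` every line of the whole file, and only the slice condition differs. A sketch of a worker loop that skips lines outside its slice without parsing them at all (assumes line-delimited JSON; `handle_record` is a hypothetical stand-in for the analysis and DB insert):

```python
import json

def process_slice(json_path, start, end, handle_record):
    """Parse only the lines in [start, end); skip all others cheaply."""
    with open(json_path, "r", encoding="utf8") as f:
        for lineno, line in enumerate(f):
            if lineno < start:
                continue  # cheap skip: no JSON parsing
            if lineno >= end:
                break     # past our slice: stop reading the file entirely
            handle_record(json.loads(line))
```

With n workers, the original loop does n full parses of the file; this version parses each line exactly once across all workers, and workers for early slices also stop reading as soon as their slice ends.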

Why is there such a big difference between them? Or is there some problem with my program?

0 Answers