Question

I am writing a multithreaded Python program to process my data. Here are the details:

In this example there are 4 folders to process, each containing 2500 files.

20180401: files named 1 to 2500
20180402: files named 2501 to 5000
20180403: files named 5001 to 7500
20180413: files named 7501 to 10000

At the moment my program only prints the file names; the actual processing is disabled.

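Result: every run produces different output, and some files are always missing. After debugging, however, it appears that all 10000 files are reached, but for some unknown reason some of them are not printed.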

Here is the abbreviated code:

    def single_day_rindex(self, solr_server, collection, flow_name, flow_days, json_loc, i):
        for json_file in self.files(json_loc):
            index_command = "\r"+ index_command_base + SOLR_URL + ' -jar ' + POST_JAR_URL + ' ' + single_day + ' ' + json_loc + '/' + json_file
            print(index_command)

    def worker_func(self, solr_server, collection, flow_name, flow_days, json_loc, i):
        sys.stdout.write("\rIn Thread " + str(i + 1))
        sys.stdout.flush()

        self.single_day_rindex(solr_server, collection, flow_name, flow_days, json_loc, i)

    def run(self):

        cur_flow_days = []
        cur_flow_days = self.read_flow_days(solr_server, collection, flow_name, flow_days)

        cur_flow_name = ''
        cur_flow_name = self.read_flow_name(flow_name)

        threads = []

        for i, each_date in enumerate(cur_flow_days):
            threads = [a for a in threads if a.isAlive()]

            while len(threads) >= MAX_THREADS:
                sleep(0.1)
                threads = [a for a in threads if a.isAlive()]

            json_loc = json_loc_base + flow_name_loc + '/' + each_date

            t = Thread(target=self.worker_func, args=(solr_server, collection, flow_name, flow_days, json_loc, i))
            threads.append(t)

            t.start()

        for t in threads:
            t.join()

The output looks like this:

....../20180412/9765
....../20180412/9766
....../20180412/9767

I suspect that the part that iterates over the files in a folder is causing the trouble, and I have been trying different approaches:

Approach 3 (current approach):

    for json_file in self.files(json_loc):
        index_command = "\r"+ index_command_base + SOLR_URL + ' -jar ' + POST_JAR_URL + ' ' + single_day + ' ' + json_loc + '/' + json_file
        print(index_command)

Approach 2:

    # for json_file in os.listdir(json_loc):
    #     index_command = index_command_base + SOLR_URL + ' -jar ' + POST_JAR_URL + ' ' + single_day + ' ' + json_loc + '/' + json_file
    #     print(index_command)

Approach 1:

    # idx = 0
    # for subdir, dirs, files in os.walk(json_loc):
    #     for json_file in files:
    #         index_command = index_command_base + SOLR_URL + ' -jar ' + POST_JAR_URL + ' ' + single_day + ' ' + json_loc + '/' + json_file
    #         idx = idx + 1
    #         print("\r" + str(i) + ": " + index_command)

How can I make sure that each thread fully processes its folder? The order in which the files are processed does not matter.
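One way to check whether every folder is fully processed, independently of what shows up on the terminal, is to have each thread record how many files it handled and compare the totals after all threads have been joined. The following is only a minimal sketch under assumptions: `count_folder`, `verify_folders`, and the `list_files` callable are hypothetical names that do not appear in the program above, and `list_files` stands in for something like `os.listdir`.

```python
import threading


def count_folder(folder, list_files, counts, lock):
    """Visit every file in `folder` and record how many were handled."""
    handled = 0
    for _name in list_files(folder):
        # The real per-file work would go here (hypothetical placeholder).
        handled += 1
    # Guard the shared dict so updates from different threads cannot clash.
    with lock:
        counts[folder] = handled


def verify_folders(folders, list_files):
    """Run one thread per folder and return {folder: files handled}."""
    counts = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=count_folder,
                         args=(folder, list_files, counts, lock))
        for folder in folders
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

If every folder reports the expected count (2500 here) while the printed output is still incomplete, the loss is in the printing rather than in the iteration.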

I am hoping for a better, more robust way to do the multithreading that is also easy to debug and verify.
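One possible arrangement, shown here only as a sketch and not as a confirmed fix, is to let the worker threads put their lines on a queue and have a single writer thread do all the output, so that no two threads write to the terminal at the same time. The names `print_all`, `worker`, `writer`, and the `listing` dict (folder name mapped to its file names) are hypothetical. The sketch also leaves out the leading `"\r"`, which on a terminal moves the cursor back to the start of the line and may let one thread's output overwrite another's.

```python
import sys
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7

_DONE = object()  # sentinel telling the writer to stop


def writer(lines, out):
    """Only this thread writes to `out`, one complete line per call."""
    while True:
        line = lines.get()
        if line is _DONE:
            break
        out.write(line + "\n")
    out.flush()


def worker(folder, names, lines):
    """Queue one line per file instead of printing directly."""
    for name in names:
        lines.put(folder + "/" + name)


def print_all(listing, out=sys.stdout):
    """`listing` maps a folder name to the list of file names in it."""
    lines = queue.Queue()
    w = threading.Thread(target=writer, args=(lines, out))
    w.start()
    workers = [threading.Thread(target=worker, args=(folder, names, lines))
               for folder, names in listing.items()]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    lines.put(_DONE)  # all workers finished, so nothing more will be queued
    w.join()
```

Because the order of the files does not matter, the writer simply emits lines in whatever order they arrive; counting the lines it writes gives a direct check against the expected 10000.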

For the full setup, see https://github.com/mdivk/solr_demo/tree/master/index

How do I fix missing records in a multithreaded Python 2.7 program?

0 answers: