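I have been experimenting with the multiprocessing module to convert a list of text files into BERT embeddings.
A BERT embedding is created for each file, but for one particular file the process never finishes.
I previously used Process.join() to wait for the processes to complete, but that ended in a deadlock.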
from multiprocessing import Process
import multiprocessing
import time
import sys
from datetime import datetime

from bert_serving.client import BertClient

# form_path (the directory prefix of the input files) is defined elsewhere.


def process(file, appended_data):
    start = datetime.now()
    with open(form_path + file, 'r') as file1_obj:
        file1 = file1_obj.readlines()
    # Drop empty and whitespace-only lines, then join the rest into one string.
    file11 = [i.rstrip() for i in file1 if not (not i or i.isspace())]
    file111 = [' |||'.join(file11)]
    try:
        bc = BertClient()
        embedding1 = bc.encode(file111)
        del bc
    except ValueError:  # some files have '' as the first string in the list
        embedding1 = None
    appended_data.put({file: embedding1})
    print("finished %s" % file)
    print(datetime.now() - start)
    return appended_data


def embedding_dic(file_list):
    procs = []
    appended_data = multiprocessing.Queue()
    print(file_list[0])
    print(file_list)
    # One process per file, all started at once.
    for file in file_list:
        procs.append(Process(target=process, args=(file, appended_data,)))
    for proc in procs:
        proc.start()
    results = []
    liveprocs = list(procs)
    while liveprocs:
        try:
            # Drain the queue; get(False) raises once the queue is empty.
            while 1:
                r = appended_data.get(False)
                results.append(r)
        except Exception:
            pass
        time.sleep(0.05)  # Give tasks a chance to put more data in
        if not appended_data.empty():
            continue
        liveprocs = [p for p in liveprocs if p.is_alive()]
        print(liveprocs)
        print(len(results))
    return results
For some files, the deadlock still occurs.
To illustrate, running the embedding_dic function on a list of files produces the following output:
No of files available : 7
Files _names:
['0001368007_10-K_2007-03-22.txt', '0001368007_10-K_2008-03-25.txt', '0001368007_10-K_2009-02-27.txt', '0001368007_10-K_2010-03-01.txt', '0001368007_10-K_2011-02-28.txt', '0001368007_10-K_2012-02-29.txt', '0001368007_10-K_2012-02-29.txt']
Processes_started:
[<Process(Process-1899, started)>, <Process(Process-1900, started)>, <Process(Process-1901, started)>, <Process(Process-1902, started)>, <Process(Process-1903, started)>, <Process(Process-1904, started)>, <Process(Process-1905, started)>]
0
[<Process(Process-1899, started)>, <Process(Process-1900, started)>, <Process(Process-1901, started)>, <Process(Process-1902, started)>, <Process(Process-1903, started)>, <Process(Process-1904, started)>, <Process(Process-1905, started)>]
0
[<Process(Process-1899, started)>, <Process(Process-1900, started)>, <Process(Process-1901, started)>, <Process(Process-1902, started)>, <Process(Process-1903, started)>, <Process(Process-1904, started)>, <Process(Process-1905, started)>]
0
[<Process(Process-1899, started)>, <Process(Process-1900, started)>, <Process(Process-1901, started)>, <Process(Process-1902, started)>, <Process(Process-1903, started)>, <Process(Process-1904, started)>, <Process(Process-1905, started)>]
0
[<Process(Process-1899, started)>, <Process(Process-1900, started)>, <Process(Process-1901, started)>, <Process(Process-1902, started)>, <Process(Process-1903, started)>, <Process(Process-1904, started)>, <Process(Process-1905, started)>]
0
finished 0001368007_10-K_2009-02-27.txt
0:00:03.055049
finished 0001368007_10-K_2012-02-29.txt
0:00:03.023879
finished 0001368007_10-K_2012-02-29.txt
0:00:03.055496
finished 0001368007_10-K_2010-03-01.txt
0:00:03.096127
finished 0001368007_10-K_2011-02-28.txt
0:00:03.099099
[<Process(Process-1899, started)>, <Process(Process-1900, started)>]
5
finished 0001368007_10-K_2008-03-25.txt
0:00:04.473414
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
[<Process(Process-1899, started)>]
6
Process Process-1899:
File "/home/jovyan/.conda/envs/pycp_py3k/lib/python3.6/site-packages/bert_serving/client/__init__.py", line 206, in arg_wrapper
return func(self, *args, **kwargs)
[<Process(Process-1899, started)>]
6
Traceback (most recent call last):
File "/home/jovyan/.conda/envs/pycp_py3k/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/jovyan/.conda/envs/pycp_py3k/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "<ipython-input-315-ffe782d1c2f5>", line 12, in process
embedding1=bc.encode(file111)
File "/home/jovyan/.conda/envs/pycp_py3k/lib/python3.6/site-packages/bert_serving/client/__init__.py", line 291, in encode
r = self._recv_ndarray(req_id)
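So when the list of files is given as input, the process deadlocks on the file 0001368007_10-K_2007-03-22.txt.
To check, I ran it with only that same file as input, and it completed.
It also completes when the number of files is no more than 5.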
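The behavior also differs for file lists with more than 7 files (for example, 10 or 12).
I have not been able to debug it.
Another symptom I observed
Any help is appreciated.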
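For reference, here is a minimal sketch of how the collection loop could be bounded by a deadline, so that a worker that never returns is reported by name instead of being polled forever. It does not explain why `bc.encode` hangs; it only isolates the stuck file. The names `fake_embed` and `collect_with_deadline` and the timeout values are hypothetical, and `fake_embed` is a stand-in for the BertClient call so that the example runs without a BERT server.

```python
import multiprocessing as mp
import queue
import time


def fake_embed(name, out_q, delay):
    """Hypothetical stand-in for the BertClient call: sleep, then report."""
    time.sleep(delay)
    out_q.put({name: len(name)})


def collect_with_deadline(jobs, timeout_s, target=fake_embed):
    """Run one process per (name, delay) job and collect results until timeout_s.

    Returns (results, stuck), where stuck lists the names whose worker was
    still alive after the deadline and was terminated.
    """
    # Assumption: the "fork" start method (POSIX only) keeps the sketch
    # usable from a notebook; it is not available on Windows.
    ctx = mp.get_context("fork")
    out_q = ctx.Queue()
    procs = [(name, ctx.Process(target=target, args=(name, out_q, delay)))
             for name, delay in jobs]
    for _, p in procs:
        p.start()

    results = []
    deadline = time.monotonic() + timeout_s
    # Drain the queue before joining: a child that has put data on a Queue
    # may not exit until that data has been consumed.
    while len(results) < len(procs):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            results.append(out_q.get(timeout=min(remaining, 0.1)))
        except queue.Empty:
            pass

    stuck = []
    for name, p in procs:
        p.join(timeout=0.5)
        if p.is_alive():
            # Still running after the deadline: record it and stop it.
            stuck.append(name)
            p.terminate()
            p.join()
    return results, stuck
```

Calling `collect_with_deadline([("a.txt", 0.0), ("slow.txt", 30.0)], timeout_s=2.0)` should return the result for `a.txt` and list `slow.txt` as stuck. In the real code the worker would be the `process` function, and the deadline would need to be comfortably longer than the few seconds a normal file takes.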