Cannot allocate memory, and some jobs in the pool fail without returning results
What I'm doing: I obtained BERT embeddings (3072 dimensions) and am now running hierarchical clustering on a multiprocessing pool, but it consumes a huge amount of memory and the jobs fail. The server has 48 GB allocated and cannot allocate more. What can I do?
The data is stored in pickle files, each containing a list of lists.
Ex : [[emb], [tokens], [doc embs]]

Sample output while loading (filename, then number of rows):
emb_0.pkl
running function
62204
emb_1.pkl
running function
66505
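For scale, a rough back-of-the-envelope calculation on the embeddings alone (assuming float64 arrays; the post does not state the dtype):

```python
rows = 62204          # rows reported for emb_0.pkl
dims = 3072           # BERT embedding size from the post
bytes_per_value = 8   # float64; halve this for float32

total_bytes = rows * dims * bytes_per_value
print(total_bytes / 1e9)  # ~1.53 GB for a single file's embeddings
```

Since apply_async pickles each slice to ship it to a worker, the same data can exist in the parent and a child process at once, multiplying actual usage well past the raw array size.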
import pickle
import numpy as np

slice_emb = []
slice_tokens = []
for each in pickled_filepaths:
    print(each)
    with open('folder' + each, 'rb') as f:
        chunk = pickle.load(f)
    emb, token = get_df_emb(chunk)  # get_df_emb extracts the embeddings and tokens from the loaded chunk
    slice_emb.extend(np.array_split(emb, 8))       # slicing into 8
    slice_tokens.extend(np.array_split(token, 8))  # slicing into 8
    print(len(emb))
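As a sanity check on the slicing step: np.array_split spreads the remainder over the first sub-arrays, so the eight slices of a 62204-row file are not all the same size (a minimal demo with a one-column placeholder array, since the real arrays are 3072-wide):

```python
import numpy as np

emb = np.zeros((62204, 1))  # placeholder; real arrays have 3072 columns
parts = np.array_split(emb, 8)
print([p.shape[0] for p in parts])
# [7776, 7776, 7776, 7776, 7775, 7775, 7775, 7775]
```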
import pickle
import multiprocessing
from datetime import datetime

start = datetime.now()
pool = multiprocessing.Pool(16)
jobs_pool = []
for x, y in zip(slice_emb, slice_tokens):
    print(x.shape)
    print(y.shape)
    pool_chunk = pool.apply_async(cluster_function, [x, y])  # getting clusters
    jobs_pool.append(pool_chunk)

df_list_pool = []
i = 0
for j in jobs_pool:
    df_list_pool.append(j.get())
    print(df_list_pool[i].shape)
    i += 1
pool.close()
pool.join()
end = datetime.now()
print(end - start)
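The submit-then-collect pattern above can be reproduced in miniature with a toy stand-in for cluster_function (which isn't shown in the post). This sketch uses multiprocessing.dummy (a thread pool with the same API) purely so it runs anywhere without pickling; the real code uses multiprocessing.Pool:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same apply_async API
import numpy as np

def toy_cluster(x, y):
    # stand-in for cluster_function: just count the rows it received
    return x.shape[0] + len(y)

slice_emb = np.array_split(np.zeros((100, 4)), 8)
slice_tokens = np.array_split(np.arange(100), 8)

with Pool(4) as pool:
    jobs = [pool.apply_async(toy_cluster, [x, y])
            for x, y in zip(slice_emb, slice_tokens)]
    results = [j.get() for j in jobs]  # .get() blocks until that job finishes

print(results)
```

Note that every pending job's input slices stay alive in the parent while the pool works through them, which is one reason memory peaks with many large slices queued at once.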
df_list_pool should end up holding all the data in the format below:
     a_type B_type  cluster token
0      2411      g      1.0     a
26     9956      g      1.0     b
27    24323      g      1.0   awq
28     3460      g      1.0    bw
226    9732      g      1.0    cp