I am training a word embedding model with the GloVe method. While the co-occurrence step is running, it prints a log like this:
$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000
Meanwhile, GloVe's main directory is filling up with files named like overflow_0534.bin. Can anyone tell me whether everything is going well?
Answer 0 (score: 0)
Everything is fine.
You can check the source code of the GloVe cooccur program on GitHub.
At line 57 of the file:
long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk
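If your corpus produces more co-occurrence records than fit in the in-memory overflow buffer, the buffer is sorted and written out to temporary .bin files on disk: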
while (1) {
if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
qsort(cr, ind, sizeof(CREC), compare_crec);
write_chunk(cr,ind,foverflow);
fclose(foverflow);
fidcounter++;
sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
foverflow = fopen(filename,"w");
ind = 0;
}
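The `sprintf` call in the excerpt above is what produces names such as the `overflow_0534.bin` mentioned in the question. A minimal sketch of just that naming step follows. It assumes `file_head` is `"overflow"`, which is inferred from the file name in the question rather than quoted from the source, and `overflow_filename` is a hypothetical helper, not a function in GloVe:

```c
#include <stdio.h>

/* Hypothetical helper: builds a temporary file name with the same
 * "%s_%04d.bin" format string used in the excerpt. The counter is
 * zero-padded to at least four digits. */
int overflow_filename(char *buf, size_t size, const char *file_head, int fidcounter) {
    return snprintf(buf, size, "%s_%04d.bin", file_head, fidcounter);
}
```

Calling `overflow_filename(buf, sizeof buf, "overflow", 534)` fills `buf` with `overflow_0534.bin`, so a file with that name suggests the buffer has been flushed several hundred times so far.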
The variable overflow_length depends on your -memory setting.
Line 463:
if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);
Line 467:
rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));
Line 470:
overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
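To connect these lines to the log in the question, here is a small sketch that repeats the same arithmetic for a given `-memory` value. The `CREC` layout (two `int` word ids plus one `double` value) is an assumption about the source, and `overflow_length_for` is a hypothetical helper written only for illustration:

```c
typedef double real;

/* Assumed record layout: two int word ids and one real value.
 * On typical 64-bit platforms sizeof(CREC) is 16 bytes. */
typedef struct cooccur_rec {
    int word1;
    int word2;
    real val;
} CREC;

/* Hypothetical helper: same arithmetic as the two quoted lines.
 * memory_limit is the -memory value in gigabytes. */
long long overflow_length_for(real memory_limit) {
    real rlimit = 0.85 * (real)memory_limit * 1073741824 / (sizeof(CREC));
    /* The cast binds tighter than the division, so this is integer division. */
    return (long long) rlimit / 6;
}
```

Where `sizeof(CREC)` is 16, `overflow_length_for(4.0)` returns 38028356, which matches the "overflow length: 38028356" line in the log for `-memory 4.0`. A larger `-memory` value gives a larger buffer and therefore fewer overflow files.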