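What do the overflow_xxxx.bin files mean when training GloVe?

Time: 2017-05-16 06:03:23

Tags: stanford-nlp

I am training a word embedding model based on the GloVe method. While it runs, the cooccur step prints log output such as: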

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 8 < /home/ignacio/data/GUsDany/corpus/GUs_regulon_pubMed.txt > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 8
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 145223095 words.
Building lookup table...table contains 228170143 elements.
Processing token: 5478600000

Meanwhile, GloVe's main directory is filling up with files named like overflow_0534.bin. Can anyone tell me whether everything is going well?

Thanks

1 Answer:

Answer 0: (score: 0)

Everything is fine.

You can check the source code of GloVe's cooccur program on GitHub.

At line 57 of the file:

long long overflow_length; // Number of cooccurrence records whose product exceeds max_product to store in memory before writing to disk

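If your corpus produces more co-occurrence records than fit in the in-memory overflow buffer, the buffer is sorted and written out to temporary .bin files on disk: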

while (1) {
    if (ind >= overflow_length - window_size) { // If overflow buffer is (almost) full, sort it and write it to temporary file
        qsort(cr, ind, sizeof(CREC), compare_crec);
        write_chunk(cr,ind,foverflow);
        fclose(foverflow);
        fidcounter++;
        sprintf(filename,"%s_%04d.bin",file_head,fidcounter);
        foverflow = fopen(filename,"w");
        ind = 0;
    }
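
To see where a name like overflow_0534.bin comes from, the naming step in the excerpt above can be isolated in a short C sketch. The helper name `make_overflow_name` is hypothetical and not part of GloVe, and `"overflow"` as the file head is inferred from the file names in the question.

```c
#include <stdio.h>

/* Hypothetical helper (not part of GloVe): builds the name of a
 * temporary chunk file the same way the sprintf call in the excerpt does.
 * Returns the number of characters snprintf would have written. */
int make_overflow_name(char *buf, size_t buf_size,
                       const char *file_head, int fidcounter)
{
    /* %04d zero-pads the running chunk counter to four digits. */
    return snprintf(buf, buf_size, "%s_%04d.bin", file_head, fidcounter);
}
```

With a file head of "overflow" and a counter of 534, this yields overflow_0534.bin. The number in the file name is just a running counter that increases each time the buffer is flushed, so a high number means many chunks have been written, not that something went wrong.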

The variable overflow_length depends on your -memory setting.

Line 463:

if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]);

Line 467:

rlimit = 0.85 * (real)memory_limit * 1073741824/(sizeof(CREC));

Line 470:

overflow_length = (long long) rlimit/6; // 0.85 + 1/6 ~= 1
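
Putting lines 467 and 470 together, the overflow buffer size can be reproduced outside GloVe. The following is a minimal sketch: the function name `overflow_length_for` is hypothetical, and the 16-byte record size used in the note below is an assumption (two 4-byte ints plus an 8-byte double) rather than something shown in the excerpts.

```c
#include <stddef.h>

/* Hypothetical helper (not part of GloVe): repeats the arithmetic from
 * the quoted lines for a given -memory value (in GB) and record size. */
long long overflow_length_for(double memory_limit_gb, size_t record_size)
{
    /* 85% of the memory budget, expressed as a number of records. */
    double rlimit = 0.85 * memory_limit_gb * 1073741824 / record_size;

    /* As in the quoted line, the cast binds before the division:
     * truncate rlimit to an integer first, then integer-divide by 6. */
    return (long long) rlimit / 6;
}
```

Assuming 16-byte records, `overflow_length_for(4.0, 16)` gives 38028356, which matches the "overflow length: 38028356" line in the log from the question. In other words, with -memory 4.0 the buffer is flushed to a new temporary file whenever it holds about 38 million records.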