Question

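I have a vocabulary (a vector of strings) and a file full of sentences. I want to build a matrix that shows how often each sentence contains each word. My current implementation is very slow, and I believe it can be made much faster. A single sentence of about 10 words takes roughly a minute.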

Can you explain why this happens and how to speed it up?

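Note: I use a sparse matrix because a full matrix would not fit in memory. The vocabulary has about 10,000 words. Running the program does not exhaust my working memory, so that is not the problem.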

Here is the relevant code. Variables not mentioned before, such as totalLineCount, vocab and vocabCount, have already been initialized.

% initiate sentence structure
wordSentenceMatrix = sparse(vocabCount, totalLineCount);
% fill the sentence structure
fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid),
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, " ");
    % go through each word and increase the corresponding value in the matrix
    for j=1:size(line,2),
        for k=1:vocabCount,
            w1 = line(j);
            w2 = vocab(k);
            if strcmp(w1, w2),
                wordSentenceMatrix(k, lineCount) = wordSentenceMatrix(k, lineCount) + 1;
            end;
        end;
    end;
end;

Answer 1

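A sparse matrix is actually stored in memory as three arrays. In simplified terms, you can think of its storage as an array of row indices, an array of column indices, and an array of the nonzero values. (The slightly more complicated full story is called compressed sparse column.)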
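
As a rough illustration of that three-array picture, here is a minimal sketch using a made-up 3-by-3 matrix (the indices and values are purely illustrative). `find` returns the nonzero entries as parallel row, column and value arrays, which is the simplified view described above rather than the actual compressed sparse column layout.

```matlab
% Build a small sparse matrix from (row, column, value) triplets.
% The numbers are illustrative only.
S = sparse([1 3 2], [1 1 3], [10 20 30], 3, 3);

% find returns the nonzero entries as three parallel column vectors,
% ordered column by column.
[r, c, v] = find(S);

% Rebuilding from the triplets gives back the same matrix.
S2 = sparse(r, c, v, 3, 3);
```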

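By growing the sparse matrix element by element in your code, you repeatedly change the structure (the sparsity pattern) of the matrix. This is not recommended, because it involves a lot of memory copying.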
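
The difference can be sketched on a toy example (the index lists below are made up for illustration). Both variants produce the same counts, but the second one hands all index pairs to `sparse` in a single call, which adds up the values of repeated (row, column) pairs, so the sparsity pattern is built only once.

```matlab
% Illustrative word hits: entry (2,1) occurs twice.
rows = [2 1 2 3];
cols = [1 2 1 2];

% Element by element: every new nonzero changes the sparsity pattern.
A = sparse(3, 2);
for n = 1:numel(rows)
    A(rows(n), cols(n)) = A(rows(n), cols(n)) + 1;
end

% All at once: sparse sums the values of duplicate (row, column) pairs.
B = sparse(rows, cols, ones(size(rows)), 3, 2);
```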

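The way you look up a word's index in the vocabulary is also slow, because for every word in the sentence you scan the entire vocabulary. A better approach is to use a Java HashMap in Matlab.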
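
If Java is not available, one possible alternative (not the approach used in this answer) is to look up all words of a sentence at once with `ismember` on cell arrays of strings. The sketch below assumes a tiny made-up vocabulary; words that are missing from it get index 0.

```matlab
% Hypothetical vocabulary and one tokenized sentence.
vocab = {'apple', 'banana', 'cherry'};
words = {'banana', 'kiwi', 'apple', 'banana'};

% found(j) is true if words{j} is in vocab; idx(j) is its position, or 0.
[found, idx] = ismember(words, vocab);

% Keep only the row indices of words that are in the vocabulary.
rowIdx = idx(found);
```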

I modified your code as follows:

rowIdx = [];
colIdx = [];
% map each vocabulary word to its row index for fast lookup
vocabHashMap = java.util.HashMap;
for k = 1 : vocabCount
    vocabHashMap.put(vocab{k}, k);
end

fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid)
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, ' ');
    % record one (row, column) pair per word occurrence
    for j = 1 : length(line)
        k = vocabHashMap.get(line{j});
        if ~isempty(k)  % skip words that are not in the vocabulary
            rowIdx = [rowIdx; k];
            colIdx = [colIdx; lineCount];
        end
    end
end
fclose(fid);
assert(length(rowIdx) == length(colIdx));
nonzeros = length(rowIdx);
% sparse sums the values of repeated (row, column) pairs, which gives the counts
wordSentenceMatrix = sparse(rowIdx, colIdx, ones(nonzeros, 1), vocabCount, lineCount);

Of course, if you know the size of your text collection in advance, you should preallocate the memory for rowIdx and colIdx:

rowIdx = zeros(nonzeros, 1);
colIdx = zeros(nonzeros, 1);
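
A minimal sketch of how the preallocated arrays might then be filled, assuming only an upper bound on the number of word occurrences is known. `maxEntries`, `hits` and `n` are hypothetical names introduced for this illustration: a counter tracks how many slots are in use, and the unused tail is trimmed before calling `sparse`.

```matlab
maxEntries = 8;                      % hypothetical upper bound on word occurrences
rowIdx = zeros(maxEntries, 1);
colIdx = zeros(maxEntries, 1);
n = 0;                               % number of slots filled so far

hits = [2 1; 1 1; 2 1; 3 2];         % illustrative (word index, sentence index) pairs
for p = 1:size(hits, 1)
    n = n + 1;
    rowIdx(n) = hits(p, 1);          % write into the preallocated slot
    colIdx(n) = hits(p, 2);
end

% Drop the unused tail, then build the matrix in one call.
rowIdx = rowIdx(1:n);
colIdx = colIdx(1:n);
M = sparse(rowIdx, colIdx, ones(n, 1), 3, 2);
```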

If you can, port this to Octave.
