Question

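I have two cell arrays that store the unigrams and bigrams I extracted from a text file. I now need to compare each unigram against the bigrams to count how often the unigram occurs in them, and later compute the likelihood. Can anyone help me solve this? I have tried strcmp, but it does not work. The code I have written is below: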

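% For each bigram, find the unigram equal to its first word and show both counts.
% Assumes splitBigramCellsA / splitUnigramCellsA hold the strings and
% splitBigramCellsB / splitUnigramCellsB hold the corresponding counts.
for j = 1:bigramRow
    % Split the bigram into tokens; <s> and </s> count as tokens.
    tokens = regexp(splitBigramCellsA{j}, '<s>|</s>|\w+', 'match');
    if isempty(tokens)
        continue
    end
    % strcmp returns a logical vector with one entry per unigram,
    % so test it with any() rather than comparing it to 1.
    match = strcmp(splitUnigramCellsA, tokens{1});
    if any(match)
        idx = find(match, 1);                     % position of the matching unigram
        bigram1count = splitBigramCellsB{j};      % name is case-sensitive
        unigram1count = splitUnigramCellsB{idx};  % index by the unigram, not by j
        disp(bigram1count)
        disp(unigram1count)
    end
end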

Answer 1

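If the text fits in memory, you can do the following:

Create a cell array of all the words (in order).
Call unique on the cell array and capture the third output. This is the original text represented as an array of indices, where each index refers to a unigram.
Build all the bigrams as bigrams = [indices(1:2:largestEven),indices(2:2:largestEven);indices(2:2:largestOdd),indices(3:2:largestOdd)], where largestEven is 2*floor(length(indices)/2) and largestOdd is 2*floor((length(indices)-1)/2)+1, so that neither exceeds length(indices).
Count, for example, the frequency of each unigram within the bigrams with tabulate(bigrams(:)).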
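
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `words` array and all variable names other than `indices` and `bigrams` are made up for the example. It pairs each word with its successor directly, which should give the same neighboring pairs as the even/odd construction, just in a different row order. It also uses accumarray instead of tabulate so that no toolbox is needed, and it adds one possible reading of the "likelihood" from the question, namely count(w1 w2) / count(w1).

```matlab
% Toy input (an assumption); in practice `words` would come from the text file.
words = {'<s>'; 'the'; 'cat'; 'sat'; 'on'; 'the'; 'mat'; '</s>'};

% Steps 1-2: unique unigrams, and the text re-encoded as indices into them.
[unigrams, ~, indices] = unique(words);
indices = indices(:);                      % force a column vector

% Step 3: every pair of adjacent words, one bigram per row.
bigrams = [indices(1:end-1), indices(2:end)];

% Unigram counts over the whole text.
unigramCounts = accumarray(indices, 1);

% Count each distinct bigram.
[uniqueBigrams, ~, bigramIdx] = unique(bigrams, 'rows');
bigramCounts = accumarray(bigramIdx, 1);

% Assumed definition of the likelihood: P(w2 | w1) = count(w1 w2) / count(w1).
likelihood = bigramCounts ./ unigramCounts(uniqueBigrams(:, 1));

% Print one line per distinct bigram.
for k = 1:size(uniqueBigrams, 1)
    fprintf('%s %s: count = %d, P = %.3f\n', ...
        unigrams{uniqueBigrams(k, 1)}, unigrams{uniqueBigrams(k, 2)}, ...
        bigramCounts(k), likelihood(k));
end
```

If the Statistics Toolbox is available, tabulate(bigrams(:)) reports how often each unigram index appears in any bigram position, which is the count the answer refers to.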

Comparing cell arrays in MATLAB
