Question

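I am trying to read a huge text file and count the frequency of each letter, and then I want to find the probability distribution of the letters. This is what I have tried so far: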

f = fopen('c:\words.txt');
ns = textscan(f, '%s');
fclose(f);

counts = hist(num, 1:26); 
prob = counts / numel(ns{:})

Any hints, help, or working code?

I also tried this code, but the result is not accurate:

fid = fopen('c:\words.txt');
c = fread(fid);
fclose(fid);


y = unique(c);
counts = histc(c, y);

I would like to get results like the following:

a = 2338 times
b = 4533 times 
c = 1233 times

and so on.

Regards,

Answer 1

For large text files, you may want to avoid using hist or histc.

Code

%// Convert everything to chars
letters_char = reshape(char(ns{:}),[],1);

%// Get the case-insensitive count of each letter 
count_lettters = sum(bsxfun(@eq,letters_char,97:122),1) + ...
    sum(bsxfun(@eq,letters_char,65:90),1)

Finally, to get the probability distribution, use plot(count_lettters./sum(count_lettters)) or bar(count_lettters./sum(count_lettters)), whichever looks better.

Then, if you want to label the axis with the letter that each probability belongs to, use set(gca, 'XTickLabel',cellstr(char(97:122)'),'XTick',1:26).
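The counting and normalization steps above can be sketched end to end on a small in-memory string. This is only an illustration under stated assumptions: the sample text `txt` and the names `count_letters` and `prob_letters` are made up for this example and are not part of the answer's code. The loop at the end prints the counts in the "a = N times" format requested in the question.

```matlab
% Hypothetical sample text standing in for the contents of the file
txt = 'Hello World, hello MATLAB';

% Column vector of characters, lower-cased so counting is case-insensitive
letters_char = lower(txt(:));

% Compare every character against the codes for 'a'..'z' (97:122)
% and sum the matches down each column to get one count per letter
count_letters = sum(bsxfun(@eq, letters_char, 97:122), 1);

% Normalize the counts so that the probabilities sum to 1
prob_letters = count_letters ./ sum(count_letters);

% Print the counts in the format asked for in the question
for k = 1:26
    fprintf('%c = %d times\n', char(96 + k), count_letters(k));
end
```

Lower-casing first means a single comparison against 97:122 is enough, instead of adding separate sums for lower-case and upper-case codes.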

Sample plot:

[Figure: sample plot of the letter distribution]

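Now, this was a random text file, but it at least shows one interesting fact: 'e' is probably the most frequently occurring letter in typical text.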

Answer 2

This reads all the characters in the file into an array A at once:

fileID = fopen('words.txt','r');
A = fscanf(fileID, '%c');   % this also works for unicode characters.
fclose(fileID);
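A slightly more defensive variant of the read above might check that the file actually opened before scanning it. The sketch below first writes a small temporary file so that it is self-contained; the file name `fname` and its contents are assumptions made for this example only.

```matlab
% Write a small hypothetical sample file so the sketch can run on its own
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'abc ABC');
fclose(fid);

% fopen returns -1 on failure, so stop early with a clear message
fileID = fopen(fname, 'r');
if fileID == -1
    error('Could not open %s', fname);
end

% '%c' does not skip whitespace, so every character ends up in A
A = fscanf(fileID, '%c');
fclose(fileID);

% Remove the temporary sample file again
delete(fname);
```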

Using a Map, you can count the occurrences of all characters:

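% Map from each character to its running count
keyMap = containers.Map('KeyType', 'char', 'ValueType', 'double');

for i = 1:numel(A)
    if isKey(keyMap, A(i))
        keyMap(A(i)) = keyMap(A(i)) + 1;   % seen before: increment
    else
        keyMap(A(i)) = 1;                  % first occurrence
    end
end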
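Once the loop has run, the counts can be read back out of the map and turned into relative frequencies. The following is a minimal sketch, assuming a short sample string in place of the file contents; the sample text and the names `chars`, `counts` and `probs` are introduced here for illustration only.

```matlab
% Hypothetical sample input standing in for the array A read from the file
A = 'abracadabra';

% Build the map with the same counting loop as above
keyMap = containers.Map('KeyType', 'char', 'ValueType', 'double');
for i = 1:numel(A)
    if isKey(keyMap, A(i))
        keyMap(A(i)) = keyMap(A(i)) + 1;
    else
        keyMap(A(i)) = 1;
    end
end

% keys() and values() return cell arrays in the same (sorted key) order
chars  = keys(keyMap);
counts = cell2mat(values(keyMap));

% Relative frequency of each character
probs = counts ./ sum(counts);

% Print one line per distinct character
for k = 1:numel(chars)
    fprintf('%s = %d times (p = %.3f)\n', chars{k}, counts(k), probs(k));
end
```

Unlike the letter-only count in Answer 1, this tallies every distinct character in the input, including spaces and punctuation.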

Calculating the frequency of each character in a large text file using MATLAB
