Question

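I have a series of n = 400 sequences of varying length containing the letters ACGTE. For example, the probability of having C after A is:

[image: equation not preserved]

and it can be computed from the set of empirical sequences, thus:

[image: equation not preserved]

Assuming: [image: equation not preserved]

I then get a transition matrix:

[image: equation not preserved]

But I am interested in computing confidence intervals for Phat. Any ideas on how I could go about it?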

Answer 1

You can use bootstrapping to estimate confidence intervals. MATLAB provides the bootci function in the Statistics Toolbox. Here is an example:

%# generate a random cell array of 400 sequences of varying length
%# each containing indices from 1 to 5 corresponding to ACGTE
sequences = arrayfun(@(~) randi([1 5], [1 randi([500 1000])]), 1:400, ...
    'UniformOutput',false)';

%# compute transition matrix from all sequences
trans = countFcn(sequences);

%# number of bootstrap samples to draw
Nboot = 1000;

%# estimate 95% confidence interval using bootstrapping
ci = bootci(Nboot, {@countFcn, sequences}, 'alpha',0.05);
ci = permute(ci, [2 3 1]);

We get:

>> trans         %# 5x5 transition matrix: P_hat
trans =
      0.19747       0.2019      0.19849       0.2049      0.19724
      0.20068      0.19959      0.19811      0.20233      0.19928
      0.19841      0.19798       0.2021       0.2012      0.20031
      0.20077      0.19926      0.20084      0.19988      0.19926
      0.19895      0.19915      0.19963      0.20139      0.20088

and two other similar matrices containing the lower and upper bounds of the confidence intervals:

>> ci(:,:,1)     %# CI lower bound
>> ci(:,:,2)     %# CI upper bound

I used the following function to compute the transition matrix from a set of sequences:

function trans = countFcn(seqs)
    %# accumulate transition matrix from all sequences
    trans = zeros(5,5);
    for i=1:numel(seqs)
        trans = trans + sparse(seqs{i}(1:end-1), seqs{i}(2:end), 1, 5,5);
    end

    %# normalize into proper probabilities
    trans = bsxfun(@rdivide, trans, sum(trans,2));
end
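
Written as a formula, the function above pools the transition counts over all sequences and normalizes each row. The symbol N_ij (the number of times state i is immediately followed by state j, summed over all sequences) is notation introduced here for illustration and does not come from the original post:

```latex
\hat{P}_{ij} = \frac{N_{ij}}{\sum_{k=1}^{5} N_{ik}}, \qquad i, j \in \{1, \dots, 5\}
```

Each row of the resulting matrix therefore sums to 1, provided the corresponding state occurs at least once as the start of a transition.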

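As a bonus, we can use the bootstrp function to get the statistic computed from each bootstrap sample, which we use to show a histogram for each entry in the transition matrix: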

%# compute multiple transition matrices using bootstrapping
stat = bootstrp(Nboot, @countFcn, sequences);

%# display histogram for each entry in the transition matrix
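%# stat columns follow column-major order, subplot positions are row-major,
%# so transpose the index map to put entry (r,c) at grid position (r,c)
sub = reshape(1:5*5,5,5)';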
figure
for i=1:size(stat,2)
    subplot(5,5,sub(i))
    hist(stat(:,i))
end

[Figure: bootstrap histograms, one per entry of the transition matrix]

Answer 2

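I am not sure whether it is statistically sound, but here is a simple way to get indicative upper and lower bounds:

Cut the sample into a number of equal pieces (e.g. 1:40, 41:80, ..., 361:400) and compute the probability matrix for each subsample.

By looking at the distribution of the probabilities across the subsamples, you should get a pretty good idea of the variance.

The drawback of this approach is that you may not be able to actually compute an interval with a desired given probability. Its advantage is that it gives you a good feel for how your series behaves, and it may capture some information that other methods (e.g. bootstrapping) lose because of the assumptions they are based on.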
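
The subsampling idea can be sketched in MATLAB as follows. This is only a rough illustration under stated assumptions: the random data is a stand-in with the same shape as in Answer 1, the block count K = 10 and all variable names (K, edges, P, Pmean, Pstd, Pmin, Pmax) are hypothetical choices, and the transition counts are accumulated the same way as in Answer 1.

```matlab
%# hypothetical stand-in data: 400 sequences of indices 1..5 (ACGTE)
sequences = arrayfun(@(~) randi([1 5], [1 randi([500 1000])]), 1:400, ...
    'UniformOutput',false)';

%# assumption: K = 10 blocks gives 1:40, 41:80, ..., 361:400
K = 10;
n = numel(sequences);
edges = round(linspace(0, n, K+1));    %# block boundaries

%# one row-normalized transition matrix per block
P = zeros(5,5,K);
for k = 1:K
    block = sequences(edges(k)+1:edges(k+1));
    counts = zeros(5,5);
    for i = 1:numel(block)
        s = block{i};
        counts = counts + sparse(s(1:end-1), s(2:end), 1, 5,5);
    end
    P(:,:,k) = bsxfun(@rdivide, counts, sum(counts,2));
end

%# spread of each entry across the K subsamples
Pmean = mean(P, 3);
Pstd  = std(P, 0, 3);
Pmin  = min(P, [], 3);
Pmax  = max(P, [], 3);
```

Pmin and Pmax give the indicative lower and upper bounds, and Pstd the per-entry variability. As the answer notes, these are not intervals with a guaranteed coverage probability.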

Estimating confidence intervals of a Markov transition matrix
