Determine strings that satisfy a Hamming distance matrix

Time: 2014-12-04 19:16:50

Tags: string algorithm matrix hamming-distance

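I am trying to create a list of strings from a Hamming distance matrix. Each string must be 20 characters long and use a 4-letter alphabet (A, B, C, D). For example, say I have the following Hamming distance matrix: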

   S1 S2 S3
S1  0  5 12
S2  5  0 14
S3 12 14  0

From this matrix I need to create 3 strings, for example:

S1 = "ABBBBAAAAAAAAAABBBBB"
S2 = "BAAAAAAAAAAAAAABBBBB"
S3 = "CBBBABBBBBBBBBBBBBBB"
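
For reference, a minimal Python sketch (not part of the original question) that recomputes the distance matrix from the three strings above. The helper names `hamming` and `distance_matrix` are made up for illustration:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(strings):
    """Full symmetric matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in strings] for a in strings]


strings = [
    "ABBBBAAAAAAAAAABBBBB",  # S1
    "BAAAAAAAAAAAAAABBBBB",  # S2
    "CBBBABBBBBBBBBBBBBBB",  # S3
]
matrix = distance_matrix(strings)
for row in matrix:
    print(row)
```

A check like this is also a convenient way to validate the output of any algorithm that proposes strings for a larger matrix.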

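I created these strings by hand, but I need to do this for a Hamming distance matrix representing 100 strings, which is not practical to do manually. Can anyone suggest an algorithm that can do this?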

Thanks, Chris

1 Answer:

Answer 0 (score: 1)

That was a fun exercise. :-)

The following Octave script randomly generates n strings of length len. It then computes the Hamming distance between every pair of these strings.

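What needs to be done next is to compare the strings pairwise. If, for example, you search for [5 12 14], you will find that the table N contains strings that are 5 and 12 apart as well as strings that are 12 and 14 apart. The next challenge is of course to find the circuit in which the ones that are 5 and 12 apart can be put together with the ones that are 12 and 14 apart, in such a way that the circuit "closes".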
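
The idea of a circuit that closes can be sketched in a few lines. The following is a brute-force Python illustration, not the Octave script itself: it tests every triple of strings and keeps those whose three pairwise distances are exactly the requested ones. The function name `find_closed_triples` and the pool of strings (the three strings from the question plus random decoys) are assumptions made for the example.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_closed_triples(strings, target):
    """Index triples (i, j, k) whose three pairwise distances match `target`.

    The comparison ignores order, which is sufficient for three strings:
    any ordering of the three distances can be obtained by relabeling them.
    """
    want = sorted(target)
    hits = []
    for i, j, k in combinations(range(len(strings)), 3):
        got = sorted((
            hamming(strings[i], strings[j]),
            hamming(strings[i], strings[k]),
            hamming(strings[j], strings[k]),
        ))
        if got == want:
            hits.append((i, j, k))
    return hits


# Pool: the question's three strings followed by random decoys (assumed data).
rng = random.Random(0)
pool = [
    "ABBBBAAAAAAAAAABBBBB",
    "BAAAAAAAAAAAAAABBBBB",
    "CBBBABBBBBBBBBBBBBBB",
]
pool += ["".join(rng.choice("ABCD") for _ in range(20)) for _ in range(30)]

triples = find_closed_triples(pool, (5, 12, 14))
print(triples)
```

Checking all triples is cubic in the number of strings, so this is only meant to show what "closing" means, not to scale to large pools.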

% pdist and squareform come from the Octave statistics package
pkg load statistics

% We generate n strings of length len
n=50;
len=20;

% The alphabet has 4 letters (A, B, C, D), encoded as the integers 1..4
ncat=4;

% We want to find strings that realize the following Hamming distance matrix,
% given as the vector of its pairwise distances
search=[5 12 14];
%search=[10 12 14 14 14 16];
S=squareform(search);

% Note that each string is generated completely at random. If you need small distances it makes sense
% to introduce correlations across the strings
X=randi(ncat,n,len);

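% Calculate the Hamming distances. pdist returns the fraction of differing positions,
% so we scale by len and round to get exact integer counts
t=round(pdist(X,'hamming')*len);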

% The big matrix we have to find our little matrix S within
Y=squareform(t);

% All of the following could be replaced by something like submatrix(Y,S) if that existed.
% R(j,i) is true when row i of Y contains every distance that appears in row j of S
R=zeros(size(S,1),size(Y,1));
for j = 1:size(S,1)
    M=zeros(size(Y,1),size(S,2));
    for i = 1:size(Y,1)
        M(i,:)=ismember(S(j,:),Y(i,:));
    endfor
    R(j,:)=all(M,2)';
endfor

% If some row of S has no candidate row in Y (for example because the distance 5 doesn't occur at all),
% we can already drop out
if (any(!any(R,2)))
    printf("There are no matches\n");
    return
endif

[x,y]=find(R);

% A is a cell array: A{j} holds the indices of the rows/columns of Y that are candidates for row j of S
A = accumarray(x,y,[], @(v) {sort(v).'});

% We now build all combinations that pick one candidate per row of S; these index the possible submatrices
C = cell(1, numel(A));
[C{:}] = ndgrid( A{:} );

N = cell2mat( cellfun(@(v)v(:), C, 'UniformOutput',false) );
N = unique(sort(N,2), 'rows');

printf("Found %i potential matches (but contains duplicates)\n", size(N,1));

% Filter further: drop combinations that use the same string more than once
% (g is the number of occurrences of the most frequent index in each row)
[~,g]=mode(N,2);
h=g==1;
N=N(h,:);

printf("Found %i potential matches\n", size(N,1));

% Collect the pairwise distances of every remaining combination, one combination per row
M=zeros(size(N,1),numel(search));
for i = 1:size(N,1)
    f=N(i,:);
    M(i,:)=squareform(Y(f,f))(:)';
endfor

F=squareform(S)(:)';

% For now we ignore the order of the distances (permutations). With more than 3 distances in search
% (that is, more than 3 strings) you need to filter out the wrong permutations!
M = sort(M,2);
F = sort(F,2);

% Get the sorted search string out of the (large) table M
% We search for the ones that "close" the circuit
D=ismember(M,F,'rows');
mf=find(D);

if (!isempty(mf))
    matches=numel(mf);
    printf("Found %i matches\n", matches);  
    for i = 1:matches
        r=mf(i);
        printf("We return match %i (only check permutations now)\n", r);
        t=N(r,:)';
        str=X(t,:);
        check=squareform(pdist(str,'hamming')*len);
        strings=char(str+64)
        check
    endfor
else
    printf("There are no matches\n");
endif
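
The script compares sorted distance lists, which is why its comment warns about wrong permutations when more than three strings are involved. Below is a hedged Python sketch of that extra filtering step, using toy 4x4 matrices invented for the example. The hypothetical helper `match_ordering` tries every ordering of the candidate strings and returns one under which the candidate's distance matrix equals the target entry by entry.

```python
from itertools import permutations


def match_ordering(dist, target):
    """Return an ordering p with dist[p[a]][p[b]] == target[a][b] for all a < b, or None."""
    n = len(target)
    for p in permutations(range(n)):
        if all(dist[p[a]][p[b]] == target[a][b]
               for a in range(n) for b in range(a + 1, n)):
            return p
    return None


def condensed(m):
    """Sorted upper-triangle entries, i.e. what a sorted comparison sees."""
    n = len(m)
    return sorted(m[a][b] for a in range(n) for b in range(a + 1, n))


# Toy target: the two distance-1 pairs share string 0.
target = [[0, 1, 1, 2],
          [1, 0, 2, 2],
          [1, 2, 0, 2],
          [2, 2, 2, 0]]

# Same multiset of distances, but the two distance-1 pairs are disjoint.
impostor = [[0, 1, 2, 2],
            [1, 0, 2, 2],
            [2, 2, 0, 1],
            [2, 2, 1, 0]]

# The target with its strings listed in a different order.
q = (2, 0, 3, 1)
shuffled = [[target[q[a]][q[b]] for b in range(4)] for a in range(4)]

# The impostor passes the sorted comparison but has no valid ordering.
print(condensed(impostor) == condensed(target), match_ordering(impostor, target))
# The shuffled target passes both checks.
print(condensed(shuffled) == condensed(target), match_ordering(shuffled, target))
```

Trying all orderings costs n! checks, which is fine for a handful of strings but not for 100.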

The script generates strings such as the following:

ABAACCBCACABBABBAABA
ABACCCBCACBABAABACBA
CABBCBBBABCBBACAAACC
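
The script's comment about introducing correlations matters in practice: two independent, uniformly random strings over four letters differ in about 15 of 20 positions on average, so a pair only 5 apart rarely shows up by chance. One way to get correlated strings, sketched in Python under the assumption that deriving new strings from an existing one is acceptable, is to change a fixed number of positions. The `mutate` helper is hypothetical and not part of the script above.

```python
import random

ALPHABET = "ABCD"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(base, k, rng):
    """Return a copy of `base` changed at exactly k distinct positions.

    Each chosen position gets a different letter than before,
    so hamming(base, result) == k.
    """
    chars = list(base)
    for pos in rng.sample(range(len(base)), k):
        chars[pos] = rng.choice([c for c in ALPHABET if c != chars[pos]])
    return "".join(chars)


rng = random.Random(0)
base = "".join(rng.choice(ALPHABET) for _ in range(20))
near = mutate(base, 5, rng)
print(base)
print(near)
print(hamming(base, near))
```

By the triangle inequality, two strings derived from the same base with k1 and k2 changes are at most k1 + k2 apart, which keeps the whole pool close together and makes small target distances far more likely to occur.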