
Question

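This question is not only for MATLAB users: if you know an answer to the problem in PSEUDOCODE, feel free to leave your answer as well!

I have two tables, Ta and Tb, with different numbers of rows and different numbers of columns. All cells currently contain text, but in future they may also contain numbers.

I want to merge the contents of these tables under the following set of rules:

If Ta(i,j) is empty, take the value of Tb(i*,j*), and vice versa.
If both are available, take the value of Ta(i,j) (and optionally check whether they are the same).

The tricky part is that we have no unique row keys, only unique column keys. Note that I distinguish between i* and i above. The reason is that a row in Ta can sit at a different index than the corresponding row in Tb, and the same holds for the columns j* and j. The implication is:

We first need to identify which row of Ta corresponds to which row of Tb, and vice versa. We can do this by trying to cross-match on any of the columns the tables share. However, we may find no match (in which case we do not merge the row with another row).

Question

How can we merge the contents of these two tables in the most efficient way?

Here are some resources that explain the problem in more detail:

1. Example using MATLAB: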

Ta = cell2table({...
     'a1', 'b1', 'c1'; ...
     'a2', 'b2', 'c2'}, ...
      'VariableNames', {'A','B', 'C'})
Tb = cell2table({...
     'b2*', 'c2', 'd2'; ...
     'b3', 'c3', 'd3'; ...
     'b4', 'c4', 'd4'}, ...
      'VariableNames', {'B','C', 'D'})

The resulting table Tc should look like this:

Tc = cell2table({...
    'a1' 'b1' 'c1'   ''; ...
    'a2' 'b2' 'c2' 'd2'; ...
    ''   'b3' 'c3' 'd3'; ...
    ''   'b4' 'c4' 'd4'}, ...
     'VariableNames', {'A', 'B','C', 'D'})

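2. A possible first step

I tried the following:

Tc = outerjoin(Ta, Tb, 'MergeKeys', true)

This runs without problems, but it does not stack rows that look similar. For example, the command above yields: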

 A        B       C       D  
____    _____    ____    ____
''      'b2*'    'c2'    'd2'
''      'b3'     'c3'    'd3'
''      'b4'     'c4'    'd4'
'a1'    'b1'     'c1'    ''  
'a2'    'b2'     'c2'    ''

Here the rows

''      'b2*'    'c2'    'd2'
'a2'    'b2'     'c2'    ''

should be merged into one:

'a2'    'b2'     'c2'    'd2'

So do we need one more step to stack these two together?

3. Example of an obstacle

If we have something like:

Ta = 
     A        B       C       
    ____    _____    ____
    'a1'    'b1'     'c1' 
    'a2'    'b2'     'c2'

Tb = 
     A        B       C       
    ____    _____    ____
    'a1'    'b2'     'c3'

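Then the question is whether the row in Tb should be merged with row 1 or row 2 of Ta, whether all rows should be merged, or whether it should simply be placed as a separate row. Ideas on how to handle this kind of situation would also be welcome.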

Answer 1

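Here is a conceptual answer that may help you solve the problem:

1. Define a 'scoring function' that tells you, for each row in Tb, how good a match it is to a row in Ta.
2. Initialize Tc with Ta.
3. For each row in Ta, determine the best match in Tb. If the match quality is above your criterion, treat the best match as a successful match.
4. If a successful match is found, 'consume' it (use the information from Tb to enrich the corresponding row in Tc where needed).
5. Proceed until you reach the end of Ta. Whatever has not been consumed from Tb can now be 'appended' to Tc.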
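
The pairing part of these steps could be sketched roughly as follows, using the tables from the question. This is only an illustration, not a full merge: the names shared, minScore, consumed and pairs are made up for the example, the score is simply the fraction of shared columns holding identical text, and 0.25 is an arbitrary acceptance criterion.

```matlab
% Sketch only: greedy row pairing on the columns both tables share.
Ta = cell2table({'a1','b1','c1'; 'a2','b2','c2'}, ...
    'VariableNames', {'A','B','C'});
Tb = cell2table({'b2*','c2','d2'; 'b3','c3','d3'; 'b4','c4','d4'}, ...
    'VariableNames', {'B','C','D'});

shared = intersect(Ta.Properties.VariableNames, Tb.Properties.VariableNames);
minScore = 0.25;                  % assumed acceptance criterion
consumed = false(height(Tb), 1);  % Tb rows already paired with a Ta row
pairs = zeros(0, 2);              % each row: [row of Ta, row of Tb]

for i = 1:height(Ta)
    bestScore = -inf;
    bestRow = 0;
    for k = find(~consumed)'      % only Tb rows that are still available
        % score: fraction of shared columns with identical text
        s = mean(strcmp(Ta{i, shared}, Tb{k, shared}));
        if s > bestScore
            bestScore = s;
            bestRow = k;
        end
    end
    if bestRow > 0 && bestScore > minScore
        pairs(end+1, :) = [i, bestRow]; %#ok<AGROW>
        consumed(bestRow) = true; % 'consume' the matched Tb row
    end
end

leftover = Tb(~consumed, :);      % Tb rows to append to Tc afterwards
```

With the sample data, only row 2 of Ta reaches the criterion (it shares 'c2' with row 1 of Tb), so pairs ends up as [2 1] and the other two Tb rows are left over for appending. Filling the empty cells of Tc from the paired Tb rows is left out here.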

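Room for improvement:

Notes on match choice

Consume from Ta instead of Tb, or use a more sophisticated heuristic to determine the order of consumption (e.g. compute all 'distances' and optimize the matching according to a cost function).

Note that these improvements are only necessary if your basic solution produces a lot of false positives in the matching.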
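
If you do want to optimize all pairings at once, one option (my suggestion, not part of the basic recipe) is MATLAB's matchpairs, available from R2019a, which solves the linear assignment problem. Below is a rough sketch: the score matrix is hard-coded for the question's example, and the 0.4 penalty is an arbitrary assumption. As far as I understand the documentation, a row and a column are only paired when that is cheaper than leaving both unmatched, i.e. when the pair's cost is below 2*costUnmatched.

```matlab
% Sketch only: globally optimal pairing instead of greedy consumption.
% score(i,j) is an assumed match quality in [0,1] for row i of Ta
% against row j of Tb (higher is better).
score = [0.0 0.0 0.0; ...
         0.5 0.0 0.0];
cost = 1 - score;       % matchpairs minimizes the total cost
costUnmatched = 0.4;    % assumed penalty for each unpaired row or column
[pairs, unmatchedTa, unmatchedTb] = matchpairs(cost, costUnmatched);
```

Here pairing row 2 of Ta with row 1 of Tb costs 0.5, which is less than the 0.8 it would cost to leave both unmatched, while every other pairing costs 1.0 and should therefore be rejected.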

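Notes on match quality definition

I suggest you start very simple here. For example, if you have 4 fields, just count the number of fields that match, or check whether all non-empty fields match.

If you want to take it further, consider evaluating how far apart the values are (e.g. MSE) or how far apart the texts are (e.g. Levenshtein distance).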
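
To make both options concrete, here is a small sketch of a simple score and a fuzzy one for a pair of rows. The helper names textSimilarity and levenshtein are made up for this example (there is no such built-in in base MATLAB), and scaling the distance by the longer string's length is just one possible normalization.

```matlab
% Sketch only: two ways to score a pair of rows, each given as a
% 1-by-K cell array of char vectors.
rowA = {'b2',  'c2'};
rowB = {'b2*', 'c2'};
exactScore = mean(strcmp(rowA, rowB));                    % fraction of equal fields
fuzzyScore = mean(cellfun(@textSimilarity, rowA, rowB));  % partial credit for near matches

function s = textSimilarity(a, b)
% 1 for identical text, decreasing towards 0 as more edits are needed
s = 1 - levenshtein(a, b) / max([numel(a), numel(b), 1]);
end

function d = levenshtein(a, b)
% Classic dynamic-programming edit distance between two char vectors
D = zeros(numel(a) + 1, numel(b) + 1);
D(:, 1) = (0:numel(a))';   % cost of deleting every character of a
D(1, :) = 0:numel(b);      % cost of inserting every character of b
for i = 1:numel(a)
    for j = 1:numel(b)
        substitution = D(i, j) + (a(i) ~= b(j));
        D(i + 1, j + 1) = min([D(i, j + 1) + 1, D(i + 1, j) + 1, substitution]);
    end
end
d = D(end, end);
end
```

For these two rows the exact score is 0.5, while the fuzzy score is higher because 'b2' and 'b2*' differ by a single character.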

Answer 2

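Here is a function that attempts to do the job. You pass in the two tables, a threshold that decides whether two rows get merged, and a logical stating whether you want values taken from the first table when a merge conflict occurs. I did not prepare for edge cases, but see where it gets you: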

TkeepAll=mergeTables(Tb,Ta,1,true)
TmergeSome=mergeTables(Tb,Ta,0.25,true)
TmergeAll=mergeTables(Tb,Ta,-1,true)

Here is the function:

function Tmerged=mergeTables(Ta,Tb,threshold,preferA)
%% parameters
% Ta and Tb are the two tables to merge
% threshold=0.25; minimal ratio of identical values in rows for merge.
%   example: you have one row in table A with 3 values, but you only have two
%   values for the same columns in data B. if one of the values is identical
%   and one isn't, you have ratio of 1/2 aka 0.5, which passes a threshold of
%   0.25
% preferA=true; which to take when there is merge conflict
%% see how well rows fit to each other
% T1 is the table with fewer rows
if size(Ta,1)<=size(Tb,1)
    T1=Ta;
    T2=Tb;
    prefer1=preferA;
else
    T1=Tb;
    T2=Ta;
    prefer1=~preferA;
end
[commonVar1,commonVar2]=ismember(T1.Properties.VariableNames,...
    T2.Properties.VariableNames);
commonVar1=find(commonVar1);
commonVar2(commonVar2==0)=[];
% fit is an N-by-M matrix (N rows of T1, M rows of T2); fit(ii,jj) is the
% fraction of the common columns in which row ii of table 1 (shorter) and
% row jj of table 2 (longer) hold identical text
fit=zeros(size(T1,1),size(T2,1));
for ii=1:size(T1,1) %rows of T1
    for jj=1:size(T2,1) %rows of T2
        % strcmp compares column by column (text columns only)
        fit(ii,jj)=sum(strcmp(T1{ii,commonVar1},T2{jj,commonVar2}))/length(commonVar1);
    end
end
%% pair rows according to fit
% match has two columns: the first one has the T1 row number and the second
% one has the matching T2 row number
unpaired1=true(size(T1,1),1);
unpaired2=true(size(T2,1),1);
match=zeros(0,2); % empty, but match(:,1) and match(:,2) stay valid
maxv=max(fit,[],2);
[~,order]=sort(maxv,'descend');
for ii=order' % visit T1 rows from best to worst initial fit
    [maxv,maxi]=max(fit(ii,:)); % best T2 row still available (NaN is ignored)
    if maxv>threshold
        match(end+1,:)=[ii,maxi]; %#ok<AGROW>
        unpaired1(ii)=false;
        unpaired2(maxi)=false;
        fit(:,maxi)=nan; %exclude paired row from next pairing
    end
end

%% prepare new variables
% row order of the result: unpaired T1 rows, merged rows, unpaired T2 rows
rows1=[find(unpaired1);match(:,1)]; % T1 rows in result order
rows2=[match(:,2);find(unpaired2)]; % T2 rows in result order
Nrows=numel(rows1)+sum(unpaired2);
% first variables common to the two tables
namesCommon=T1.Properties.VariableNames(commonVar1);
dataCommon=cell(1,length(commonVar1));
for vari=1:length(commonVar1)
    if prefer1
        mergedData=T1{match(:,1),commonVar1(vari)};
    else
        mergedData=T2{match(:,2),commonVar2(vari)};
    end
    data1=T1{unpaired1,commonVar1(vari)};
    data2=T2{unpaired2,commonVar2(vari)};
    dataCommon{vari}=[data1;mergedData;data2];
end
% variables only in 1, padded with '' for the unpaired T2 rows
uncommonVar1=1:size(T1,2);
uncommonVar1(commonVar1)=[];
names1=T1.Properties.VariableNames(uncommonVar1);
dataOnly1=cell(1,length(uncommonVar1));
for vari=1:length(uncommonVar1)
    data1=T1{rows1,uncommonVar1(vari)}; % same row order as the common variables
    dataOnly1{vari}=[data1;repmat({''},Nrows-numel(rows1),1)];
end
% variables only in 2, padded with '' for the unpaired T1 rows
uncommonVar2=1:size(T2,2);
uncommonVar2(commonVar2)=[];
names2=T2.Properties.VariableNames(uncommonVar2);
dataOnly2=cell(1,length(uncommonVar2));
for vari=1:length(uncommonVar2)
    data2=T2{rows2,uncommonVar2(vari)}; % same row order as the common variables
    dataOnly2{vari}=[repmat({''},Nrows-numel(rows2),1);data2];
end
%% collect variables to a table, columns sorted by name
[names,idx]=sort([namesCommon,names1,names2]);
allData=[dataCommon,dataOnly1,dataOnly2];
Tmerged=table(allData{idx},'VariableNames',names);

Merge the contents of two tables (looking for MATLAB or pseudocode)
