Question

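In the context of Matlab tables, I am dipping my toes into the pool of Matlab categorical variables. I may actually have wandered through this territory in the past, but if so, only superficially.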

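These days I want to use Matlab code patterns to do what I would normally do in MS Access, e.g., various types of joins and filtering. Much of my data is categorical, and I have read up on the advantages of using categorical variables in tables. However, those mostly revolve around descriptiveness (over enumerated types) and memory efficiency. I have seen no mention of speed. Do categorical variables offer a speed advantage?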

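I am also wondering how advisable it is to use categorical variables when doing various types of joins. The categorical variables will reside in different tables, so for the variables involved in the SQL ON clause (what Matlab calls the keys parameter), it is not clear to me how equivalence of values is established.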

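From the lack of relevant Google hits, I seem to be in new territory, which is a scary thing for me. The lack of documented best practices, and the resulting need for trial and error and reverse engineering, would take more time than I can devote, so I would regretfully fall back to using strings.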

If anyone can point to online guidance, I would appreciate it.

Answer 1

Only a partial answer...

The following test shows that categorical data behaves reasonably when used as a join key:

BigList = {'dog' 'cat' 'mouse' 'horse' 'rat'}'
SmallList = BigList( 1 : end-2 )

Nrows = 20;

% Create tables for innerjoin using strings

tBig = table( ...
    (1:Nrows)' , ...
    BigList( ceil( length(BigList) * rand( Nrows , 1 ) ) ) , ...
    'VariableNames' , {'B_ID' 'Animal'} )

tSmall = table( ...
    (1:Nrows)' , ...
    SmallList( ceil( length(SmallList) * rand( Nrows , 1 ) ) ) , ...
    'VariableNames' , {'S_ID' 'Animal'} )

tBigSmall = innerjoin( tBig , tSmall , 'Keys','Animal' );
tBig = sortrows( tBig , {'Animal','B_ID'} );
tSmall = sortrows( tSmall, {'Animal','S_ID'} );
tBigSmall = sortrows( tBigSmall, {'Animal' 'B_ID' 'S_ID'} );

% Now innerjoin the same tables using categorized strings

tcBig = tBig;
tcBig.cAnimal = categorical( tcBig.Animal );
tcBig.Animal = [];

tcSmall = tSmall;
tcSmall.cAnimal = categorical( tcSmall.Animal );
tcSmall.Animal = [];

tcBigSmall = innerjoin( tcBig , tcSmall , 'Keys','cAnimal' );
tcBig = sortrows( tcBig , {'cAnimal','B_ID'} );
tcSmall = sortrows( tcSmall, {'cAnimal','S_ID'} );
tcBigSmall = sortrows( tcBigSmall, {'cAnimal' 'B_ID' 'S_ID'} );

% Check if the join results are the same

if isequal( tBigSmall.Animal , cellstr( tcBigSmall.cAnimal ) ) % also false (not an error) if row counts differ
    disp('categorical vs string key: inner joins MATCH.')
else
    disp('categorical vs string key: inner joins DO NOT MATCH.')
end % if
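
A possible explanation for why the keys line up, offered as a sketch rather than a documented guarantee about joins: for non-ordinal categorical arrays, == compares values by category name, so two arrays built independently in different tables can still be compared even when their category sets and internal codes differ. The variable names below are made up for illustration, and the setcats step is optional tidying, not something the test above needed.

```matlab
% Two categorical arrays built independently, so their category sets
% and internal integer codes differ. All names here are illustrative.
a = categorical( {'dog';'cat'} );          % categories: cat, dog
b = categorical( {'dog';'emu'} );          % categories: dog, emu

codesA = double( a );                      % internal codes of a
codesB = double( b );                      % internal codes of b
sameCode = codesA(1) == codesB(1);         % codes stored for 'dog' in a and in b
sameName = a(1) == b(1);                   % comparison by category name

fprintf( 'dog has the same code: %d, dog compares equal: %d\n' , sameCode , sameName )

% Optional: give both key variables one shared category set before joining.
allCats = union( categories(a) , categories(b) );
a = setcats( a , allCats );
b = setcats( b , allCats );
```

Sharing one category set mainly keeps the internal codes consistent across tables; whether it changes join behavior or speed is untested here.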

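So the only remaining question is speed. This is a general question, not only about joins, so I am not sure what would make a good test. There are many possibilities, e.g., number of table rows, number of categories, whether it is a join or a filter, etc.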
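
One way to start probing speed is a small timeit comparison, sketched below. The table sizes, the seed, and the variable names are arbitrary assumptions, and the result of a single configuration like this says little about other row counts or category counts, so treat it as a starting point rather than a benchmark.

```matlab
% Rough timing sketch: string keys vs categorical keys, for one join
% and one filter. Sizes and names are assumptions chosen for illustration.
rng( 0 );                                  % reproducible random draws
animals = {'dog';'cat';'mouse';'horse';'rat'};
nBig   = 5000;                             % rows in the larger table (arbitrary)
nSmall = 100;                              % rows in the smaller table (arbitrary)

% Tables keyed on cell arrays of character vectors
tBigStr = table( (1:nBig)' , animals( randi( numel(animals) , nBig , 1 ) ) , ...
    'VariableNames' , {'B_ID' 'Animal'} );
tSmallStr = table( (1:nSmall)' , animals( randi( 3 , nSmall , 1 ) ) , ...
    'VariableNames' , {'S_ID' 'Animal'} );

% Same tables with the key converted to categorical
tBigCat   = tBigStr;    tBigCat.Animal   = categorical( tBigCat.Animal );
tSmallCat = tSmallStr;  tSmallCat.Animal = categorical( tSmallCat.Animal );

% timeit runs each function handle several times and returns a typical time
joinStr = timeit( @() innerjoin( tBigStr , tSmallStr , 'Keys' , 'Animal' ) );
joinCat = timeit( @() innerjoin( tBigCat , tSmallCat , 'Keys' , 'Animal' ) );
filtStr = timeit( @() tBigStr( strcmp( tBigStr.Animal , 'dog' ) , : ) );
filtCat = timeit( @() tBigCat( tBigCat.Animal == 'dog' , : ) );

fprintf( 'innerjoin: string key %.4g s, categorical key %.4g s\n' , joinStr , joinCat )
fprintf( 'filter   : string key %.4g s, categorical key %.4g s\n' , filtStr , filtCat )
```

Varying nBig, nSmall, and the number of distinct animals would cover the other possibilities listed above.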

In any case, I believe the answers to both of these questions should be better documented.

Matlab categorical table variables: speed? Use in join keys?
