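I often have a table in which a combination of columns acts as a grouping key/common identifier, so that the same key can repeat across rows. A simple example: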
sampleId = [1 1 1 3 3 3]';
entity = [1 2 3 1 4 5]';
dataTable = table(sampleId, entity)
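Here, the entity observations can be thought of as attached to samples 1 and 3.
I find it very useful to compress this data so that each key appears in only one row. For example, I would like a final table that looks like this: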
----------------------------
| sampleId | entity |
----------------------------
| 1 | 3x1 table |
| 3 | 3x1 table |
----------------------------
The only way I know of doing this is with a for loop, as follows:
tempCell = cell(length(unique(dataTable.sampleId)), 1);
counter = 1;
nonGroupVariables = dataTable.Properties.VariableNames(...
~ismember(dataTable.Properties.VariableNames,'sampleId'));
for sampleId = unique(dataTable.sampleId)'
tempCell(counter) = {dataTable(dataTable.sampleId == sampleId, nonGroupVariables)};
counter = counter + 1;
end
newDataTable = table(unique(dataTable.sampleId), tempCell, 'VariableNames', ['sampleId', nonGroupVariables]);
Is there a better (more efficient/faster) way to achieve this, possibly using accumarray or some grouping function?
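Answer 0 (score: 1)
You can indeed use accumarray. I will distinguish two cases:
1. A single data column.
2. Several data columns.
Of course the second case includes the first, but it is easier to think about the first case and then move on to the second.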
sampleId = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity = [1 2 3 1 4 5]';
dataTable = table(sampleId, sampleId2, entity); %// example data
n = 2; %// number of grouping variables
[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = accumarray(v, dataTable{:,n+1}, [], @(x) {x}); %// cell array of vectors,
%// where each vector refers to one value of the grouping variable
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
cell2table(c, 'VariableNames', dataTable.Properties.VariableNames(n+1))];
%// create output table with correct variable names
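This produces a table with one row per unique combination of the grouping variables.
Note that curly-brace indexing into the table is used so that the code does not depend on the table's variable names. In the example above, the result is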
>> compressedTable
compressedTable =
sampleId sampleId2 entity
________ _________ ____________
1 1 [2x1 double]
1 2 [ 3]
3 2 [2x1 double]
3 3 [ 1]
>> compressedTable.entity{1}
ans =
2
1
>> compressedTable.entity{2}
ans =
3
>> compressedTable.entity{3}
ans =
4
5
>> compressedTable.entity{4}
ans =
1
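In the output above, the first cell holds 2 before 1 even though the rows appear as 1, 2 in dataTable. The accumarray documentation notes that the order of the values passed to the function is only preserved when the subscripts are sorted. If the original row order matters within each group, one possible workaround is to sort the group indices first. The following is a minimal sketch under that assumption; the names vSorted, order and vals are introduced here for illustration and are not part of the answer above.

```matlab
sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]';
dataTable = table(sampleId, sampleId2, entity);  % same example data as above
n = 2;                                           % number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');

% Sort the group indices; sort is stable, so rows that belong to the
% same group keep their original relative order.
[vSorted, order] = sort(v);
vals = dataTable{:,n+1};

% With sorted subscripts, each cell receives its values in row order.
c = accumarray(vSorted, vals(order), [], @(x) {x});
```

With this change, c{1} should contain the values in the order they appear in the table rather than in an arbitrary order.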
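In the second case (several data columns), you need to loop over the data columns, that is, the columns after the grouping ones. In the following I use arrayfun for the loop.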
sampleId = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity = [1 2 3 1 4 5]';
entity2 = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2); %// example data
n = 2; %// number of grouping variables
[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
c = arrayfun(@(k) accumarray(v, dataTable{:,k}, [], @(x) {x}), n+1:size(dataTable,2), ...
'uniformoutput', 0); %// cell array of cell arrays of vectors, one per data column k
ut = mat2cell(u, size(u,1), ones(1,n)); %// convert to cell array
compressedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
cell2table([c{:}], 'VariableNames', dataTable.Properties.VariableNames(n+1:end))];
%// create output table with correct variable names
The result is
compressedTable =
sampleId sampleId2 entity entity2
________ _________ ____________ ____________
1 1 [2x1 double] [2x1 double]
1 2 [ 3] [ 6]
3 2 [2x1 double] [2x1 double]
3 3 [ 1] [ 2]
>> compressedTable.entity{1}
ans =
2
1
>> compressedTable.entity2{1}
ans =
4
2
>> compressedTable.entity{2}
ans =
3
>> compressedTable.entity2{2}
ans =
6
>> compressedTable.entity{3}
ans =
4
5
>> compressedTable.entity2{3}
ans =
8
10
>> compressedTable.entity{4}
ans =
1
>> compressedTable.entity2{4}
ans =
2
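The question asked for a nested table per key, whereas the code above stores one cell of vectors per data column. A possible variation is to accumulate row indices instead of values and use them to slice the original table, which handles any number of data columns in a single accumarray call. This is only a sketch: the output variable name data and the helper name rowIdx are hypothetical choices made here.

```matlab
sampleId  = [1 1 1 3 3 3]';
sampleId2 = [1 1 2 3 2 2]';
entity    = [1 2 3 1 4 5]';
entity2   = entity*2;
dataTable = table(sampleId, sampleId2, entity, entity2);
n = 2;                                    % number of grouping variables

[u, ~, v] = unique(dataTable{:,1:n}, 'rows');
rowIdx = (1:height(dataTable))';          % one index per table row

% For each group, collect its row indices and cut out a sub-table that
% contains only the data columns. sort() restores the original row order.
c = accumarray(v, rowIdx, [], ...
    @(idx) {dataTable(sort(idx), (n+1):width(dataTable))});

% Key columns plus one cell column holding the nested tables.
% 'data' is a hypothetical name for the nested column.
ut = mat2cell(u, size(u,1), ones(1,n));
nestedTable = [table(ut{:}, 'VariableNames', dataTable.Properties.VariableNames(1:n)) ...
    table(c, 'VariableNames', {'data'})];
```

Each entry such as nestedTable.data{1} is then itself a table with the variables entity and entity2.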
Answer 1 (score: 0)
I found another way, using varfun:
compressedTable = varfun(@(x){x}, dataTable, 'GroupingVariables', 'sampleId');
compressedTable.GroupCount = [];
compressedTable.Properties.VariableNames = dataTable.Properties.VariableNames;
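On releases that provide findgroups and splitapply (R2015b or later), a similar result can probably be obtained without the renaming step. The following is a minimal sketch for the single-key example from the question, not a benchmarked replacement; the names g, keys and entityCells are made up here.

```matlab
sampleId  = [1 1 1 3 3 3]';
entity    = [1 2 3 1 4 5]';
dataTable = table(sampleId, entity);

% g assigns a group number to every row; keys lists the unique key values.
[g, keys] = findgroups(dataTable.sampleId);

% Wrap each group's values in a cell so that groups of different sizes
% can be stacked into a single cell column.
entityCells = splitapply(@(x) {x}, dataTable.entity, g);

compressedTable = table(keys, entityCells, ...
    'VariableNames', dataTable.Properties.VariableNames);
```

Here compressedTable has one row per unique sampleId, and compressedTable.entity{k} holds the entity values of the k-th key.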