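I have a 22124x1 cell array that contains repeated values. I want to know how many times each value is duplicated, and at which indices.
The first cell array contains these values. Datacell =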
'221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
Symbol cell:
'OR1D4 '
' OR1D5'
' UTP14C'
'GTF2H2 '
'ZNF324B '
' LOC644504'
'JMJD7 '
'ZNF324B '
'JMJD7-PLA2G4B'
' OR2A5 '
'OR1D4 '
For example, I would like the output for the first cell array to look like this:
ID duplicated index
'221853_s_at' 1 1
'221971_x_at' 4 {2:5,1}
I tried using unique, but it did not work. Any help would be highly appreciated.
Answer 0 (score: 1)
d = { '221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'};
[ids,ia,ic]=unique(d);
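ids holds the unique strings, ia holds the index in d of one instance of each unique string, and ic holds, for each entry of d, the index of the matching entry in ids.
[ncnt] = hist(ic,1:numel(ids)) - 1; % minus 1 since you only want duplicates
ncnt =
0 3 1 0 0 0
This gives the number of duplicates for each entry of ids:
ids =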
'221853_s_at'
'221971_x_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
ic is the lookup table for the indices. Use find or logical indexing on it.
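As a minimal sketch of that last step (not part of the original answer), the rows of d that hold a given ID can be recovered from ic with find or with a logical mask. The names k, idx, mask and matches are illustrative only.

```matlab
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
     '222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at'};
[ids, ~, ic] = unique(d);

% Pick one unique ID by its position in ids (k is an illustrative choice)
k = 2;                 % ids{2} is '221971_x_at' for this data
idx = find(ic == k);   % row numbers in d that hold ids{k}

% Logical indexing selects the same rows without calling find
mask = (ic == k);
matches = d(mask);
```

Looping k over 1:numel(ids) would give the index list for every ID in turn.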
Answer 1 (score: 1)
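Generating the indices in a visually pleasing form is not necessarily a trivial exercise. It becomes simpler if you assume d is sorted.
An alternative that uses accumarray: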
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
'222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at' ...
};
d = sort(d); % Sort to make indices easier
% Find unique strings and their locations
[uniquestrings, ~, stringbin] = unique(d);
counts = accumarray(stringbin, 1);
repeatidx = find(counts - 1 > 0);
repeatedstrings = uniquestrings(repeatidx);
repeatcounts = counts(repeatidx) - 1;
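% Find where each run of identical strings starts and ends
startidx = find([true; diff(stringbin) > 0]);
endidx = [startidx(2:end) - 1; numel(d)]; % the last run ends at the end of d
repeatstart = startidx(repeatidx);
repeatend = endidx(repeatidx); % stays in bounds even if the last string repeats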
% Generate table, requires R2013b or newer
t = table(repeatedstrings, repeatcounts, repeatstart, repeatend, ...
'VariableNames', {'ID', 'Duplicated', 'StringStart', 'StringEnd'} ...
);
Which yields:
t =
ID Duplicated StringStart StringEnd
_____________ __________ ___________ _________
'221971_x_at' 3 2 5
'222031_at' 1 6 7
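If sorting d is not acceptable, one possible variation (a sketch, not taken from either answer) is to let accumarray collect the row numbers of each ID into a cell, which gets close to the ID / count / index layout asked for in the question. The names counts and positions are illustrative. The sort inside the anonymous function is there because accumarray does not promise the order in which it passes values to the function.

```matlab
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
     '222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at'};
[ids, ~, ic] = unique(d);
ic = ic(:);                                  % make sure ic is a column vector

% Number of occurrences of each unique ID
counts = accumarray(ic, 1);

% Row numbers in d for each unique ID, gathered into one cell per ID
positions = accumarray(ic, (1:numel(d)).', [], @(x) {sort(x)});

% Print one line per ID: the ID, its occurrence count, and its row numbers
for k = 1:numel(ids)
    fprintf('%s  %d  [%s]\n', ids{k}, counts(k), num2str(positions{k}.'));
end
```

Here counts holds total occurrences, so subtract 1 to get the number of extra copies reported in the table above.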