Question

I'm new to MATLAB and would really appreciate some help.

I have

A size{6602,1} = 

[107;302;306;601;1014;1014;6016;6016;6016]
26x1 double
26x1 double
[1016;1019;6014]
69x1 double
[201;201;301;301;301;1012;1015]
1013
[301;406;507;508;1014;1016;5011;6014]
401

..... and so on

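I want to group the indices of the cells that have X elements in common (in my first iteration I'll start with 2 common elements, in the second iteration I'll use 8 common elements, and so on).

Here is an example of the data: (screenshot). The result I'm looking for is: 222,229, or the values in column 2 (164802771,167884647).

An example result for data where three rows share three values: 3 1 8 16 .. where 3 is the number of shared values and the rest are row numbers.

Thanks in advance.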

Answer 1

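The explanation you gave is far from clear. As I understand it, you want the indices of the elements of A that contain a certain number of repeated values (e.g. 2). So in your example, [201; 201; 301; 301; 301; 1012; 1015], the number of repetitions is 3: 1 for 201 and 2 for 301. Is that correct? If so, this is the code: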

numberOfCommonElements = 2;
index = cellfun(@(mat) sum(diff(sort(mat)) == 0) == numberOfCommonElements,A);
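As a rough illustration of what this expression computes, the sketch below applies it to the 6-cell sample used elsewhere in this thread. The variable name dupCount is an assumption introduced here, not part of the original answer. Note that this counts repeats inside a single cell, which is the interpretation this answer assumes; it does not compare cells with each other.

```matlab
% Sample data taken from the question (6 cells only)
A = {[107; 302; 306; 601; 1014; 1014; 6016; 6016; 6016]; ...
     [1016; 1019; 6014]; ...
     [201; 201; 301; 301; 301; 1012; 1015]; ...
     1013; ...
     [301; 406; 507; 508; 1014; 1016; 5011; 6014]; ...
     401};

% dupCount (hypothetical name): number of repeated entries within each cell.
% After sorting, every zero in diff marks one repetition of the previous value.
dupCount = cellfun(@(mat) sum(diff(sort(mat)) == 0), A);
% dupCount is [3; 0; 3; 0; 0; 0]

% Indices of the cells with exactly 3 repetitions
numberOfCommonElements = 3;
index = find(dupCount == numberOfCommonElements);
% index is [1; 3]
```

With numberOfCommonElements = 2, as in the answer, none of these six cells would be selected.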

Answer 2

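I have a vectorized solution, which may or may not be the most efficient one depending on how many unique elements your data contains and how often they appear in different rows. I'll use this simple 6-row example taken from your question: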

A = {[107; 302; 306; 601; 1014; 1014; 6016; 6016; 6016]; ...
     [1016; 1019; 6014]; ...
     [201; 201; 301; 301; 301; 1012; 1015]; ...
     1013; ...
     [301; 406; 507; 508; 1014; 1016; 5011; 6014]; ...
     401};

We first use unique to find the unique values that appear across all cells of A:

uniqueValues = unique(vertcat(A{:}));

Next, we can use cellfun and ismember to find, for each unique value, the rows in which it appears:

memberIndex = cellfun(@(c) {ismember(uniqueValues, c)}, A);
memberIndex = [memberIndex{:}];

memberIndex =

  19×6 logical array

   1   0   0   0   0   0
   0   0   1   0   0   0
   0   0   1   0   1   0
   1   0   0   0   0   0
   1   0   0   0   0   0
   0   0   0   0   0   1
   0   0   0   0   1   0
   0   0   0   0   1   0
   0   0   0   0   1   0
   1   0   0   0   0   0
   0   0   1   0   0   0
   0   0   0   1   0   0
   1   0   0   0   1   0
   0   0   1   0   0   0
   0   1   0   0   1   0
   0   1   0   0   0   0
   0   0   0   0   1   0
   0   1   0   0   1   0
   1   0   0   0   0   0

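In memberIndex, the number of rows is the number of unique values and the number of columns is the number of rows in A. The first unique value appears only in row 1, the second only in row 3, and so on. Note that many values appear in only one row of A, and since you are looking for values that are common to several rows, we can remove them from the analysis: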

repeatedIndex = (sum(memberIndex, 2) > 1);
uniqueValues = uniqueValues(repeatedIndex)

uniqueValues =

         301
        1014
        1016
        6014

memberIndex = memberIndex(repeatedIndex, :);

memberIndex =

  4×6 logical array

   0   0   1   0   1   0
   1   0   0   0   1   0
   0   1   0   0   1   0
   0   1   0   0   1   0

Note that in this example only 4 values appear in more than one row of A.
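To make the reduced matrix easier to read, the following sketch lists, for each repeated value, the rows of A that contain it. The name rowsPerValue is an assumption introduced for illustration; the rest follows the steps above.

```matlab
% Sample data from the 6-row example
A = {[107; 302; 306; 601; 1014; 1014; 6016; 6016; 6016]; ...
     [1016; 1019; 6014]; ...
     [201; 201; 301; 301; 301; 1012; 1015]; ...
     1013; ...
     [301; 406; 507; 508; 1014; 1016; 5011; 6014]; ...
     401};

% Membership matrix: one row per unique value, one column per row of A
uniqueValues = unique(vertcat(A{:}));
memberIndex = cellfun(@(c) {ismember(uniqueValues, c)}, A);
memberIndex = [memberIndex{:}];

% Keep only values that appear in more than one row of A
repeatedIndex = (sum(memberIndex, 2) > 1);
uniqueValues = uniqueValues(repeatedIndex);
memberIndex = memberIndex(repeatedIndex, :);

% rowsPerValue (hypothetical name): rows of A containing each repeated value
rowsPerValue = cell(numel(uniqueValues), 1);
for k = 1:numel(uniqueValues)
    rowsPerValue{k} = find(memberIndex(k, :));
    fprintf('%d appears in rows %s\n', uniqueValues(k), mat2str(rowsPerValue{k}));
end
% 301 appears in rows [3 5]
% 1014 appears in rows [1 5]
% 1016 appears in rows [2 5]
% 6014 appears in rows [2 5]
```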

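Now things get tricky. Notice that if we take any pair of columns from memberIndex above, multiply them element by element, and sum the result, we get the total number of elements the two rows of A have in common. This can be done for every pair of columns at once with a single matrix multiplication: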

counts = (memberIndex.')*memberIndex;
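As a quick sanity check of this identity, the sketch below compares the pairwise computation for columns 2 and 5 with the corresponding entry of the matrix product. The reduced 4-by-6 matrix is typed in by hand from the output shown earlier, and pairCount is a name assumed here for illustration.

```matlab
% Reduced membership matrix from the 6-row example
% (rows correspond to the values 301, 1014, 1016, 6014)
memberIndex = logical([0 0 1 0 1 0; ...
                       1 0 0 0 1 0; ...
                       0 1 0 0 1 0; ...
                       0 1 0 0 1 0]);

% Pairwise check for rows 2 and 5 of A: element-wise product, then sum
pairCount = sum(memberIndex(:, 2) .* memberIndex(:, 5));

% The same number read from the full matrix product
counts = double(memberIndex.') * double(memberIndex);

% pairCount and counts(2, 5) are both 2
assert(pairCount == counts(2, 5));
```

The diagonal of counts compares each row of A with itself, so it only reports how many repeated values that row contains and is not useful for pairing rows.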

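However, the example problem has only 6 rows, whereas your real data has more than 6,000. That would make counts a very large matrix, with a heavy memory and computation burden. We can mitigate this by converting memberIndex to a sparse matrix, but you may still run into trouble if your real data has relatively few unique values that each appear in many rows. Here is one approach, assuming a more favorable case: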

memberIndex = sparse(double(memberIndex));
[rowIndex, colIndex, count] = find(triu(memberIndex.'*memberIndex, 1));

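To explain: we perform the sparse matrix multiplication, extract the part above the main diagonal with triu (the matrix is symmetric, so we only need half of it, and we leave out the main diagonal), then find all nonzero entries together with their row and column indices. We can collect and sort the results you want as follows: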

result = sortrows([count rowIndex colIndex], 1);

result =

     1     1     5
     1     3     5
     2     2     5

This tells us that rows 1 and 5 share 1 value (1014), rows 3 and 5 share 1 value (301), and rows 2 and 5 share 2 values (1016 and 6014).
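The result matrix only reports how many values each pair shares. If the shared values themselves are needed, one possible sketch is to combine the two membership columns of a pair and index into uniqueValues; the name sharedValues is an assumption introduced here.

```matlab
% Sample data from the 6-row example
A = {[107; 302; 306; 601; 1014; 1014; 6016; 6016; 6016]; ...
     [1016; 1019; 6014]; ...
     [201; 201; 301; 301; 301; 1012; 1015]; ...
     1013; ...
     [301; 406; 507; 508; 1014; 1016; 5011; 6014]; ...
     401};

% Membership matrix: one row per unique value, one column per row of A
uniqueValues = unique(vertcat(A{:}));
memberIndex = cellfun(@(c) {ismember(uniqueValues, c)}, A);
memberIndex = [memberIndex{:}];

% sharedValues (hypothetical name): values present in both row 2 and row 5
sharedValues = uniqueValues(memberIndex(:, 2) & memberIndex(:, 5));
% sharedValues is [1016; 6014]

% Cross-check against a direct set intersection of the two cells
assert(isequal(sharedValues, intersect(A{2}, A{5})));
```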

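Some efficiency gains may be possible by carrying out the matrix multiplication differently, which you may need for your real data, but here is a condensed version of the whole computation described above: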

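% Unique values across all cells of A
uniqueValues = unique(vertcat(A{:}));
% Membership matrix: one row per unique value, one column per row of A
memberIndex = cellfun(@(c) {ismember(uniqueValues, c)}, A);
memberIndex = [memberIndex{:}];
% Keep only values found in more than one row of A, stored as sparse
memberIndex = sparse(double(memberIndex(sum(memberIndex, 2) > 1, :)));
% Pairwise shared-value counts, taken from above the main diagonal
[rowIndex, colIndex, count] = find(triu(memberIndex.'*memberIndex, 1));
% One line per pair: [count row1 row2], sorted by count
result = sortrows([count rowIndex colIndex], 1);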
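To get from result to the pairs asked about in the question, one option is to filter on the first column. The sketch below assumes a threshold variable named minCommon (not in the original answer) and uses the result matrix from the 6-row example, typed in by hand.

```matlab
% result from the 6-row example: columns are [count row1 row2]
result = [1 1 5; ...
          1 3 5; ...
          2 2 5];

% minCommon (hypothetical name): minimum number of shared values required
minCommon = 2;

% Keep pairs sharing at least minCommon values
selected = result(result(:, 1) >= minCommon, :);
% selected is [2 2 5], i.e. rows 2 and 5 share 2 values
```

Replacing >= with == would keep only pairs that share exactly minCommon values.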

Group indices with at least 2 common elements in a cell array of row vectors
