Question

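Problem: I have two large cell arrays of strings, A and B. I want to know the fastest way to determine which elements of A contain which elements of B. In particular, can it be done without loops?

Minimal example (my actual A and B contain 7,000,000 and 22,000 strings, respectively):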

A = {'one';
     'two';
     'three';
     'four'};
B = {'ee';
     'xx';
     'r'};

The desired output for the example is

C = [ 0 0 0 ;
      0 0 0 ;
      1 0 1 ;
      0 0 1 ];

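The rows and columns of C correspond to the elements of A and B, respectively. For my purposes a true/false answer is enough, but bonus points if C returns the first index at which the string from B occurs in the string from A, e.g.: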

C = [ 0 0 0 ;
      0 0 0 ;
      4 0 3 ;
      0 0 4 ];

What I've tried: A related post covers a similar problem, except that it looks for strings that exclude other strings, so regexp provides a nice solution there. I don't think that applies here. To be sure, a loop gets the job done, but it is far too slow:

for i = 1:length(A)
    for j = 1:length(B)
        C(i,j) = max([0, strfind(A{i}, B{j})]);
        disp(C(i,j));
    end
end

Or, essentially the same thing, but with cellfun:

AA = repmat(A, [1 length(B)]);
BB = repmat(B, [length(A) 1]);
C = reshape(cellfun(@(a,b) max([0, strfind(a,b)]), AA(:), BB(:)), [length(A), length(B)]);

Larger example: I tested the cellfun approach on some larger arrays (still smaller than what I need):

N = 10000;
M = 200;
A = cellstr(char(randi([97,122],[N,10]))); %// N random length 10 lowercase strings
B = cellstr(char(randi([97,122],[M,4])));  %// M random length 4 lowercase strings
tic;
AA = repmat(A, [1 length(B)]);
BB = repmat(B, [length(A) 1]);
C = reshape(cellfun(@(a,b) max([0, strfind(a,b)]), AA(:), BB(:)), [length(A), length(B)]);
toc

Elapsed time is 21.91 seconds.

Any ideas? Can regexp help? Am I stuck with the loop?

Answer 1

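In general, I would caution that your expected output matrix will be too large to fit in memory, so you need to rethink your approach.

If you had a smaller dataset, you could do it as follows: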
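To put a number on the memory concern, here is a back-of-the-envelope calculation for the sizes quoted in the question. It assumes a dense matrix and MATLAB's usual storage of 8 bytes per double and 1 byte per logical; the variable names are only illustrative.

```matlab
% Rough size of a dense C for 7,000,000 strings in A and 22,000 in B.
nA = 7e6;                        % number of strings in A
nB = 22e3;                       % number of strings in B
nElements = nA * nB;             % 1.54e11 entries

bytesDouble  = nElements * 8;    % index output stored as double
bytesLogical = nElements * 1;    % true/false output stored as logical

fprintf('double:  %.2f TB\n', bytesDouble  / 1e12);   % about 1.23 TB
fprintf('logical: %.0f GB\n', bytesLogical / 1e9);    % about 154 GB
```

Even the true/false version is far beyond the RAM of a typical workstation, which is why a dense C is not realistic at that scale.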

A = {'one';
     'two';
     'three';
     'four'};
B = {'ee';
     'xx';
     'r'};

%// generate indices
n = numel(A);
m = numel(B);
[xi,yi] = ndgrid(1:n,1:m);

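%// matching
%// note: regexp interprets the entries of B as regular expressions
%// 'once' returns only the first match per pair, so every non-empty
%// cell holds exactly one index
Ax = A(xi);
By = B(yi);
temp = regexp(Ax,By,'start','once');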

%// localize empty cell elements
%// cellfun+@isempty is quite fast
emptyElements = cellfun(@isempty, temp);

%// generate output
out = zeros(n,m);
out(~emptyElements) = [temp{:}];
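For larger inputs, one possible alternative (a sketch that is not part of the answer above and has not been timed at the question's scale) is to loop only over the shorter list B, let strfind scan the whole cell array A in each call, and keep the hits in a sparse matrix. The names rows, cols and vals are illustrative, and the sketch assumes that matches are rare enough for the sparse result to fit in memory.

```matlab
% Sketch: one strfind call per element of B, results stored sparsely.
A = {'one'; 'two'; 'three'; 'four'};
B = {'ee'; 'xx'; 'r'};

n = numel(A);
m = numel(B);
rows = cell(m,1);   % indices into A of the strings that contain B{j}
cols = cell(m,1);   % matching column index j, repeated once per hit
vals = cell(m,1);   % position of the first match for each hit

for j = 1:m
    hits  = strfind(A, B{j});                  % n-by-1 cell of match positions
    found = find(~cellfun('isempty', hits));   % rows of A with a match
    rows{j} = found;
    cols{j} = repmat(j, numel(found), 1);
    vals{j} = cellfun(@(h) h(1), hits(found)); % keep the first position only
end

% Assemble an n-by-m sparse matrix; zero means "no match".
C = sparse(vertcat(rows{:}), vertcat(cols{:}), vertcat(vals{:}), n, m);
disp(full(C))   % full() is only sensible for small examples like this one
```

Unlike regexp, strfind treats each entry of B as literal text, so special characters need no escaping. If only a true/false answer is needed, vals can be replaced by ones or by a sparse logical matrix.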

Batch strfind: finding many strings within many other strings
