Question

For example, suppose the variable strings is a cell array of strings, as follows:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};

I want to filter strings so that I end up with only those strings whose first and last characters match. IOW, the result of this operation should be

{'alpha' 'colic' 'druid' 'fluff'}

More generally, I want to filter a cell array of strings to remove all the strings that do not match a regular expression.

For the example above, I can get the desired result using the following logical array:

~~cellfun(@numel, regexp(strings, '^(.).*\1$'))

IOW,

>> strings(~~cellfun(@numel, regexp(strings, '^(.).*\1$')))
ans = 
    'alpha'    'colic'    'druid'    'fluff'

But ~~cellfun(@numel, regexp(strings, '^(.).*\1$')) is an incomprehensible monstrosity.

Is there a clearer way to filter a cell array so that only the elements matching a regular expression are kept?

EDIT: based on excaza's answer, I defined the following functions:

% grep.m
function filtered = grep(pattern, cellarray)
%GREP find matches to PATTERN in a cell array of strings.
%     GREP(PATTERN, CELLARRAY) returns a cell array
%     containing all the strings in CELLARRAY that match the
%     regular expression PATTERN.  CELLARRAY is expected to
%     be a cell array of strings.

    filtered = cellarray(matchq(cellarray, pattern));
end

% matchq.m
function yn = matchq(string, pattern)
%MATCHQ predicate stating whether STRING matches PATTERN.
%   If STRING is a single string, MATCHQ(STRING, PATTERN)
%   returns a logical value corresponding to whether or not
%   STRING matches pattern.  If STRING is a cell array of
%   strings, MATCHQ(STRING, PATTERN) returns a logical vector
%   whose i-th entry equals MATCHQ(STRING{i}, PATTERN).

    if ischar(string)
        yn = ~isempty(regexp(reshape(string, 1, []), pattern, 'match'));
    else
        assert(iscellstr(string));
        yn = cellfun(@(s) matchq(s, pattern), string);
    end
end

With these definitions,

>> grep('^(.).*\1$', strings)
ans = 
    'alpha'    'colic'    'druid'    'fluff'

FWIW, grep still "works" even if strings consists of character arrays of arbitrary shape:

>> grep('^(.).*\1$', {['aus';'tra';'lia'], ['basis']', ['ce';'lt';'ic'], ...
                      ['dia';'led'], 'early', ['foo';'lpr';'oof'], ...
                      ['gyp';'sum']})
ans = 
    [3x3 char]    [3x2 char]    [2x3 char]    [3x3 char]

>> cellfun(@(c) reshape(c', [], 1)', ans, 'UniformOutput', false)
ans = 
    'australia'    'celtic'    'dialed'    'foolproof'

Answer 1

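Per regexp's documentation, you can use the 'match' output keyword to request that only the text matching your expression be returned. regexp operates natively on cell arrays, so there is no need to call it through cellfun. However, for the sake of robustness, regexp has the (often annoying) behavior of returning a cell array of cells, where each cell holds the regexp output for the corresponding input cell.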

This leads to the following:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');

Which returns:

matches =

  1×7 cell array

    {1×1 cell}    {}    {1×1 cell}    {1×1 cell}    {}    {1×1 cell}    {}

To get rid of the empty cells, you can use a basic loop or cellfun (which is essentially equivalent to a loop):

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];

Which returns:

matches =

  1×4 cell array

    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}

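You need one more step to unwrap the nested cells. This can be done with a simple loop or with cellfun (which is essentially equivalent to a loop):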

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];
matches = cellfun(@(x) x{:}, matches, 'UniformOutput', false);

Which returns:

matches =

  1×4 cell array

    'alpha'    'colic'    'druid'    'fluff'
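
For comparison, the "simple loop" alternative mentioned above might look like the following. This is only a sketch: the variable name unwrapped is illustrative, and it keeps just the first match from each cell, which assumes at most one match per input string (true for this anchored pattern).

```matlab
strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');

% Walk the cell-of-cells, skipping empty entries and unwrapping the rest.
unwrapped = {};
for k = 1:numel(matches)
    if ~isempty(matches{k})
        % matches{k} is a 1x1 cell here; take its first (only) element.
        unwrapped{end+1} = matches{k}{1}; %#ok<AGROW>
    end
end
```

The loop does the same two steps as the cellfun version (drop the empty cells, then remove one level of nesting), just in a single pass.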

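If you can assume that each cell of the input cell (or string) array will contain at most one match, you can use the 'once' search option to eliminate one layer of nesting: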

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match', 'once');

Which returns:

matches =

  1×7 cell array

    'alpha'    ''    'colic'    'druid'    ''    'fluff'    ''

This can then be passed through the same mask as in the naive approach:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match', 'once');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];

Which returns:

matches =

  1×4 cell array

    'alpha'    'colic'    'druid'    'fluff'
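
Note that with 'match' the result holds the matched text rather than the original elements; the two coincide here only because the pattern is anchored with ^ and $. If the goal is to keep the original elements, one possible sketch is to build a logical mask from the start indices that regexp returns by default and index the input with it. The names pattern, keep and filtered are illustrative and do not come from the code above.

```matlab
strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
pattern = '^(.).*\1$';

% With 'once' and no output keyword, regexp returns the start index of the
% first match for each cell, or [] when there is no match.
keep = ~cellfun('isempty', regexp(strings, pattern, 'once'));

% Index the original cell array with the logical mask.
filtered = strings(keep);
```

Because the mask is applied to strings itself, this also behaves as a filter for unanchored patterns, where the matched text would be only a substring of each element.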

How do I filter out elements that do not match a regexp?
