How can I filter out elements that do not match a regexp?

Time: 2016-10-24 15:49:31

Tags: regex matlab

For example, suppose the variable strings is a cell array of strings, like this:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};

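I want to filter strings so that I end up with only those strings whose first and last characters are the same. IOW, the result of this operation should be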

{'alpha' 'colic' 'druid' 'fluff'}

More generally,

I want to filter a cell array of strings so as to remove all the strings that do not match a regular expression.

For the example above, I can get the desired result using the following logical array:

~~cellfun(@numel, regexp(strings, '^(.).*\1$'))

IOW,

>> strings(~~cellfun(@numel, regexp(strings, '^(.).*\1$')))
ans = 
    'alpha'    'colic'    'druid'    'fluff'

But ~~cellfun(@numel, regexp(strings, '^(.).*\1$')) is an incomprehensible monstrosity.

Is there a clearer way to filter a cell array so that only the entries matching a regular expression are kept?

EDIT: Based on excaza's answer, I defined the following functions:

% grep.m
function filtered = grep(pattern, cellarray)
%GREP find matches to PATTERN in a cell array of strings.
%     GREP(PATTERN, CELLARRAY) returns a cell array
%     containing all the strings in CELLARRAY that match the
%     regular expression PATTERN.  CELLARRAY is expected to
%     be a cell array of strings.

    filtered = cellarray(matchq(cellarray, pattern));
end

% matchq.m
function yn = matchq(string, pattern)
%MATCHQ predicate stating whether STRING matches PATTERN.
%   If STRING is a single string, MATCHQ(STRING, PATTERN)
%   returns a logical value corresponding to whether or not
%   STRING matches pattern.  If STRING is a cell array of
%   strings, MATCHQ(STRING, PATTERN) returns a logical vector
%   whose i-th entry equals MATCHQ(STRING{i}, PATTERN).

    if ischar(string)
        yn = ~isempty(regexp(reshape(string, 1, []), pattern, 'match'));
    else
        assert(iscellstr(string));
        yn = cellfun(@(s) matchq(s, pattern), string);
    end
end

With these definitions,

>> grep('^(.).*\1$', strings)
ans = 
    'alpha'    'colic'    'druid'    'fluff'

FWIW, grep still "works" even if strings consists of character arrays of arbitrary shape:

>> grep('^(.).*\1$', {['aus';'tra';'lia'], ['basis']', ['ce';'lt';'ic'], ...
                      ['dia';'led'], 'early', ['foo';'lpr';'oof'], ...
                      ['gyp';'sum']})
ans = 
    [3x3 char]    [3x2 char]    [2x3 char]    [3x3 char]

>> cellfun(@(c) reshape(c', [], 1)', ans, 'UniformOutput', false)
ans = 
    'australia'    'celtic'    'dialed'    'foolproof'

1 answer:

Answer 0 (score: 2)

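Per regexp's documentation, you can use the 'match' output keyword to request that only the text matching your expression be returned. regexp operates natively on cell arrays, so there is no need to call it through cellfun. However, in the interest of robustness, regexp has the (often annoying) behavior of returning a cell array of cells, where each inner cell holds the regexp output for the corresponding cell of the input.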

This leads to the following:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');

Which returns:

matches =

  1×7 cell array

    {1×1 cell}    {}    {1×1 cell}    {1×1 cell}    {}    {1×1 cell}    {}

To get rid of the empty cells, you can use a basic loop or cellfun (which is essentially equivalent to a loop):

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];

Which returns:

matches =

  1×4 cell array

    {1×1 cell}    {1×1 cell}    {1×1 cell}    {1×1 cell}
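For comparison, the "basic loop" alternative mentioned above might look like the following. This is only a minimal sketch of the same masking step; the loop variable k is an illustrative name, and the result is intended to be the same as the cellfun version.

```matlab
strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');

% Preallocate a logical mask with one entry per input cell.
emptymask = false(size(matches));
for k = 1:numel(matches)
    % matches{k} is an empty cell when strings{k} did not match.
    emptymask(k) = isempty(matches{k});
end

% Delete the empty cells; the remaining cells are still nested 1x1 cells.
matches(emptymask) = [];
```

The cellfun('isempty', ...) form is shorter, but the explicit loop makes it easier to see what the mask actually tests.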

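You need one more step to unwrap the nested cells (each remaining cell holds exactly one match here, because the pattern is anchored at both ends). This can be done with a simple loop or cellfun (essentially equivalent to a loop):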

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];
matches = cellfun(@(x) x{:}, matches, 'UniformOutput', false);

Which returns:

matches =

  1×4 cell array

    'alpha'    'colic'    'druid'    'fluff'
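As a possible shortcut, the masking and unwrapping steps can be folded into one horizontal concatenation of the inner cells. The sketch below relies on the assumption that MATLAB drops empty operands when concatenating cell arrays, so the empty cells disappear by themselves; the variable name flat is illustrative.

```matlab
strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match');

% matches{:} expands to a comma-separated list of the inner cells;
% concatenating them yields one flat cell array of the matched strings.
flat = [matches{:}];
```

Unlike the @(x) x{:} version, this form does not require exactly one match per input string: if a pattern matched several times in one string, every match would end up in flat.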

If you can assume that each cell of the input cell (or string) array should have only one match, you can use the 'once' search option to eliminate one layer of nesting:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match', 'once');

Which returns:

matches =

  1×7 cell array

    'alpha'    ''    'colic'    'druid'    ''    'fluff'    ''

This can then be passed through the same mask as in the naive approach:

strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
matches = regexp(strings, '^(.).*\1$', 'match', 'once');
emptymask = cellfun('isempty', matches);
matches(emptymask) = [];

Which returns:

matches =

  1×4 cell array

    'alpha'    'colic'    'druid'    'fluff'
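Note that the 'match' keyword returns the matched text, which equals the original string only because this pattern is anchored with ^ and $. For an unanchored pattern, one option is to build the mask from regexp and index the original cell array with it. The following is a minimal sketch; the pattern 'l[a-z]' and the variable names starts, keep, and filtered are illustrative assumptions, not part of the question.

```matlab
strings = {'alpha' 'basis' 'colic' 'druid' 'even' 'fluff' 'golf'};
pattern = 'l[a-z]';   % illustrative unanchored pattern: an "l" followed by a letter

% With no output keyword, regexp returns start indices. With 'once',
% each cell holds a scalar index, or [] when the string does not match.
starts = regexp(strings, pattern, 'once');

% Keep the original strings whose start index is non-empty.
keep = ~cellfun('isempty', starts);
filtered = strings(keep);
```

Here filtered contains the complete original strings, whereas the 'match' output would contain only the two-character substrings that matched.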