Question

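Is there a way to do an accent-insensitive search with grep, preferably while keeping the --color option? By this I mean that grep --secret-accent-insensitive-option aei would match àei, but also äēì and possibly æi.

I know I can use iconv -t ASCII//TRANSLIT to strip accents from a text, but I don't see how to use it for matching, since the text gets transformed (it would work for grep -c or -l).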

Answer 1

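You are looking for a whole bunch of POSIX regular expression equivalence classes:

14.3.6.2 Equivalence Class Operators ([= … =])

Regex recognizes equivalence class expressions inside lists. An equivalence class expression is a set of collating elements which all belong to the same equivalence class. You form an equivalence class expression by putting a collating element between an open-equivalence-class operator and a close-equivalence-class operator. [= represents the open-equivalence-class operator and =] represents the close-equivalence-class operator. For example, if a and A were an equivalence class, then both [[=a=]] and [[=A=]] would match both a and A. If the collating element in an equivalence class expression isn't part of an equivalence class, then the matcher considers the equivalence class expression to be a collating symbol.

I'm using carets on the line below each output to indicate what is actually colored. I also tweaked the test string to illustrate a point about case.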

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=]][[=e=]][[=i=]]'
I match àei but also äēì and possibly æi
        ^^^          ^^^

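This matches all words like aei. The fact that it did not match æi should be a reminder that you are beholden to whatever mappings exist in the regex library you are using (presumably gnulib, which is what I linked to and quoted from), though I think digraphs are likely beyond the reach of even the best equivalence class map.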

You shouldn't expect equivalence classes to be portable, since they are too arcane.
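To get a feel for what an equivalence class groups together, here is a minimal Python sketch. It is only an approximation: it uses Unicode canonical decomposition (NFD) and drops combining marks, which is not necessarily the mapping your locale or regex library uses. The helper names base_letter and same_class are made up for this illustration.

```python
import unicodedata

def base_letter(ch):
    """Return ch with its combining marks removed.

    This approximates "the letter an accented character is equivalent to".
    It relies on Unicode NFD decomposition, not on the locale's collation
    tables that POSIX equivalence classes actually use.
    """
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def same_class(a, b):
    """True if a and b reduce to the same base letter (case still matters)."""
    return base_letter(a) == base_letter(b)

if __name__ == "__main__":
    for ch in "àäēìæ":
        print(ch, "->", base_letter(ch))
```

Note that æ has no canonical decomposition, so it stays as æ here too, which mirrors the digraph caveat above.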

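Taking this a step further, if you want ONLY the accented characters, things get a lot more complicated. Here, I've changed the request for aei to [aei].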

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=][=e=][=i=]]'
I match àei but also äēì and possibly æi
^  ^    ^^^     ^    ^^^ ^       ^     ^

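Cleaning this up to avoid the unaccented matches requires both equivalence classes and look-ahead/look-behind. BRE (POSIX basic regex) and ERE (POSIX extended regex) support the former but lack the latter. Libpcre (the C library for perl-compatible regular expressions that grep -P and most other tools use) and perl support the latter but lack the former:

Try #1: grep with libpcre: fail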

$ echo "I match àei but also äēì and possibly æi" \
    | grep -P '[[=a=][=e=][=i=]](?<![aei])'
grep: POSIX collating elements are not supported

Try #2: perl itself: fail

$ echo "I match àei but also äēì and possibly æi" \
    | perl -ne 'print if /[[=a=][=e=][=i=]](?<![aei])/'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[[=a=][=e= <-- HERE ][=i=]](?<![aei])/ at -e line 1.

Try #3: python (which has its own PCRE-like implementation): (silent) fail

$ echo "I match àei but also äēì and possibly æi" \
    | python -c 'import re, sys; print(re.findall(r"[[=a=][=e=][=i=]]", sys.stdin.read()))'
[]

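Wow, a regex feature that PCRE, python, and even perl don't support! There aren't many of those. (Never mind that the complaint points at the second equivalence class; perl still complains with just /[[=a=]]/.) This is further evidence that equivalence classes are arcane.

In fact, it appears that no PCRE library is capable of equivalence classes; the section on equivalence classes at regular-expressions.info claims that only the regex libraries implementing the POSIX standard actually have this support. GNU grep comes closest since it can do BRE, ERE, and PCRE, but it cannot combine them.

So we'll do it in two parts.

Try #4: disgusting trickery: success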

$ echo "I match àei but also äēì and possibly æi" \
    | grep --color=always '[[=a=][=e=][=i=]]' \
    | perl -pne "s/\e\[[0-9;]*m\e\[K(?i)([aei])/\$1/g"
I match àei but also äēì and possibly æi
        ^            ^^^

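Code walk:

grep forces color on so that perl can key on the color codes to note the matches
${GREP_COLOR:-01;31} notes grep's color (defaulting to the same bright red); this applied to the earlier revision of the command
perl's s/// command matches the full color code and then the unaccented letters that we want to remove from the final results. It replaces all of that with the (uncolored) letters
Anything after the (?i) in the regex is case insensitive, since [[=i=]] matches I
perl -p prints each line of its input upon completion of its -e execution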
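For comparison, the "accented letters only" selection can be sketched in Python without equivalence classes at all. This is an illustration under the assumption that Unicode decomposition is an acceptable stand-in for the locale's mapping; the function names are hypothetical, and it prints a caret line instead of colors.

```python
import unicodedata

def is_accented_variant(ch, letters="aei"):
    """True if ch is one of `letters` carrying at least one combining mark."""
    d = unicodedata.normalize("NFD", ch)
    return (
        len(d) > 1
        and d[0].lower() in letters
        and all(unicodedata.combining(c) for c in d[1:])
    )

def caret_line(text, letters="aei"):
    """Build a line with ^ under every accented variant of `letters`.

    Assumes `text` uses precomposed characters (NFC), so that one
    character on screen is one character in the string.
    """
    marks = ("^" if is_accented_variant(ch, letters) else " " for ch in text)
    return "".join(marks).rstrip()

if __name__ == "__main__":
    line = "I match àei but also äēì and possibly æi"
    print(line)
    print(caret_line(line))
```

On the test string this marks à and äēì, the same positions as the carets under the output of try #4.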

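For more on BRE vs ERE vs PCRE and others, see this StackExchange regex post or the POSIX regexps at regular-expressions.info. For more on per-language differences (including libpcre vs python PCRE vs perl), look to the tools at regular-expressions.info.

2019 update: GNU grep now uses $GREP_COLORS, which can look like ms=1;41 and takes priority over the older $GREP_COLOR, which would look like 1;41. This is harder to extract (and it is hard to juggle between the two), so I have altered the perl code in try #4 to find any SGR color code rather than keying on only the color that grep would add. See revision 2 of this answer for the previous code.

I am currently unable to verify whether the BSD grep used by Apple Mac OS X supports POSIX regex equivalence classes.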

Answer 2

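I don't think this can be done in grep, unless you are willing to write a shell script that uses iconv and diff, which would look a little different from what you are requesting.

Here is something very close to your request via a quick perl script: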

#!/usr/bin/perl
# tgrep 0.1 Copyright 2014 by Adam Katz, GPL version 2 or later

use strict;
use warnings;
use open qw(:std :utf8);
use Text::Unidecode;

my $regex = shift or die "Missing pattern.\nUsage: tgrep PATTERN [FILE...]";

my $retval = 1;  # default to false (no hits)

while(<>) {
  my ($line, $hit) = ("", 0);
  while(/\G(\S*(?:\s+|$))/g){             # for each word (w/ trailing spaces)
    my $word = $1;
    if(unidecode($word) =~ qr/$regex/) {  # if there was a match
      $hit++;                             # note that fact
      $retval = 0;                        # final exit code will be 0 (true)
      $line .= "\e[1;31m$word\e[0;0m";    # display word in RED
    } else {
      $line .= $word;                     # display non-matching word normally
    }
  }
  print $line if $hit;                    # only display lines with matches
}

exit $retval;

Markdown won't let me make red text, so here is the output with the hits in quotes instead:

$ echo "match àei but also äēì and possibly æi" | tgrep aei
match "àei" but also "äēì" and possibly "æi"

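This highlights the matching words rather than the actual match, which would be very hard to do without building lots of character classes and/or writing a piecemeal regex parser. As a result, searching for the pattern "ae" rather than "aei" would produce the same result (in this case).

None of grep's flags are replicated in this toy example. I wanted to keep it simple.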
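One way around the word-level highlighting is to keep an index map while stripping accents, so that a match found in the stripped text can be mapped back to the original. The following Python sketch shows that idea under some assumptions: it uses unicodedata instead of Text::Unidecode (so æ is not transliterated), and fold_with_map, find_spans and highlight are names invented for this example.

```python
import re
import unicodedata

def fold_with_map(text):
    """Strip combining marks from text.

    Returns (folded, index_map), where index_map[i] is the index in `text`
    of the character that produced folded[i].
    """
    folded, index_map = [], []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(c):
                folded.append(c)
                index_map.append(i)
    return "".join(folded), index_map

def find_spans(pattern, text):
    """Return (start, end) spans in the ORIGINAL text for each match."""
    folded, index_map = fold_with_map(text)
    spans = []
    for m in re.finditer(pattern, folded):
        if m.start() == m.end():
            continue  # ignore empty matches
        start = index_map[m.start()]
        end = index_map[m.end() - 1] + 1
        # Include combining marks that trail the last matched character.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        spans.append((start, end))
    return spans

def highlight(pattern, text, on="\x1b[1;31m", off="\x1b[0m"):
    """Wrap every accent-insensitive match in the given on/off markers."""
    out, last = [], 0
    for start, end in find_spans(pattern, text):
        out.append(text[last:start])
        out.append(on + text[start:end] + off)
        last = end
    out.append(text[last:])
    return "".join(out)

if __name__ == "__main__":
    print(highlight("aei", "match àei but also äēì and possibly æi"))
```

The pattern is matched against the accent-stripped text, so it should itself be written without accents.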

Answer 3

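For me, calling grep from php (this can be adapted) was faster than the perl solutions.

Strtolower your query string and strip its accents, then replace some letters with a character class of their accented forms, and use grep -i for a case-insensitive search (watch out for quotes in $q):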

// Your query string
$q = 'Maxime Bernié';

$accents = array(
    'a' => '[aáàâäãå]',
    'e' => '[eéèêë]',
    'i' => '[iíìîï]',
    'o' => '[oóòôöõ]',
    'u' => '[uúùûü]',
    'c' => '[cç]',
    'n' => '[nñ]',
    'y' => '[yýÿ]'
);

$q = remove_accents(strtolower($q));
$qa = str_split($q);

foreach ($qa as $k => $v) {
    if (isset($accents[$v])) {
        $qa[$k] = $accents[$v];
    }
}

$q = implode('', $qa);

// system() already prints grep's output, so no echo is needed
system('grep -i "'.$q.'" file.txt');

function remove_accents($str, $charset='utf-8')
{
    $str = htmlentities($str, ENT_NOQUOTES, $charset);

    $str = preg_replace('#&([A-Za-z])(?:acute|cedil|caron|circ|grave|orn|ring|slash|th|tilde|uml);#', '\1', $str);
    $str = preg_replace('#&([A-Za-z]{2})(?:lig);#', '\1', $str);
    $str = preg_replace('#&[^;]+;#', '', $str);

    return $str;
}
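The same pattern-building idea can be adapted to other languages. Here is a rough Python sketch of it; it matches with Python's own re module rather than shelling out to grep, strips accents via Unicode decomposition instead of htmlentities, and the names ACCENTS, strip_accents and accent_pattern are only illustrative.

```python
import re
import unicodedata

# Same letter-to-class table as the PHP version (with plain y included).
ACCENTS = {
    "a": "[aáàâäãå]",
    "e": "[eéèêë]",
    "i": "[iíìîï]",
    "o": "[oóòôöõ]",
    "u": "[uúùûü]",
    "c": "[cç]",
    "n": "[nñ]",
    "y": "[yýÿ]",
}

def strip_accents(s):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def accent_pattern(query):
    """Turn a query into a regex that tolerates the accents listed above."""
    folded = strip_accents(query.lower())
    # Letters without a table entry are escaped and matched literally.
    return "".join(ACCENTS.get(ch, re.escape(ch)) for ch in folded)

if __name__ == "__main__":
    pattern = re.compile(accent_pattern("Maxime Bernié"), re.IGNORECASE)
    for line in ["Maxime BERNIE", "maxime bérnïe", "Maxime Bernard"]:
        print(line, "->", bool(pattern.search(line)))
```

Like the PHP original, this only covers the accented forms listed in the table, so anything outside it (ē, for example) will not match.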

How to do an accent-insensitive grep?
