Question

I'm fairly inexperienced with how grep behaves. I have a bunch of XML files that contain lines like this:

<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>

I want to get the part of the identifier after the slash, and I built a regular expression using RegexPal:

[a-z]\d{4}[a-z]*\.[a-z]*\d*

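It highlights everything I want. Perfect. But when I run grep on the same file, I get no results. As I said, I really don't know much about grep, so I tried all sorts of combinations: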

grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml

What am I doing wrong?

Answer 1

Your regular expression doesn't match the input. Let's break it down:

[a-z] matches g
\d{4} matches 1234
[a-z]* does not match the . that follows

Also, I believe grep and its relatives don't like the \d syntax. Try [0-9] or [[:digit:]] instead.
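
To illustrate that suggestion, here is a minimal sketch that runs the question's pattern once with [0-9] and once with the POSIX class [[:digit:]] (the class name only works inside a bracket expression, hence the double brackets). The file name sample.xml and the temporary directory are assumptions made up for the demo.

```shell
# Build a throwaway sample file (hypothetical name) from the question's two lines.
workdir=$(mktemp -d)
cat > "$workdir/sample.xml" <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# [0-9] instead of \d; -E enables the {4} interval, -o prints only the match.
with_range=$(grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$workdir/sample.xml")

# Same pattern written with the POSIX character class.
with_class=$(grep -E -o '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*' "$workdir/sample.xml")

printf '%s\n' "$with_range"
rm -rf "$workdir"
```

Both variables should end up holding the same two identifiers, one per line.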

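Finally, when working with regular expressions, prefer egrep to grep. I don't remember the exact details, but egrep supports more regular expression operators. Also, in many shells (including bash on OS X, which you mentioned), use single quotes rather than double quotes. If the pattern is left unquoted, the shell may expand * into a list of matching files in the current directory before grep sees it, and even inside double quotes other shell metacharacters such as $ and backticks are still expanded. Bash does not touch anything inside single quotes.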
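
A small sketch of the quoting point: it prints the argument a command would actually receive under each quoting style. The helper name show_arg is hypothetical, and a\d is just a shortened stand-in for the real pattern.

```shell
# Helper (hypothetical name) that prints the single argument it received,
# i.e. what a command such as grep would see after the shell is done with it.
show_arg() { printf '%s\n' "$1"; }

unquoted=$(show_arg a\d)    # unquoted: the shell removes the backslash
single=$(show_arg 'a\d')    # single quotes: passed through untouched
double=$(show_arg "a\d")    # double quotes: this backslash survives, but $ is still special

printf 'unquoted: %s\nsingle:   %s\ndouble:   %s\n' "$unquoted" "$single" "$double"
```

With no quotes at all, grep never even sees the backslash, which is one reason the first command in the question cannot work.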

Answer 2

By default, grep does not support \d. To match digits, use [0-9], or enable Perl-compatible regular expressions:

$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml

Or:

$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
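
Note that -P is a GNU grep option and may be missing from other grep implementations (for example the BSD grep shipped with macOS), so a script might want to probe for it first. The sketch below does that and falls back to the [0-9] form; extract_ids is a hypothetical helper name, and the sample file is created only for the demo.

```shell
# Print the identifiers found in the file given as $1, using \d when this
# grep accepts -P and falling back to [0-9] with -E otherwise.
extract_ids() {
  if printf '1\n' | grep -P '\d' >/dev/null 2>&1; then
    grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' "$1"
  else
    grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$1"
  fi
}

sample=$(mktemp)
printf '%s\n' '<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>' > "$sample"
ids=$(extract_ids "$sample")
printf '%s\n' "$ids"
rm -f "$sample"
```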

Answer 3

grep uses "basic" regular expressions (excerpt from the man page):

Basic vs Extended Regular Expressions
   In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
   special meaning; instead use the backslashed versions \?, \+, \{,  \|,  \(,  and
   \).

   Traditional  egrep  did  not  support  the  {  meta-character,  and  some  egrep
   implementations support \{ instead,  so  portable  scripts  should  avoid  {  in
   grep -E patterns and should use [{] to match a literal {.

   GNU  grep -E  attempts  to  support  traditional usage by assuming that { is not
   special if it would be the start of  an  invalid  interval  specification.   For
   example,  the  command  grep -E '{1'  searches  for  the two-character string {1
   instead of reporting a syntax error in the regular expression.   POSIX.2  allows
   this behavior as an extension, but portable scripts should avoid it.

Also, depending on which shell you run this in, the '*' character may get expanded.
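
To make the man page excerpt concrete, here is a rough sketch of the same pattern written in basic and in extended syntax, with \d already replaced by [0-9]. The sample line is copied from the question; the variable names are made up for the demo.

```shell
line='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>'

# Basic syntax (plain grep): the interval braces must be backslashed.
bre=$(printf '%s\n' "$line" | grep -o '[a-z][0-9]\{4\}[a-z]*\.[a-z]*[0-9]*')

# Extended syntax (grep -E): braces are special without backslashes.
ere=$(printf '%s\n' "$line" | grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*')

# In basic syntax an unescaped { is an ordinary character, so this pattern
# looks for a literal "{4}" in the input and finds nothing here.
# "|| true" only hides grep's non-zero exit status when there is no match.
literal=$(printf '%s\n' "$line" | grep -c '[a-z][0-9]{4}' || true)

printf 'BRE: %s\nERE: %s\nunescaped braces in BRE matched %s line(s)\n' "$bre" "$ere" "$literal"
```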

Answer 4

You can use the following commands:

$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>

# Use -P option to enable Perl style regex \d.
$ grep -P  '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>

# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345

# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
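
Since the question mentions a whole set of XML files, the same command can be pointed at several files at once. This is only a sketch: the file names a.xml and b.xml and the temporary directory are assumptions for the demo.

```shell
# Create two throwaway files (hypothetical names) in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' '<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>' > "$dir/a.xml"
printf '%s\n' '<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>' > "$dir/b.xml"

# -o prints only the matched text; -h drops the "file:" prefix that grep
# adds when it is given more than one file.
all_ids=$(grep -E -o -h '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$dir"/*.xml)

printf '%s\n' "$all_ids"
rm -rf "$dir"
```

Leave out -h if you want to know which file each identifier came from.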

Answer 5

Try this:

[A-Z]\d{5}[.][A-Z]{2}\d{6}

Answer 6

Try this expression in grep:

[a-z]\d{4}[a-z]*\.[a-z]*\d*

Answer 7

First of all, don't use regular expressions to parse XML/HTML. See this classic post: RegEx match open tags except XHTML self-contained tags
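
One possible parser-based alternative, sketched under the assumption that xmllint (from libxml2) is installed: read the element text with an XPath expression and strip the prefix with shell parameter expansion. The file name sample.xml and the records root element are invented for the demo, and the script simply skips the step when xmllint is not available.

```shell
# Throwaway sample document (hypothetical file name and root element).
dir=$(mktemp -d)
cat > "$dir/sample.xml" <<'EOF'
<records>
  <identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
</records>
EOF

id=
if command -v xmllint >/dev/null 2>&1; then
  # string(...) yields the text of the first matching element.
  value=$(xmllint --xpath 'string(//identifier)' "$dir/sample.xml")
  # Strip everything up to and including the last slash.
  id=${value##*/}
  printf '%s\n' "$id"
else
  echo 'xmllint not found; skipping the parser-based example' >&2
fi
rm -rf "$dir"
```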

Grep shows no results, but an online regex tester does
