I have very little experience with how grep behaves. I have a bunch of XML files that contain lines like this:
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
I want to extract the part of the identifier after the slash, so I built a regular expression using RegexPal:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
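It highlights everything I want. Perfect. But when I run grep on the same file, I get no results. As I said, I really don't know much about grep, so I tried all sorts of combinations: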
grep [a-z]\d{4}[a-z]*\.[a-z]*\d* test.xml
grep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
egrep "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
grep '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
grep -E '[a-z]\d{4}[a-z]*\.[a-z]*\d*' test.xml
What am I doing wrong?
Answer 0 (score: 8)
Your regular expression does not match the input. Let's break it down:
[a-z]
matches g
\d{4}
matches 1234
[a-z]*
does not match . (although * allows zero letters, so the \. that follows can still match the dot)
Also, I believe grep and its relatives do not support the \d syntax. Try [0-9] or [[:digit:]] instead.
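Finally, when working with regular expressions, prefer egrep (or grep -E) to plain grep. I don't remember the exact details, but egrep supports more regular expression operators. Also, in many shells (including bash on OS X, which you mentioned), quote the pattern, preferably with single quotes rather than double quotes. An unquoted * can be expanded to a list of files in the current directory before grep sees it, and inside double quotes the shell still processes $, backticks, and some backslash sequences. Bash does not touch anything inside single quotes.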
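To make the quoting point concrete, here is a minimal sketch, assuming bash or another POSIX-style shell with default globbing options and an empty scratch directory (the variable names are made up for the example). It prints the arguments a command would receive for the unquoted and the single-quoted form of the question's pattern.

```shell
# Work in an empty throwaway directory so no file name can match a glob.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Unquoted: the shell strips the backslashes (and would also try to
# expand the pattern as a file-name glob) before the command runs.
unquoted=$(printf '%s\n' [a-z]\d{4}[a-z]*\.[a-z]*\d*)

# Single-quoted: the pattern reaches the command unchanged.
quoted=$(printf '%s\n' '[a-z]\d{4}[a-z]*\.[a-z]*\d*')

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Under those assumptions the unquoted form arrives without its backslashes, so grep would never see \d or \. at all. If a file name in the directory happened to match the glob, the argument would be replaced by that file name instead.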
Answer 1 (score: 5)
grep does not support \d. To match digits, use [0-9], or enable Perl-compatible regular expressions with -P:
$ grep -P "[a-z]\d{4}[a-z]*\.[a-z]*\d*" test.xml
Or:
$ egrep "[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*" test.xml
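As a quick check of the [0-9] variant against both sample lines from the question, here is a small sketch. The scratch file and variable names are assumptions, and the -o option (print only the matched text) is added for illustration because the goal was to extract the part after the slash.

```shell
# Recreate the two sample lines from the question in a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
<identifier type="abc">abc:def.ghi/g5678m.ab678901</identifier>
EOF

# Extended regex with [0-9] in place of \d; -o prints only the match.
ids=$(grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' "$sample")

# Same pattern with the POSIX class, which must sit inside a bracket
# expression: [[:digit:]], not [:digit:].
ids_posix=$(grep -E -o '[a-z][[:digit:]]{4}[a-z]*\.[a-z]*[[:digit:]]*' "$sample")

printf '%s\n' "$ids"
```

Both variables should hold the same two identifiers, one per line: g1234.ab012345 and g5678m.ab678901.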
Answer 2 (score: 2)
grep uses "basic" regular expressions by default (excerpt from the man page):
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their
special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and
\).
Traditional egrep did not support the { meta-character, and some egrep
implementations support \{ instead, so portable scripts should avoid { in
grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not
special if it would be the start of an invalid interval specification. For
example, the command grep -E '{1' searches for the two-character string {1
instead of reporting a syntax error in the regular expression. POSIX.2 allows
this behavior as an extension, but portable scripts should avoid it.
Also, depending on the shell you run the command in, an unquoted '*' character may be expanded.
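The basic versus extended distinction in that excerpt can be sketched on one of the question's sample lines. This is an illustration only, with made-up variable names and [0-9] standing in for \d: in basic mode the interval is written \{4\}, while grep -E takes a plain {4}.

```shell
line='<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>'

# Basic regex (plain grep): the interval must be written \{4\}.
bre=$(printf '%s\n' "$line" | grep -o '[a-z][0-9]\{4\}[a-z]*\.[a-z]*[0-9]*')

# Extended regex (grep -E): the interval is written {4}.
ere=$(printf '%s\n' "$line" | grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*')

# In basic mode an unescaped {4} is literal text, so no line matches;
# "|| true" keeps grep's non-zero exit status from stopping a strict script.
literal=$(printf '%s\n' "$line" | grep -c '[a-z][0-9]{4}' || true)

printf 'BRE: %s\nERE: %s\nunescaped braces in BRE: %s matching lines\n' \
  "$bre" "$ere" "$literal"
```

The first two commands extract the same identifier, while the third counts zero matching lines because plain grep looks for the literal characters {4}.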
Answer 3 (score: 1)
You can use the following commands:
$ cat file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# Use -P option to enable Perl style regex \d.
$ grep -P '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
<identifier type="abc">abc:def.ghi/g1234.ab012345</identifier>
# to get only the part of the input that matches use -o option:
$ grep -P -o '[a-z]\d{4}[a-z]*\.[a-z]*\d*' file
g1234.ab012345
# You can use [0-9] in place of \d and use the -E option.
$ grep -E -o '[a-z][0-9]{4}[a-z]*\.[a-z]*[0-9]*' file
g1234.ab012345
$
Answer 4 (score: 0)
Try this:
[A-Z]\d{5}[.][A-Z]{2}\d{6}
Answer 5 (score: 0)
Try this expression in grep:
[a-z]\d{4}[a-z]*\.[a-z]*\d*
Answer 6 (score: -1)
First of all, do not use regular expressions to parse XML/HTML. See this classic post: RegEx match open tags except XHTML self-contained tags