Awk / gsub: replacing special characters and extracting strings

Date: 2017-08-25 17:24:36

Tags: bash perl awk sed

I have a file with many lines like the following:

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">XX:The quick brown fox jumped over the lazy  </a>  -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png"  alt="validate"> - user

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">YY:Jack and Jill went up the hill  </a>  -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png"  alt="validate"> - user

<li><img src="img/tt_potato-30x30.png" alt="ew_inactive"> <img src="img/in-event-40x40.png" alt="event"> - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html">ZZ: Mary had a little lamb  </a>  -<img src= "img/config-40x40.png" alt="config"><img src="img/validate-40x50.png"  alt="validate"> - user

I want to extract the following strings and discard everything else.

XX: The quick brown fox jumped over the lazy
YY: Jack and Jill went up the hill
ZZ: Mary had a little lamb

So far I have tried the following awk command, but it seems limited because XX would have to be replaced with YY and ZZ in turn.

awk '{gsub(/^.*XX:/,"XX:"); gsub(/[<\a>].*$/,"[</a>].");print}'

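Can anyone suggest another approach using standard Linux tools? Thanks.

3 answers:

Answer 0 (score: 1)

If your Input_file is the same as the sample shown, then the following may also help you.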

awk -F"\">|</a>" 'NF{print $4}'  Input_file

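Explanation: the field separator is set to either of two strings, "> or </a> (chosen to get what the OP needs :)). NF makes sure empty lines are skipped. With those two strings as field separators, the 4th field is the one the OP wants. The following prints every field with its number, which shows that the 4th column is the one to pick: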

awk -F"\">|</a>" '{for(i=1;i<=NF;i++){print i,$i}}'  Input_file
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive
2  <img src="img/in-event-40x40.png" alt="event
3  - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html
4 XX:The quick brown fox jumped over the lazy
5   -<img src= "img/config-40x40.png" alt="config
6 <img src="img/validate-40x50.png"  alt="validate
7  - user
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive
2  <img src="img/in-event-40x40.png" alt="event
3  - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html
4 YY:Jack and Jill went up the hill
5   -<img src= "img/config-40x40.png" alt="config
6 <img src="img/validate-40x50.png"  alt="validate
7  - user
1 <li><img src="img/tt_potato-30x30.png" alt="ew_inactive
2  <img src="img/in-event-40x40.png" alt="event
3  - dep[(0:0)(0:0)]ref[(3:0)(0:0)]srch[?] - <a href "tcc_1111.html
4 ZZ: Mary had a little lamb
5   -<img src= "img/config-40x40.png" alt="config
6 <img src="img/validate-40x50.png"  alt="validate
7  - user
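
The field number 4 above depends on how many "> sequences precede the link on each line. As a hedged variant, and only a sketch, the separator can instead be the anchor tags themselves, so that the link text is always field 2. The sample line is a shortened stand-in for the question's input, and the function name extract_link_text is made up for illustration. It assumes at most one link per line and an awk that treats a multi-character -F value as a regular expression.

```shell
#!/bin/sh
# Hypothetical one-line sample modeled on the question's input (markup shortened).
sample='<li><img src="img/a.png" alt="x"> - <a href "tcc_1111.html">XX:The quick brown fox  </a>  -<img src="img/b.png" alt="config"> - user'

# Split each line on the opening <a ...> tag or on the closing </a>;
# the link text is then field 2. NF>1 skips blank lines and lines
# that contain no anchor at all.
extract_link_text() {
    awk -F'<a[^>]*>|</a>' 'NF>1 {print $2}'
}

result=$(printf '%s\n' "$sample" | extract_link_text)
printf '%s\n' "$result"
```

Trailing spaces before </a> are kept in the output, just as they are with the field-4 command.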

I hope this helps.

Answer 1 (score: 0)

I think this perl one-liner will do (it looks like you are on Linux):

perl -lne 'print $1 if m{>((XX|YY|ZZ):[^<]*)}'

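Answer 2 (score: 0)

^.XX means "any character followed by XX at the start of a line"; it does not match XX in the middle of a line. [<\a>] means "any of the characters <, \, a, or >"; it does not match the string <\a>. Find a regular-expression tutorial...

Your question is unclear, but maybe this is what you are trying to do?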
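
The two points about anchors and bracket expressions can be checked with a small experiment. This is only an illustrative sketch: the input strings are invented, and the bracket expression is simplified to [<a>] (no backslash), because the handling of a backslash inside brackets can vary between tools.

```shell
#!/bin/sh
# ^.XX : exactly one character, then XX, anchored at the start of the line.
# Only the first input line qualifies; the second has XX in mid-line.
anchored=$(printf '%s\n' 'aXX: near the start' 'abc XX: mid-line' | awk '/^.XX/')

# [<a>] is a bracket expression: every single <, a, or > is replaced,
# including the letter a inside ordinary words.
per_char=$(printf '%s\n' 'data <a> end' | awk '{gsub(/[<a>]/, "_"); print}')

# <a> without brackets is a literal three-character string.
literal=$(printf '%s\n' 'data <a> end' | awk '{gsub(/<a>/, "_"); print}')

printf '%s\n' "$anchored" "$per_char" "$literal"
```

Comparing the second and third printed lines shows the difference between matching a set of characters and matching a string.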

$ awk '{sub(/<\/a>.*/,""); sub(/.*>/,"")} NF' file
XX:The quick brown fox jumped over the lazy
YY:Jack and Jill went up the hill
ZZ: Mary had a little lamb

Or, using GNU awk for the 3rd arg to match(), to print whatever is between <a...> and </a> (assuming there is one per line):

$ awk 'match($0,/.*<a[^>]*>(.*)<\/a>.*/,a){print a[1]}' file
XX:The quick brown fox jumped over the lazy
YY:Jack and Jill went up the hill
ZZ: Mary had a little lamb

With any sed, that would be:

$ sed -n 's/.*<a[^>]*>\(.*\)<\/a>.*/\1/p' file
XX:The quick brown fox jumped over the lazy
YY:Jack and Jill went up the hill
ZZ: Mary had a little lamb
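
The output above still differs slightly from the listing in the question, which shows one space after the colon (XX: The ...) and no trailing blanks. If that exact form matters, a second sed pass could normalize it. The following is a sketch under assumptions: the two input lines are shortened stand-ins for the real file, and the function name normalize is hypothetical.

```shell
#!/bin/sh
# Two sample lines modeled on the question's input (markup shortened for brevity).
input='<li>x - <a href "tcc_1111.html">XX:The quick brown fox jumped over the lazy  </a>  - user
<li>x - <a href "tcc_1111.html">ZZ: Mary had a little lamb  </a>  - user'

normalize() {
    # First sed: keep only the text between <a ...> and </a>.
    # Second sed: trim trailing whitespace, then force exactly one
    # space after the first colon on the line.
    sed -n 's/.*<a[^>]*>\(.*\)<\/a>.*/\1/p' |
        sed 's/[[:space:]]*$//; s/^\([^:]*\):[[:space:]]*/\1: /'
}

result=$(printf '%s\n' "$input" | normalize)
printf '%s\n' "$result"
```

Only basic regular expressions and the [[:space:]] character class are used here, so the pipeline should not depend on GNU-specific sed features.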