Regular expression does not match "<p> </p>" correctly

Date: 2019-05-22 21:53:31

Tags: regex shell

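Hi all,

I'm having some trouble using a regular expression to extract the text inside a <p> </p> pair from HTML.

I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I can't make the match stop at the next </p>.

The command I'm using is: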

egrep "unsung hero.*</p>" test

The web page being tested is:

<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n    <p>(SOUNDBITE OF MUSIC)</p>\n    <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n    <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n    <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n    <p>(SOUNDBITE OF MUSIC)</p>\n\n    <p class="disclaimer">Copyright &copy; 2019 NPR.  All rights reserved.  Visit our website <a href="https://www.npr.org/about-npr/179876898/terms-of-use">terms of use</a> and <a href="https://www.npr.org/about-npr/179881519/rights-and-permissions-information">permissions</a> pages at <a href="https://www.npr.org">www.npr.org</a> for further information.</p>\n\n    <p class="disclaimer">NPR transcripts are created on a rush deadline by <a href="http://www.verb8tm.com/">Verb8tm, Inc.</a>, an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. 
The authoritative record of NPR&rsquo;s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n      <ul>\n

I expect the match to stop at

</p>\n    <p>If you like

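but the actual match extends far beyond that.

I suspect something is wrong with my regular expression, but I don't know how to fix it. Any help would be appreciated.

Thanks!

20190523: Thanks for your suggestions.

I tried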

egrep "unsung hero.*?</p>" test

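but it didn't give me the result I wanted. [Screenshot: test result of .*?]

Lion, that looks like a useful expression and I'd like to understand it. Could you explain it?

Another test I ran, using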

[^<]*

actually gives the expected result.

1 Answer:

Answer 0 (score: 3)

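With .*, the match is greedy: it matches the longest possible substring (in your case, everything up to the last </p> on the line).

What you actually want is a non-greedy match with .*?

Your specific command would most likely look like this: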

grep -P -o "unsung hero.*?</p>" test
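To make the difference concrete, here is a minimal sketch that runs both patterns against a shortened, made-up stand-in for the test file (the sample variable is an assumption, not the real page). It assumes a GNU grep built with PCRE support, since -P is not available everywhere.

```shell
# Abbreviated one-line stand-in for the "test" file. The \n sequences are
# literal backslash-n text, as in the question, so grep sees a single line.
sample='<p>VEDANTAM: Our team includes Renee Cohen.</p>\n    <p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\n    <p>If you like this episode, please share it.</p>\n    <p>(SOUNDBITE OF MUSIC)</p>'

# Greedy: .* runs to the LAST </p> on the line.
greedy=$(printf '%s\n' "$sample" | grep -oE 'unsung hero.*</p>')

# Lazy (PCRE only): .*? stops at the FIRST </p> after "unsung hero".
lazy=$(printf '%s\n' "$sample" | grep -oP 'unsung hero.*?</p>')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
# lazy prints: unsung hero this week is Alexander Diaz. Thanks, Batman.</p>
```

The greedy result keeps going through the following paragraphs, which is the behaviour described in the question.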

Another solution would be to extend the regex to the end of the string/web page and then use groups to select the desired substring.
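One possible reading of the group-based idea, sketched with sed rather than grep (grep cannot print a single capture group). The sample variable is a made-up, shortened stand-in for the page, and the pattern swaps in [^<]* for the lazy quantifier because sed has no non-greedy matching; treat this as an illustration, not the exact command the answer had in mind.

```shell
# Abbreviated one-line stand-in for the "test" file (literal \n text inside).
sample='<p>VEDANTAM: Our team includes Renee Cohen.</p>\n    <p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\n    <p>If you like this episode, please share it.</p>'

# Match the whole line from ^ to $, capture only the wanted text in group 1,
# and replace the line with that group. -n plus the p flag prints matches only.
# "|" is used as the s delimiter so the "/" in </p> needs no escaping.
selected=$(printf '%s\n' "$sample" | sed -nE 's|^.*(unsung hero[^<]*)</p>.*$|\1|p')

printf '%s\n' "$selected"
# prints: unsung hero this week is Alexander Diaz. Thanks, Batman.
```

Because the closing tag sits outside the parentheses, the group also drops </p> from the output.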

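Update

As Charles Duffy correctly pointed out, this does not work with standard (POSIX ERE) syntax. That is why the command above uses the -P flag to specify a Perl regular expression.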
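A quick way to check how a given grep handles .*? under -E is to compare it with plain .* on the same input. POSIX ERE does not define lazy quantifiers, so the outcome depends on the implementation; with GNU grep the two results are expected to be identical, which matches what the question reports. The sample variable below is a made-up, shortened stand-in for the page.

```shell
# Abbreviated one-line stand-in for the "test" file (literal \n text inside).
sample='<p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\n    <p>If you like this episode, please share it.</p>'

# Same input, ERE mode, with and without the trailing "?".
ere_lazy=$(printf '%s\n' "$sample" | grep -oE 'unsung hero.*?</p>')
ere_greedy=$(printf '%s\n' "$sample" | grep -oE 'unsung hero.*</p>')

if [ "$ere_lazy" = "$ere_greedy" ]; then
    echo 'this grep treats .*? like .* under -E'
else
    echo 'this grep applies a lazy match under -E'
fi
```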

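If your system or application does not support Perl regular expressions, and you are fine with matching up to the first < (rather than up to the first </p>), then matching every character except < is the way to go.

In that case, the complete command would look like this: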

grep -o "unsung hero[^<]*" test
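The negated character class needs nothing beyond -o and ERE/BRE support, so it should work where -P does not. The sketch below uses made-up sample strings (sample and linked are assumptions) and also shows the trade-off mentioned above: the match ends at the first <, whatever tag it belongs to.

```shell
# Abbreviated one-line stand-in for the "test" file (literal \n text inside).
sample='<p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\n    <p>If you like this episode, please share it.</p>'

# [^<]* cannot cross a "<", so the match stops right before the first tag.
text=$(printf '%s\n' "$sample" | grep -oE 'unsung hero[^<]*')

# Appending </p> keeps the closing tag in the output.
with_tag=$(printf '%s\n' "$sample" | grep -oE 'unsung hero[^<]*</p>')

# Caveat: an inline tag inside the paragraph cuts the match short.
linked='<p>Our unsung hero is <a href="#">Alexander Diaz</a>.</p>'
truncated=$(printf '%s\n' "$linked" | grep -oE 'unsung hero[^<]*')

printf '%s\n' "$text" "$with_tag" "$truncated"
```

For the transcript in the question this is not a problem, because the paragraph of interest contains no inline tags.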

Thanks to Charles for pointing this out in the comments.