Question

I have a source file named source.html or source.txt, and I want to use a regular expression to turn this:

<OPTION value=5>&nbsp;&nbsp;5 - Course Alpha (3)</OPTION> <OPTION value=6>&nbsp;&nbsp;6 - Course Beta (3)</OPTION>

into this:

5 - Course Alpha (3)
6 - Course Beta (3)

What I mean is that I need to match the pattern:

<OPTION v

then find the first number after it, and capture everything from that number up to:

</OPTION>

How can I do this with a regex in Perl?

PS: It should read the input from a file and write the output to a file.

Answer 1

You don't want to use a regex; you want to use an HTML parser. Here is a good article on the subject that explains why regexes are fragile and how to use HTML::TreeBuilder.

There is also a small pile of similar questions and answers about ways to extract data from HTML documents.

Answer 2

perl -lwe '$_="<OPTION value=5>&nbsp;&nbsp;5 - Course Alpha (3)</OPTION> <OPTION value=6>&nbsp;&nbsp;6 - Course Beta (3)</OPTION>"; s/\&nbsp;//g; print $1 while /<OPTION [^>]*>([^<]+)/g'
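
Since the question asks for file input and output, here is a minimal sketch of the same regex as a script rather than a one-liner. The sub names extract_options and convert_file are made up for illustration, and the \b and /i additions to the pattern are my own assumption (they tolerate lowercase <option> tags); the rest follows the one-liner above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the visible label of every <OPTION> element found in $html.
# (extract_options is a hypothetical helper name, not from the question.)
sub extract_options {
    my ($html) = @_;
    $html =~ s/&nbsp;//g;    # drop the non-breaking-space entities
    my @labels;
    # Capture everything between the end of the opening tag and the next "<".
    while ( $html =~ /<OPTION\b[^>]*>([^<]+)/gi ) {
        push @labels, $1;
    }
    return @labels;
}

# Read $in_path, extract the labels, and write them one per line to $out_path.
sub convert_file {
    my ( $in_path, $out_path ) = @_;

    open my $in, '<', $in_path or die "Cannot read $in_path: $!";
    my $html = do { local $/; <$in> };    # slurp the whole file
    close $in;

    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    print {$out} "$_\n" for extract_options($html);
    close $out or die "Cannot close $out_path: $!";
}

# Run only when both file names are given on the command line.
convert_file(@ARGV) if @ARGV == 2;
```

Saved as, say, options.pl, it could be run as: perl options.pl source.html output.txt (both file names are placeholders).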

Answer 3

How about this?

/<OPTION v.*?>.*?(\d.+?)<\/OPTION>/

http://regexr.com?2thm8

There, you will find your string in the first capture group.
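
To collect that capture group for every option at once, the pattern can be applied with the /g modifier in list context. This is only a sketch: the sub name capture_labels is hypothetical, and the input string is the sample from the question.

```perl
use strict;
use warnings;

# Return the first capture group of every match in $html.
# (capture_labels is a hypothetical name used for illustration.)
sub capture_labels {
    my ($html) = @_;
    # In list context, a /g match returns all captured groups, one per match.
    my @labels = $html =~ /<OPTION v.*?>.*?(\d.+?)<\/OPTION>/g;
    return @labels;
}

my $sample = '<OPTION value=5>&nbsp;&nbsp;5 - Course Alpha (3)</OPTION> '
           . '<OPTION value=6>&nbsp;&nbsp;6 - Course Beta (3)</OPTION>';

print "$_\n" for capture_labels($sample);
# prints:
# 5 - Course Alpha (3)
# 6 - Course Beta (3)
```

Note that "." does not match a newline by default, so an option whose text spans several lines would need the /s modifier.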

Perl regex pattern matching
