Question

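I need to extract authors from text using a regular expression. I also need the index of each tag and each author. I tried a few parsers, but none of them preserved the indices correctly, so the only solution seems to be a regex. I have the following regex, and the problem is in the "[^]" part. How can I fix this regex: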

<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>

so that it extracts the author from the following text:

<post author="luckylindyslocale" datetime="2012-03-03T04:52:00" id="p7">
<img src="http://img.photobucket.com/albums/v303/lucky196/siggies/ls1.png"/>

Grams thank you, for this wonderful tag and starting this thread. I needed something to encourage me to start making some new tags.

<img src="http://img.photobucket.com/albums/v303/lucky196/holidays/stpatlucky.jpg"/>
Cruelty is one fashion statement we can all do without. ~Rue McClanahan
</post>

Answer 1

Why can't the regex
<post\\s*author=\"([^\"]+)\"[^>]+>[^</post>]*</post>
extract the author from the text above?

Because

[^</post>]*

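is a negated character class: it matches any character except <, /, p, o, s, t and >, zero or more times. It does not mean "anything up to the string </post>".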
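
To see this in isolation, here is a minimal sketch in Java (the class and method names are made up for illustration). It applies only the negated class to a piece of input and returns how far it gets before hitting one of the excluded characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {
    // Returns the prefix of the input consumed by the negated class [^</post>]*.
    public static String negatedClassPrefix(String input) {
        Matcher m = Pattern.compile("[^</post>]*").matcher(input);
        // lookingAt() anchors at the start; '*' allows an empty match.
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Stops at the 's' in "Grams", because 's' is one of the excluded characters.
        System.out.println(negatedClassPrefix("Grams thank you")); // prints "Gram"
        // Consumes the newline, then stops at '<' of the <img> tag.
        System.out.println(negatedClassPrefix("\n<img src=\"x\"/>").length()); // prints 1
    }
}
```

In the sample post, the body contains an <img> tag and plenty of letters such as p, o, s and t, so the class stops long before the closing tag and the literal </post> that follows it in the pattern cannot match.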

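That is not the case in your text: the body of the post contains those characters. As for how to fix it, consider using the following regex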

<post\s*author=\"([^\"]+?)\"[^>]+>(.|\s)*?<\/post>
// obviously, escape appropriate characters in Java String literals

with the multiline flag.
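
Since the question also asks for the indices, here is a rough Java sketch of how the match offsets could be collected. It is a variation under stated assumptions, not the exact pattern above: in Java, Pattern.MULTILINE only changes how ^ and $ behave, so the sketch uses Pattern.DOTALL with a reluctant .*? for the body instead of (.|\s)*?, which can be slow or overflow the stack on long inputs. It also relaxes [^>]+ to [^>]* so a tag with no further attributes still matches. The class, the AuthorHit holder and the method name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostAuthorExtractor {
    // Hypothetical holder: the author plus character offsets into the input.
    public static final class AuthorHit {
        public final String author;
        public final int tagStart, tagEnd;       // whole <post ...>...</post> span
        public final int authorStart, authorEnd; // just the author value
        AuthorHit(String author, int tagStart, int tagEnd, int authorStart, int authorEnd) {
            this.author = author;
            this.tagStart = tagStart;
            this.tagEnd = tagEnd;
            this.authorStart = authorStart;
            this.authorEnd = authorEnd;
        }
    }

    // DOTALL lets '.' match line terminators; '.*?' stops at the first </post>.
    private static final Pattern POST = Pattern.compile(
        "<post\\s*author=\"([^\"]+)\"[^>]*>.*?</post>", Pattern.DOTALL);

    public static List<AuthorHit> extract(String text) {
        List<AuthorHit> hits = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            // start()/end() give the whole match; start(1)/end(1) give group 1.
            hits.add(new AuthorHit(m.group(1), m.start(), m.end(), m.start(1), m.end(1)));
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "intro <post author=\"luckylindyslocale\" id=\"p7\">\n"
            + "<img src=\"a.png\"/>\nGrams thank you\n</post> outro";
        for (AuthorHit h : extract(text)) {
            System.out.println(h.author + " tag=[" + h.tagStart + "," + h.tagEnd + ")"
                + " author=[" + h.authorStart + "," + h.authorEnd + ")");
        }
    }
}
```

The offsets are indices into the original string, end-exclusive, so text.substring(h.authorStart, h.authorEnd) gives back the author.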

Answer 2

You can do it like this:

/<post author="(.*?)"/

Working Demo

The comments saying that regex is not the best tool for parsing HTML are correct, but this should do what you want.
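
If only the opening tag matters, the same idea in Java might look like the sketch below. The surrounding / delimiters are dropped because Java patterns do not use them. The class and method names are invented for illustration, and the pattern assumes author is the first attribute and is separated from post by exactly one space, as in the sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OpeningTagAuthors {
    // Matches only the start of the opening tag; group 1 is the author value.
    private static final Pattern OPEN_TAG = Pattern.compile("<post author=\"(.*?)\"");

    // Returns entries of the form author@start-end, joined by ';'.
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = OPEN_TAG.matcher(text);
        while (m.find()) {
            if (sb.length() > 0) sb.append(';');
            sb.append(m.group(1)).append('@')
              .append(m.start(1)).append('-').append(m.end(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "<post author=\"amy\" id=\"p1\">a</post><post author=\"bo\">b</post>";
        System.out.println(describe(text)); // prints "amy@14-17;bo@49-51"
    }
}
```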

Extracting information from XML using a regular expression
