Regex does not return the string between HTML tags

Time: 2011-12-09 03:07:12

Tags: html regex

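This problem has had me tearing my hair out: I am trying to extract the text contained in the Location block of the code at the bottom. I want to extract this: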

<h3 class="blue">Location</h3><p class="desc">This elegant luxurious hotel is located in the middle of stunning greenery on a hill, overlooking the sand/ pebble beach of Ixia, which is accessed just over the promenade (around 200 m away). The glamorous building, which is based on architecture from the Middle Ages is stylish and designed in classical, elegant decor. The island's capital of Rhodes Town is located around 4 km from the hotel and Rhodes' airport is roughly 9 km away whilst public transport departs from a stop located just 200 m away.</p>

<h3 class="blue">Location<\/h3><p\s(.*)\s.<\/p>

I tried the regex above, but it does not work. Could someone please help? Regards

 ...In addition, there is also playground for younger guests in the hotel grounds.</p><h3 class="blue">Location</h3><p class="desc">This elegant luxurious hotel is located in the middle of stunning greenery on a hill, overlooking the sand/ pebble beach of Ixia, which is accessed just over the promenade (around 200 m away). The glamorous building, which is based on architecture from the Middle Ages is stylish and designed in classical, elegant decor. The island's capital of Rhodes Town is located around 4 km from the hotel and Rhodes' airport is roughly 9 km away whilst public transport departs from a stop located just 200 m away.</p><h3 class="blue">Rooms</h3><p class="desc">The comfortable rooms include an en suite bathroom with hairdryer, bathrobe, slippers, a direct dial telephone, satellite/ cable TV, a minibar, air conditioning (centrally regulated), a hire safe as well as a terrace or balcony.</p><h3 class="blue">Sports</h3><p class="desc">In the outdoor complex are 2 swimming pools with children's pools, a...

3 answers:

Answer 0: (score: 2)

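If your language of choice has a library for parsing HTML, you should use it. Regex is not always the best tool for the job, but you can pull it off if you are familiar with the input.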
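As a rough illustration of the parser route, here is a minimal sketch using Python's standard-library html.parser module. The class name SectionExtractor and the shortened sample string are assumptions made for this example; the question does not say which language or library is in use.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collects the text of the first <p> that follows an <h3> with a given title.

    Hypothetical helper written for this example, not part of any library.
    """

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.in_h3 = False          # currently inside an <h3> element
        self.heading_seen = False   # the wanted heading has been found
        self.in_target_p = False    # currently inside the paragraph we want
        self.done = False           # the wanted paragraph has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "p" and self.heading_seen and not self.done:
            self.in_target_p = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "p" and self.in_target_p:
            self.in_target_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_h3 and data.strip() == self.title:
            self.heading_seen = True
        elif self.in_target_p:
            self.chunks.append(data)


# Shortened stand-in for the page source; the real paragraphs are much longer.
html = (
    '<p class="desc">Playground.</p>'
    '<h3 class="blue">Location</h3><p class="desc">On a hill, 200 m from the beach.</p>'
    '<h3 class="blue">Rooms</h3><p class="desc">En suite bathroom.</p>'
)

parser = SectionExtractor("Location")
parser.feed(html)
parser.close()
location_text = "".join(parser.chunks)
print(location_text)
```

The parser keys on the heading text rather than on the exact attribute layout, so it should keep working if the class names or spacing in the markup change.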

That said, your pattern is greedy, so it will run past the first closing paragraph tag. To make it non-greedy, use .*? (note the added ?).
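The difference between the greedy and non-greedy quantifier can be seen in a small sketch. Python's re module is used here purely for illustration, and the two-paragraph sample string is a shortened stand-in for the real page.

```python
import re

# Shortened stand-in for the page source; the real paragraphs are much longer.
html = (
    '<h3 class="blue">Location</h3><p class="desc">On a hill.</p>'
    '<h3 class="blue">Rooms</h3><p class="desc">En suite bathroom.</p>'
)

# Greedy: .* grabs as much as possible, so it stops at the LAST </p>.
greedy = re.search(r'<p\s(.*)</p>', html).group(1)

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST </p>.
lazy = re.search(r'<p\s(.*?)</p>', html).group(1)

print(greedy)
print(lazy)
```

In this sample the greedy capture swallows the Rooms heading and paragraph as well, while the non-greedy capture ends at the close of the first paragraph.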

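Also, escaping forward slashes is usually unnecessary (though I am guessing you are using PHP, based on your history, where / is commonly the pattern delimiter). Using \s. causes your match to fail because the text does not end with a whitespace character followed by one more character. The . is a metacharacter that matches any character. If you meant to match a period, escape it so that it is treated as a literal: \.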

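I prefer \b to mark a word boundary after the p of the tag name, rather than \s. Finally, there is no need for the capturing group (.*?) unless you want to capture the paragraph text. Fixing all of these gives you: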

<h3 class=\"blue\">Location<\/h3><p\b.*?<\/p>

If you want to capture the paragraph text, you can use this approach:

<h3 class=\"blue\">Location<\/h3><p[^>]*>(.*?)<\/p>
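  • [^>]* matches any character that is not a greater-than sign, zero or more times. A benefit of this part of the pattern is that it also behaves non-greedily in effect, because matching stops as soon as a greater-than sign is encountered.
  • > matches a literal greater-than sign
  • (.*?) is the group that captures the inner paragraph content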
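To show the capturing pattern in use, here is a minimal sketch in Python. The language choice and the shortened sample string are assumptions for illustration. The forward slashes are left unescaped because Python has no pattern delimiter, and re.DOTALL is added on the assumption that a paragraph might span several lines.

```python
import re

# Shortened stand-in for the page source; the real paragraphs are much longer.
html = (
    '<p class="desc">Playground.</p>'
    '<h3 class="blue">Location</h3><p class="desc">On a hill, 200 m from the beach.</p>'
    '<h3 class="blue">Rooms</h3><p class="desc">En suite bathroom.</p>'
)

# [^>]* skips the attributes of the <p> tag, (.*?) captures up to the first </p>.
pattern = re.compile(
    r'<h3 class="blue">Location</h3><p[^>]*>(.*?)</p>',
    re.DOTALL,  # assumption: lets . match newlines in case the text spans lines
)

match = pattern.search(html)
location_text = match.group(1) if match else None
print(location_text)
```

Group 1 holds only the paragraph text, without the surrounding tags or the class attribute.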

Answer 1: (score: 0)

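Your regex ends with \s.<\/p>, but the paragraph ends with ay.</p>. The \s matches a whitespace character, yet your input has y at that position, so the match fails.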
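This can be checked on just the tail of the paragraph. The sketch below uses Python for illustration, and the variable names are made up for the example.

```python
import re

# The last few characters of the Location paragraph from the question.
tail = "just 200 m away.</p>"

# Original ending: a whitespace character, then any character, then </p>.
original_tail = re.search(r'\s.</p>', tail)

# Escaped dot: a literal period directly before </p>.
escaped_dot = re.search(r'\.</p>', tail)

print(original_tail)
print(escaped_dot.group(0))
```

The first search finds nothing because no whitespace character sits two positions before </p>, while the second matches the literal period that closes the sentence.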

Answer 2: (score: 0)

Just remove the \s after the first group. There is no whitespace before any period in the string.

<h3 class="blue">Location<\/h3><p\s(.*).<\/p>