Question

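I am trying to write a regular expression to extract the <title> from a page fetched from a URL, but the problem is that "." does not match newlines, as we already know. How can I write a regex that matches and extracts the page title with <title>(.*?)</title> when line breaks can appear anywhere in between?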

I am using Grails.

Answer 1

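Although you cannot use regular expressions to parse HTML in general, you can probably get away with it in this case. In Groovy, you can use the (?s) flag to make the dot match newline characters. You should also use the (?i) flag to make your regex case-insensitive. You can combine the two as (?is).

For example: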

def titleTagWithNoLineBreaks = "<title>This is a title</title>"
def titleTagWithLineBreaks = """<title>This is
a title</title>"""

// Note the (?is) at the beginning of the regex
// The 'i' makes the regex case-insensitive
// The 's' makes the dot match newline characters
def pattern = ~/(?is)<title>(.*?)<\/title>/

def matcherWithNoLineBreaks = titleTagWithNoLineBreaks =~ pattern
def matcherWithLineBreaks = titleTagWithLineBreaks =~ pattern

assert matcherWithNoLineBreaks.size() == 1
assert matcherWithLineBreaks.size() == 1

assert matcherWithLineBreaks[0][1].replaceAll(/\n/,' ') == "This is a title"
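
The final assertion above replaces only \n, so a title split by Windows line endings (\r\n) or padded with indentation would keep stray characters. As a minimal sketch, written in Java because Groovy uses the same java.util.regex engine, the captured text can be collapsed with \s+ instead. The class name TitleNormalizer and the method normalizedTitle are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleNormalizer {
    // Same pattern as in the Groovy example: case-insensitive, dot matches newlines.
    private static final Pattern TITLE = Pattern.compile("(?is)<title>(.*?)</title>");

    // Returns the title with every run of whitespace collapsed to one space,
    // or null when the input contains no <title> element.
    static String normalizedTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1).trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Prints: This is a title
        System.out.println(normalizedTitle("<title>This is\r\n   a title</title>"));
    }
}
```

Returning null for a missing title is just one possible convention; an Optional<String> would work equally well.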

Hope that helps.

Answer 2

Assuming this is for PHP:

preg_match( "#<title>(.*?)</title>#s", $source, $match );
$title = $match[1];

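Whichever software you are using, adding the s modifier (where the regex flavor supports it) changes the meaning of . (any character) so that it also matches newlines.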
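
To illustrate the effect of the s modifier described above, here is a minimal sketch in Java, where the same modifier is written inline as (?s). It runs the same expression with and without the modifier against a title that spans two lines. The class name TitleDotAllDemo and its members are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleDotAllDemo {
    // Without the s modifier, '.' stops at line terminators.
    static final Pattern WITHOUT_S = Pattern.compile("(?i)<title>(.*?)</title>");
    // With the s modifier, '.' also matches line terminators.
    static final Pattern WITH_S = Pattern.compile("(?is)<title>(.*?)</title>");

    // Returns the captured title text, or null when the pattern does not match.
    static String extract(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>This is\na title</TITLE></head></html>";
        // No match: the dot cannot cross the line break, so this prints null.
        System.out.println(extract(WITHOUT_S, html));
        // Match: prints the title, still containing its line break.
        System.out.println(extract(WITH_S, html));
    }
}
```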

Answer 3

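If you just need to parse possibly malformed HTML documents, you could try the TagSoup parser. You can then use GPath expressions and do not have to worry about oddities such as a "</title>" inside a comment in the title, and so on.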

import org.ccil.cowan.tagsoup.Parser

final parser  = new Parser()
final slurper = new XmlSlurper(parser)
final html    = slurper.parse('http://www.example.com/')

println html.depthFirst().find { it.name() == 'title' }

Regex to match <title> </title> including newlines anywhere
