Question

I have a title with this structure:

<title>WebsiteName | Page title | Slogan</title>

Currently, in C#, I use this to get the title:

Regex.Match(pageSource,
                @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>",
                RegexOptions.IgnoreCase).Groups["Title"].Value;

However, all I want is the "Page title" part.

Answer 1

Avoid using regex to parse HTML.

You can do this with HtmlAgilityPack instead. The following gets the title of the HTML document:

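// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
// SelectSingleNode returns null when the document has no <title> element
string title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;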

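Once you have the page title, you can use a regular expression to pull out the part you need.

Assuming your title always has the same form as in your example, you can use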

(?<=\|).+?(?=\|)
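As a minimal sketch of how that lookaround pattern might be applied to the title string: the class and method names below (`TitleLookaround`, `ExtractMiddle`) are hypothetical, and the `Trim()` call is an addition, since the match itself keeps the spaces that surround the section between the pipes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleLookaround
{
    // Hypothetical helper: returns the text between the first two pipes,
    // or an empty string when the title contains no such section.
    public static string ExtractMiddle(string title)
    {
        // (?<=\|) requires a pipe just before the match, (?=\|) a pipe just after it;
        // neither pipe is included in the matched text.
        Match m = Regex.Match(title, @"(?<=\|).+?(?=\|)");

        // The raw match is " Page title " (with spaces), so trim it.
        return m.Success ? m.Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractMiddle("WebsiteName | Page title | Slogan"));
    }
}
```

If the title has no pipes at all, this sketch returns an empty string rather than throwing.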

Answer 2

If you only want the page title, try the following:

\|(.*)\|

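If you pass in the string you provided, the second group of the match (Groups[1] in .NET, since Groups[0] is the whole match) will contain the title. If you find yourself doing anything more complicated than this, regex is probably not the right tool; there are better ways to parse HTML.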
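A short sketch of reading that capture group in C# follows. The names `TitleGroup` and `ExtractMiddle` are made up for illustration, and the `Trim()` is an addition because the group captures the spaces next to the pipes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleGroup
{
    // Hypothetical helper: returns what the capture group in \|(.*)\| matched,
    // trimmed, or an empty string when the input has fewer than two pipes.
    public static string ExtractMiddle(string title)
    {
        Match m = Regex.Match(title, @"\|(.*)\|");

        // Groups[0] is the whole match including both pipes;
        // Groups[1] is the text captured between them.
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractMiddle("WebsiteName | Page title | Slogan"));
    }
}
```

Note that `.*` is greedy, so with more than two pipes the group spans from the first pipe to the last one: `"A | B | C | D"` yields `B | C`. That is fine for the three-part title in the question, but worth keeping in mind if the format can vary.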

Answer 3

Try this:

@"\<title[^>]*\>[^|]*\|\s*(?<Title>[^|]*?)\|[^<]*\</title\>"

"\<title[^>]*\>"   // Opening title tag
"[^|]*"            // Everything up to the first pipe
"\|\s*"            // First pipe and any leading whitespace
"(?<Title>[^|]*?)" // The page title section between the pipes
"\|"               // Second pipe
"[^<]*"            // Everything after the second pipe up to the closing title tag
"\</title\>"       // Closing title tag
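To show the pieces working together, here is a small sketch that runs the full pattern against a page source string. The names `TitleFromSource` and `ExtractPageTitle` are hypothetical; `RegexOptions.IgnoreCase` is carried over from the code in the question, and the `Trim()` is an addition because the `Title` group keeps the space before the second pipe.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleFromSource
{
    // Hypothetical helper: extracts the middle section of a
    // "WebsiteName | Page title | Slogan" style <title> from raw HTML.
    public static string ExtractPageTitle(string pageSource)
    {
        const string pattern =
            @"\<title[^>]*\>[^|]*\|\s*(?<Title>[^|]*?)\|[^<]*\</title\>";

        Match m = Regex.Match(pageSource, pattern, RegexOptions.IgnoreCase);

        // The named group captures "Page title " (trailing space), so trim it.
        return m.Success ? m.Groups["Title"].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        string html =
            "<html><head><title>WebsiteName | Page title | Slogan</title></head></html>";
        Console.WriteLine(ExtractPageTitle(html));
    }
}
```

Because the pattern requires two pipes inside the title element, a page whose title has a different shape produces no match, and this sketch then returns an empty string.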

Regex - get a specific part of the title
