Question

I'm trying to parse part of a website's HTML source. The regex I currently have is:

Regex pattern = new Regex (
@"<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag",
RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );

However, it no longer matches the links. I've posted a sample string here.

Basically, I'm trying to match links like this one:

<a href="http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" id="thread_title_3046631">How to Get a Travel Visa</a>

"http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/" is the **Link**
"3046631" is the **TopicId**
"How to Get a Travel Visa" is the **Title**

The sample I posted contains at least 3 of them; I haven't counted the others.

Also, I use RegexHero (online and free) to check my matches interactively before adding the pattern to my code.

Answer 1

For completeness, here is how it can be done with the Html Agility Pack, a robust HTML parser for .NET (also available through NuGet, so installing it takes about 20 seconds).

Loading the document, parsing it, and finding the 3 links is as simple as:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

string linkIdPrefix = "thread_title_";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/upixof");
IEnumerable<HtmlNode> threadLinks = doc.DocumentNode.Descendants("a")
                              .Where(link => link.Id.StartsWith(linkIdPrefix));

That's it, really. Now you can easily get the data:

foreach (var link in threadLinks)
{
    string href = link.GetAttributeValue("href", null);
    string id = link.Id.Substring(linkIdPrefix.Length); // remove "thread_title_"
    string text = link.InnerHtml; // or link.InnerText
    Console.WriteLine("{0} - {1}", id, href);
}

Answer 2

It's simple: the markup has changed, and the href attribute now appears before the id:

<a\b             # Begin start tag
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $1: href
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $2: id
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </a\s*>            # Closing tag

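Note that:

This kind of breakage is the main reason why parsing HTML with a regex is such a bad idea.
The group numbers have changed. While you're at it, you can use named groups instead: (?<ID>[^>\s'""]+) rather than ([^>\s'""]+).
The quotes are still doubled, as they were in the C# verbatim string (this is harmless inside a character class).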

Example on Regex Hero.
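To make the named-group suggestion concrete, here is a minimal sketch of the reordered pattern with every capture named. The group name `ID` comes from the note above; `Link` and `Title`, as well as the `ThreadLinkParser` class and its `Parse` method, are names made up for this illustration. It assumes the href attribute comes before the id, as in the sample link from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original code.
public static class ThreadLinkParser
{
    // Same pattern as above, but with named groups, so reordering the
    // attributes in the pattern no longer changes how captures are read.
    private static readonly Regex Pattern = new Regex(
        @"<a\b                                                    # Begin start tag
            [^>]+?                                                # Lazily consume up to href attribute
            href\s*=\s*['""]?(?<Link>[^>\s'""]+)['""]?            # Link
            [^>]+?                                                # Lazily consume up to id attribute
            id\s*=\s*['""]?thread_title_(?<ID>[^>\s'""]+)['""]?   # TopicId
            [^>]*                                                 # Consume up to end of open tag
            >                                                     # End start tag
            (?<Title>.*?)                                         # Title
            </a\s*>                                               # Closing tag",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns one (Id, Link, Title) tuple per matching anchor tag.
    public static List<(string Id, string Link, string Title)> Parse(string html)
    {
        var results = new List<(string Id, string Link, string Title)>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add((m.Groups["ID"].Value, m.Groups["Link"].Value, m.Groups["Title"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // The sample link from the question.
        string html = @"<a href=""http://visitingspain.com/forum/f89/how-to-get-a-travel-visa-3046631/"" id=""thread_title_3046631"">How to Get a Travel Visa</a>";
        foreach (var (id, link, title) in Parse(html))
        {
            Console.WriteLine("{0} - {1} - {2}", id, title, link);
        }
    }
}
```

Reading captures by name means the calling code keeps working if the attribute order in the pattern has to be swapped again. The pattern itself still depends on the attribute order in the markup, so this does not remove the fragility mentioned above.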

Answer 3

Don't do that (well, almost, but it's not for everyone). Parsers are meant for this kind of thing.

Simple Regex Help using C# (Regex Pattern Included)
