Question

I have this string stream:

"do=whoposted&amp;t=1934067" rel=nofollow>61</A></TD><TD class=alt2 align=middle>5,286</TD></TR><TR><TD id=td_threadstatusicon_1911046 class=alt1><IMG id=thread_statusicon_1911046 border=0 alt="" src="http://url.com/forum/images/statusicon/thread_new.gif"> </TD><TD class=alt2><IMG title=Node border=0 alt=Node src="http://url.com/forum/images/icons/new.png"></TD><TD id=td_threadtitle_1911046 class=alt1 title="http://lulzimg.com/i14/7bd11b.jpg &#10; &#10;Complete name : cool-thread...."><DIV><A id=thread_gotonew_1911046 href="http://url.com/forum/f80/cool-topic-new/"><IMG class=inlineimg title="Go to first new post" border=0 alt="Go to first new post" src="http://url.com/forum/images/buttons/firstnew.gif"></A> [MULTI] <A style="FONT-WEIGHT: bold" id=thread_title_1911046 href="http://url.com/forum/f80/cool-topic-name-1911046/">Cool Topic Name</A> </DIV><DIV class=smallfont><SPAN style="CURSOR: pointer" onclick="window.open('http://url.com/forum/members/u2031889/', '_self')">m3no</SPAN> </DIV></TD><TD class=alt2 title="Replies: 11, Views: 1,554"><DIV style="TEXT-ALIGN: right; WHITE-SPACE: nowrap" class=smallfont>Today <SPAN class=time>08:04 AM</SPAN><BR>by <A href="http://url.com/forum/members/u1131830/" rel=nofollow>karetsos</A> <A "

The content I'm interested in looks like this:

<A style="FONT-WEIGHT: bold" id=thread_title_1911046 href="http://url.com/forum/f80/cool-topic-name-1911046/">Cool Topic Name</A>

From this I want to extract:

Thread id: 1911046 (could be from either location in the string)
Thread name: "Cool Topic Name"
Thread link: "http://url.com/forum/f80/cool-topic-name-1911046/"

Currently I'm using this:

Regex pattern = new Regex ( "<A\\s+href=\"([^\"]*)\">([^\\x00]*?)\\s+id=thread_title_(\\S+)</A>" );

MatchCollection matches = pattern.Matches ( doc.ToString ( ) );

foreach ( Match match in matches )
{
    int id = Convert.ToInt32 ( match.Groups [ 1 ].Value );

    string name = match.Groups [ 3 ].Value;
    string link = match.Groups [ 2 ].Value;

    ...
}

If anyone can help me fix the pattern so that it matches, I'd appreciate it. This used to work, but it now returns 0 matches.

Answer 1

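Michael Papile's answer works. Remove the forward slashes (/) from the start and end of the pattern you showed in your last comment. Forward slashes are pattern delimiters in Ruby; we don't use them in .NET: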

var rg = new Regex(@"<A(?:[^<]*)thread_title_(\d+) href=""([^""]*)"">([^<]*)");

(In a verbatim string (@"..."), the only thing you need to escape is the double quote, which you do by doubling it.)
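To illustrate the two points above, here is a minimal sketch. The class name, helper method, and sample HTML are made up for the demonstration and are not from the original post. It shows that a verbatim string with doubled quotes and a regular string with backslash-escaped quotes produce the same pattern, and that wrapping the pattern in Ruby-style slashes makes .NET look for literal `/` characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimEscapingDemo
{
    // The same pattern fragment written two ways.
    public const string Verbatim = @"href=""([^""]*)""";   // quotes doubled
    public const string Regular = "href=\"([^\"]*)\"";      // quotes backslash-escaped

    // Hypothetical sample input for the demonstration.
    public const string Html = "<A href=\"http://url.com/\">x</A>";

    // Returns the first captured href value, or null if the pattern does not match.
    public static string ExtractHref(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Both literals compile to the same string.
        Console.WriteLine(Verbatim == Regular);            // True
        Console.WriteLine(ExtractHref(Verbatim, Html));    // http://url.com/

        // Ruby-style delimiters are treated as literal slashes in .NET,
        // so this pattern needs "/href" in the input and finds nothing here.
        Console.WriteLine(Regex.IsMatch(Html, "/" + Regular + "/"));  // False
    }
}
```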

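Edit: Corrected the pattern, as Richard suggested, to use the latest version from the comments. The original pattern didn't match the element, but this variation should. Interestingly, the pattern works whether or not you put the extraneous \ before the quotes, but Richard is right that it isn't needed.

Edit (again): You're right, this pattern doesn't work on the actual page. Of the three answers, only ridgrunner's returns 24 matches.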

Answer 2

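Assuming there can be any number of attributes, that the href attribute always comes after the id, and that attribute values may or may not be quoted, this should do it: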

Regex pattern = new Regex(
    @"<A\b             # Begin start tag
    [^>]+?             # Lazily consume up to id attribute
    id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?  # $1: id
    [^>]+?             # Lazily consume up to href attribute
    href\s*=\s*['""]?([^>\s'""]+)['""]?             # $2: href
    [^>]*              # Consume up to end of open tag
    >                  # End start tag
    (.*?)                                           # $3: name
    </A\s*>            # Closing tag", 
    RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

Edit: Fixed the expression, which was consuming past the end of the start tag. ([^>]+)
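For reference, here is a rough sketch of how this pattern could be applied to the anchor element from the question. The class and method names are made up for illustration; the pattern is the one above with the inline comments dropped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreadLinkDemo
{
    // The pattern from this answer; whitespace is ignored because of
    // RegexOptions.IgnorePatternWhitespace.
    private static readonly Regex Pattern = new Regex(
        @"<A\b
        [^>]+?
        id\s*=\s*['""]?thread_title_([^>\s'""]+)['""]?
        [^>]+?
        href\s*=\s*['""]?([^>\s'""]+)['""]?
        [^>]*
        >
        (.*?)
        </A\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // The anchor element quoted in the question.
    public const string Sample =
        "<A style=\"FONT-WEIGHT: bold\" id=thread_title_1911046 " +
        "href=\"http://url.com/forum/f80/cool-topic-name-1911046/\">Cool Topic Name</A>";

    // Returns { id, link, name } for the first match, or null if nothing matches.
    public static string[] Extract(string html)
    {
        Match m = Pattern.Match(html);
        if (!m.Success) return null;
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    public static void Main()
    {
        string[] parts = Extract(Sample);
        Console.WriteLine(int.Parse(parts[0]));  // 1911046
        Console.WriteLine(parts[1]);             // http://url.com/forum/f80/cool-topic-name-1911046/
        Console.WriteLine(parts[2]);             // Cool Topic Name
    }
}
```

Note that the group order here is id, link, name, so the indices in the question's loop would need to be adjusted to match.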

Answer 3

I don't program in C#, but here is a regex that works in Ruby (I guess you have to use \\ for character classes?)

/<A.*thread_title_(\d+) href=\"([^\"]*)\">([^<]*)/

EDIT Try this: thread_title_(\d+) href=\"([^\"]*)\"\>(.*?)<\/A> It matches 2 in that pastie you made. If you have to match complex things in HTML, regex is not a good fit; you should use an XML/HTML parser.

Answer 4

This should do it...

<a[^>]+thread_title_(?<id>\d+)[^>]+href="(?<link>[^"]*)">(?<name>[^<]*)</a>

Some of the other suggestions are a bit too greedy and will match multiple links at once in the sample text.
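A small sketch of what "too greedy" means in practice. The two-link input below is made up for the demonstration (ids and URLs are hypothetical), and IgnoreCase is assumed for both patterns. The `.*` version swallows everything up to the last thread_title_ on the line, while `[^>]+` cannot leave the current tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical one-line input containing two thread links.
    public const string TwoLinks =
        "<A id=thread_title_1 href=\"http://url.com/a/\">One</A> " +
        "<A id=thread_title_2 href=\"http://url.com/b/\">Two</A>";

    // Pattern using .* (can run across tags).
    public const string Greedy = @"<A.*thread_title_(\d+) href=""([^""]*)"">([^<]*)";

    // Pattern using [^>]+ (stays inside one start tag).
    public const string Bounded =
        @"<a[^>]+thread_title_(?<id>\d+)[^>]+href=""(?<link>[^""]*)"">(?<name>[^<]*)</a>";

    public static int CountMatches(string pattern) =>
        Regex.Matches(TwoLinks, pattern, RegexOptions.IgnoreCase).Count;

    public static void Main()
    {
        Console.WriteLine(CountMatches(Greedy));   // 1: one match spanning both links
        Console.WriteLine(CountMatches(Bounded));  // 2: one match per link

        // The single greedy match reports the id of the second link.
        Match m = Regex.Match(TwoLinks, Greedy, RegexOptions.IgnoreCase);
        Console.WriteLine(m.Groups[1].Value);      // 2
    }
}
```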

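Another thing to point out is the (?<link> notation, which is a named group. It matches the same way as a regular group, but in C# you can access these groups either by name or by index.

You can see this in action here...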
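As a quick sketch of accessing named groups, assuming the pattern is run with RegexOptions.IgnoreCase (the sample markup uses uppercase tags) and using a made-up class name: when a pattern contains only named groups, .NET numbers them left to right, so here id is 1, link is 2, and name is 3.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The pattern from this answer, written as a verbatim string.
    public static readonly Regex Pattern = new Regex(
        @"<a[^>]+thread_title_(?<id>\d+)[^>]+href=""(?<link>[^""]*)"">(?<name>[^<]*)</a>",
        RegexOptions.IgnoreCase);

    // The anchor element quoted in the question.
    public const string Sample =
        "<A style=\"FONT-WEIGHT: bold\" id=thread_title_1911046 " +
        "href=\"http://url.com/forum/f80/cool-topic-name-1911046/\">Cool Topic Name</A>";

    public static void Main()
    {
        Match m = Pattern.Match(Sample);

        // Access by name.
        Console.WriteLine(m.Groups["id"].Value);    // 1911046
        Console.WriteLine(m.Groups["link"].Value);  // http://url.com/forum/f80/cool-topic-name-1911046/
        Console.WriteLine(m.Groups["name"].Value);  // Cool Topic Name

        // Access by index returns the same captures.
        Console.WriteLine(m.Groups[1].Value == m.Groups["id"].Value);  // True
    }
}
```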

http://regexhero.net/tester/?id=7855af6f-7774-4a7c-afa2-81c3e24cf496

By the way, use the .NET button at the top of Regex Hero to generate the C#; it will escape the quotes correctly for you.

Simple line matching using Regex
