Question

I am trying to use C# to parse an HTML document for td tags, so that

    <td>Whatever string</td><td class="pass">value</td>

will return

    Whatever string : value

I have spent hours on this, trying both an XML parser and regular expressions, to no avail. Thanks for your help.

I have tried

    List<string> list = Regex.Split(lineslineWithTdTag[i], "[<td>].[<\td>]").ToList();
    List<string> status = Regex.Split(list[3], "[pass=\"].\"").ToList() ;

and then I tried to parse the resulting list.

Answer 1

At the risk of incurring the wrath of the "you can't parse HTML with regex" purists, here is a regex solution that should do what you want:

    // Capture the text of two adjacent cells; the second <td> may carry attributes.
    var match = Regex.Match(lineslineWithTdTag[i], "<td>(.*?)</td><td.*?>(.*?)</td>");
    string result = match.Groups[1].Value + " : " + match.Groups[2].Value;

Of course, if the actual records are not as well-formed as your example, all bets are off.
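
For input that is slightly less regular, a somewhat more tolerant variant of the same idea might look like the sketch below. It is only an illustration under stated assumptions: the names `TdPairs` and `Extract` are hypothetical, and the pattern assumes cells come in adjacent label/value pairs. It allows attributes on either cell, whitespace between cells, mixed-case tags, and several pairs per input string, and it decodes HTML entities. It still does not handle nested tags or malformed markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class TdPairs
{
    // Two adjacent <td> cells; attributes on either cell and
    // whitespace between the cells are tolerated.
    private static readonly Regex PairPattern = new Regex(
        @"<td\b[^>]*>(.*?)</td>\s*<td\b[^>]*>(.*?)</td>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns one "label : value" string per pair of adjacent cells.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in PairPattern.Matches(html))
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            string label = WebUtility.HtmlDecode(m.Groups[1].Value).Trim();
            string value = WebUtility.HtmlDecode(m.Groups[2].Value).Trim();
            results.Add(label + " : " + value);
        }
        return results;
    }
}
```

Called on the line from the question, `TdPairs.Extract` should return a list with the single entry `Whatever string : value`.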

Answer 2

After a lot of work, this ended up being my solution:

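    // Requires: using System.Linq; using System.Xml.Linq;
    // Note: XDocument.Load only succeeds if the page is well-formed XML.
    string path = @"http://localhost/page.html";
    XDocument myX = XDocument.Load(path);
    string field1 = "";
    string field2 = "";
    // true while the next <td> is expected to be the first cell of a pair
    bool flag = true;
    foreach (var name in myX.Root.DescendantNodes().OfType<XElement>())
    {
        // get the first element
        if (name.Name.LocalName == "td" && flag)
        {
            field1 = (string)name + "\n";
            flag = false;
        }
        // get the second element
        else if (name.Name.LocalName == "td")
        {
            // field1 and field2 now hold one complete pair
            field2 = (string)name + "\n";
            flag = true;
        }
    }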
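
The loop above overwrites `field1` and `field2` on every pair, so only the last pair survives unless it is used inside the loop. A minimal sketch of one way to collect every pair in the `label : value` form the question asks for follows. The names `TdPairReader` and `Read` are hypothetical, and the sketch assumes the document is well-formed XML in which `td` cells alternate between label and value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original answer.
public static class TdPairReader
{
    // Pairs consecutive <td> elements in document order and
    // formats each pair as "label : value".
    public static List<string> Read(XDocument doc)
    {
        // Match on LocalName so XHTML documents with a namespace also work.
        List<string> cells = doc.Descendants()
            .Where(e => e.Name.LocalName == "td")
            .Select(e => e.Value.Trim())
            .ToList();

        var results = new List<string>();
        // Step two cells at a time; a trailing unpaired cell is ignored.
        for (int i = 0; i + 1 < cells.Count; i += 2)
        {
            results.Add(cells[i] + " : " + cells[i + 1]);
        }
        return results;
    }
}
```

For instance, `TdPairReader.Read(XDocument.Parse("<table><tr><td>Whatever string</td><td class=\"pass\">value</td></tr></table>"))` should return a list containing `Whatever string : value`.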

Parsing a line of HTML for its inner text
