HtmlAgilityPack: fixing tags that were never opened

Time: 2015-01-28 07:30:15

Tags: c# html .net-4.5 html-agility-pack

I download an HTML page from a URL. The page contains a table in which one row is missing its opening <tr> tag:

<table class="transparent">
    <tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
    <td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>

How can I repair it so that it becomes:

<table class="transparent">
    <tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
    <tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>

This is what I tried:

private HtmlDocument GetHtmlDocument(string link)
{
    string url = "http://195.182.67.7/paslaugos/administratoriai/bankroto-administratoriai/" + link;
    var web = new HtmlWeb { AutoDetectEncoding = false, OverrideEncoding = Encoding.UTF8 };
    var doc = web.Load(url);
    doc.OptionFixNestedTags = true;
    doc.OptionAutoCloseOnEnd = true;
    doc.OptionCheckSyntax = true;

    // build a list of nodes ordered by stream position
    NodePositions pos = new NodePositions(doc);

    // browse all tags detected as not opened
    foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
    {
        // find the text node just before this error
        var last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
        if (last != null)
        {
            // fix the text; reintroduce the broken tag
            last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
        }
    }
    doc.Save(Console.Out);
    return doc;
}

But this does not fix the markup.

1 answer:

Answer 0 (score: 0)

For this particular problem, a simple regular expression replacement is enough:

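// requires: using System; using System.Text.RegularExpressions;
string wrong = "<table class=\"transparent\"><tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr></table>";
// match a <td> that is not immediately preceded by <tr> or </td>, i.e. one that starts a row
Regex reg = new Regex(@"(?<!(?:<tr>)|(?:</td>))<td>");
// put the missing <tr> back in front of it
string right = reg.Replace(wrong, "<tr><td>");
Console.WriteLine(right);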
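
The pattern above requires <tr> or </td> to sit directly against the following <td>. If the real page has whitespace between tags (for example a line break between <tr> and <td>), it would insert a second <tr>. Below is a minimal sketch of a whitespace-tolerant variant. It relies on .NET allowing variable-length lookbehind. The class and method names (MissingRowTagFixer, AddMissingRowTags) are made up for illustration, and it assumes the table uses only <td> cells.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class MissingRowTagFixer
{
    // A <td> starts a row when the nearest preceding tag, ignoring whitespace,
    // is neither an opening <tr ...> (attributes allowed) nor a closing </td>.
    private static readonly Regex RowStart = new Regex(
        @"(?<!(?:<tr\b[^>]*>|</td>)\s*)<td\b",
        RegexOptions.IgnoreCase);

    public static string AddMissingRowTags(string html)
    {
        // $& stands for the matched text, so the <td is kept and <tr> is put in front of it.
        return RowStart.Replace(html, "<tr>$&");
    }
}
```

The repaired string can then be parsed with HtmlDocument.LoadHtml instead of letting HtmlWeb load the URL directly. Running the method on already correct markup should leave it unchanged, since every <td> is then preceded by <tr> or </td>.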