Parsing a web page with HAP and LINQ

Time: 2014-09-26 13:34:46

Tags: c# linq html-agility-pack windows-phone-8.1

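I am trying to create a WP 8.1 app that shows content from a web page. My problem is that XPath does not seem to work on WP8.1, so I am trying to use LINQ instead, but I don't know it well. The page looks like this: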

<body>
    <table cellspacing="0" cellpadding="0" border="0" style="border-style:none; padding:0; margin:0;" id="ctl00_ContentPlaceHolder1_ListView1_groupPlaceholderContainer">               
         <tbody>
             <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">         
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH1" href="fumetto.aspx?Fumetto=279277">PH1_1</a>
                    </div>
                </td>
            </tr>
            <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">          
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH2" href="fumetto.aspx?Fumetto=279277">PH2_1</a>
                    </div>
                </td>
            </tr>
            <tr style="border-style:none;padding:0; margin:0; background-image:none; vertical-align:top;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_itemPlaceholderContainer">          
                 <td style="border-style:none;padding:0; margin:0; width:22%;" id="ctl00_ContentPlaceHolder1_ListView1_ctrl0_ctl01_Td3">
                    <div class="photo">
                        <a target="_self" title="PH3" href="fumetto.aspx?Fumetto=279277">PH3_1</a>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</body>  

I want to save the title attributes "PH1", "PH2", "PH3" and the values "PH1_1", "PH2_1", "PH3_1". Can you help me? My code looks like this:

string filePath = "...";
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(filePath);
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
    // Handle any parse errors as required
}
else
{
    if (htmlDoc.DocumentNode != null)
    {
        //I'm trying to get the first node for now
        HtmlAgilityPack.HtmlNode aNode = htmlDoc.DocumentNode.DescendantsAndSelf("a").FirstOrDefault();
        if (aNode != null)
        {
            string first = aNode.GetAttributeValue("title", "null");
            string value = aNode.ToString();
            ...
        }
    }
}

1 answer:

Answer 0 (score: 1)

Try replacing DescendantsAndSelf() with Descendants():

HtmlAgilityPack.HtmlNode aNode = htmlDoc.DocumentNode
                                        .Descendants("a")
                                        .FirstOrDefault();
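
If only the links inside the photo blocks are wanted, the same Descendants() call can be combined with LINQ operators to stand in for an XPath filter. This is a minimal sketch, assuming the anchors of interest are direct children of `<div class="photo">` as in the sample page; the names `PhotoLinkFinder` and `FindPhotoLinks` are hypothetical and not part of HtmlAgilityPack.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PhotoLinkFinder
{
    // Rough LINQ counterpart of the XPath //div[@class='photo']/a.
    // Assumption: the class attribute is exactly "photo" (no extra class names).
    public static List<HtmlNode> FindPhotoLinks(HtmlDocument doc)
    {
        return doc.DocumentNode
                  .Descendants("div")
                  .Where(div => div.GetAttributeValue("class", "") == "photo")
                  .SelectMany(div => div.Elements("a")) // direct <a> children only
                  .ToList();
    }
}
```

Calling FirstOrDefault() on the returned list gives the first matching link, or null when the page has none.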

And instead of calling ToString(), use the InnerText property to get the text between the opening and closing tags:

if (aNode != null)
{
    string first = aNode.GetAttributeValue("title", "null");
    string value = aNode.InnerText;
    .....
}
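
The snippet above reads only the first link, while the question asks for all three titles and values. One way to collect them is to project every `<a>` element with Select() instead of stopping at FirstOrDefault(). This is a sketch under stated assumptions: `LinkExtractor` and `ExtractTitlesAndTexts` are hypothetical names, and the `html` argument is assumed to hold the page markup itself, since LoadHtml parses a markup string rather than opening a path or URL.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns one (title, text) pair per <a> element found in the markup.
    public static List<KeyValuePair<string, string>> ExtractTitlesAndTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // expects HTML markup, not a file path

        return doc.DocumentNode
                  .Descendants("a")
                  .Select(a => new KeyValuePair<string, string>(
                      a.GetAttributeValue("title", ""), // empty string if no title
                      a.InnerText.Trim()))              // text between the tags
                  .ToList();
    }
}
```

For the sample page in the question this should yield three pairs: PH1 with PH1_1, PH2 with PH2_1, and PH3 with PH3_1.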

[.NET fiddle demo]