Question

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink + i + ".html");
    // HtmlWeb.Load takes a URL, not the HTML string downloaded above
    var document = new HtmlWeb().Load(url);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

The problem is that HtmlWeb().Load expects a URL, but I want to load the string downloadString, which already contains the HTML content.

Update

I have now tried this:

for (int i = 0; i < numberoflinks; i++)
{
    string downloadString = client.DownloadString(mainlink + i + ".html");
    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.Load(downloadString);
    var urls = document.DocumentNode.Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s));
}

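But I get an exception on this line:

document.Load(downloadString);

Illegal characters in path

What I want to do is download/extract all the .JPG images from each link, without first saving the page to disk: download the content into a string, extract from that HTML every image link ending in .JPG, and then download the JPGs.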

Answer 1

You should be able to use the LoadHtml() method of HtmlDocument to handle an HTML string.

From the source code:

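/// <summary>
/// Loads the HTML document from the specified string.
/// </summary>
/// <param name="html">String containing the HTML document to load. May not be null.</param>
public void LoadHtml(string html)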

The Load method expects a file name, which is why you get the "Illegal characters in path" message: the HTML string is being treated as a path.
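Applied to the loop from the question, that could look roughly like the sketch below. It is only an illustration, not a tested drop-in: the ImageLinkExtractor class and its two methods are hypothetical names, relative src values are assumed to resolve against the page URL, and the JPGs are saved into the current directory under their original file names. It needs the HtmlAgilityPack NuGet package and keeps the WebClient from the question, which newer .NET versions mark as obsolete.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class ImageLinkExtractor
{
    // Hypothetical helper: returns the absolute URLs of all <img> sources
    // whose path ends in .jpg (case-insensitive).
    public static List<Uri> ExtractJpgLinks(string html, Uri pageUri)
    {
        var document = new HtmlDocument();
        // LoadHtml parses the string itself; Load(string) would treat it as a file path.
        document.LoadHtml(html);

        var result = new List<Uri>();
        foreach (HtmlNode img in document.DocumentNode.Descendants("img"))
        {
            string src = img.GetAttributeValue("src", string.Empty);
            if (string.IsNullOrEmpty(src))
                continue;

            // Resolve relative links against the page URL; skip values that are not valid URIs.
            if (Uri.TryCreate(pageUri, src, out Uri absolute) &&
                absolute.AbsolutePath.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            {
                result.Add(absolute);
            }
        }
        return result;
    }

    // Hypothetical driver mirroring the loop in the question.
    public static void DownloadAllJpgs(string mainlink, int numberoflinks)
    {
        using (var client = new WebClient())
        {
            for (int i = 0; i < numberoflinks; i++)
            {
                string pageUrl = mainlink + i + ".html";
                string html = client.DownloadString(pageUrl);

                foreach (Uri jpg in ExtractJpgLinks(html, new Uri(pageUrl)))
                {
                    // Assumption: file names are unique; identical names overwrite each other.
                    client.DownloadFile(jpg, Path.GetFileName(jpg.LocalPath));
                }
            }
        }
    }
}
```

Only the page HTML stays in memory here; the images themselves still have to be written somewhere, which is what the DownloadFile call does.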

How do I use HtmlAgilityPack to extract links from a string containing HTML content?
