Question

I have an HTML file and need to extract all links and images from it. So basically I need:

<a href="this_is_what_I_need"> and <img src="this_is_also_needed">

I read the file line by line and can get the value, but only the first one on each line:

    List<string> links = new List<string>();
    if (line.Contains(@"<a href=""") || line.Contains(@"<img src="""))
    {
        if (line.Contains(@"<a href="""))
        {
            // [1] is the text after the first match only, so later matches on the line are lost
            links.Add(line.Split(new string[] { @"<a href=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
        else
        {
            links.Add(line.Split(new string[] { @"<img src=""" }, StringSplitOptions.None)[1].Split('"')[0]);
        }
    }

But a single line may contain more than one link and/or image. How can I get all of them?

Answer 1

I don't think this is the right approach. I suggest looking at a scraping tool such as HtmlAgilityPack, which is built for this kind of task.

Here is an example that does this for <a href="...">, but you can adapt it for <img src="...">:

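    using HtmlAgilityPack; // third-party NuGet package
    HtmlDocument doc = new HtmlDocument();
    doc.Load("mytest.htm");

    // The class filter matches one particular page; "//a[@href]" selects every link.
    // SelectNodes returns null when nothing matches, so check before iterating.
    var nodes = doc.DocumentNode.SelectNodes("//a[@class='dn-index-link']");
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine("node:" + node.GetAttributeValue("href", null));
        }
    }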
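
To cover both tags from the question, the same idea can be applied once per tag. The following is only a sketch under some assumptions: `LinkExtractor`, `ExtractLinks` and `Collect` are hypothetical names made up for illustration, and the HTML is passed in as a string rather than loaded from a file. Because the whole document is parsed, it does not matter how many links or images share a line.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, as in the example above

public static class LinkExtractor
{
    // Hypothetical helper: returns every <a href> and <img src> value found in the HTML.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        Collect(doc, "//a[@href]", "href", links);   // all anchors that have an href
        Collect(doc, "//img[@src]", "src", links);   // all images that have a src
        return links;
    }

    private static void Collect(HtmlDocument doc, string xpath, string attribute, List<string> links)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return;

        foreach (HtmlNode node in nodes)
        {
            links.Add(node.GetAttributeValue(attribute, string.Empty));
        }
    }
}
```

It could be called as `LinkExtractor.ExtractLinks(File.ReadAllText("mytest.htm"))`. Note that this sketch lists all anchor values first and all image values second, so the result is not in document order.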
