Question

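I am extracting content from a web page. On the page, information such as phone numbers and email IDs is stored in images. I want to extract the images along with the text in the table. In the output string, I want the result laid out the same way it is displayed on the web page, with both the images and the text.

Here is the web page content.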

<table>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
</table>

I would like to extract the text and images like this:

text img

text img

text img

Answer 1

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// First <img> inside a table cell (null if there is none)
HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//table/tr/td/img");

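// Get the images only (SelectNodes returns null when nothing matches)
HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img");
if (imgs != null)
{
  foreach (HtmlNode img in imgs)
  {
    // GetAttributeValue returns the default when src is missing
    string imgSrc = img.GetAttributeValue("src", "");
  }
}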

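// Get the text of the <td> cells that do not contain an <img>
HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
if (cells != null)
{
  foreach (HtmlNode td in cells)
  {
    // ChildNodes["img"] is the first direct child named img, or null
    HtmlNode img = td.ChildNodes["img"];
    if (img == null)
    {
      string tdText = td.InnerText;
    }
  }
}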

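// Get images that have a style attribute
HtmlNodeCollection styledImgs = doc.DocumentNode.SelectNodes("//img[@style]");
if (styledImgs != null)
{
  foreach (HtmlNode img in styledImgs)
  {
    // Assumes the style is written exactly as background:url('...').
    // ToLower is left out because it would also lowercase the URL.
    string style = img.Attributes["style"].Value;
    style = style.Replace("background:url('", "");
    style = style.Replace("')", "");
    // style now holds the image URL from the background
  }
}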
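
The string replacement in the last loop only works when the style is written exactly as background:url('...'). A slightly more tolerant option is a regular expression. The following is a minimal sketch, not part of HtmlAgilityPack: the class name StyleUrl and the method Extract are hypothetical names introduced here, and the pattern assumes a single url(...) value with optional quotes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StyleUrl
{
    // Hypothetical helper: returns the first url(...) value found in a
    // style string, or null when the style contains no url(...).
    public static string Extract(string style)
    {
        if (string.IsNullOrEmpty(style)) return null;

        // url( ... ) with optional single or double quotes; the match is
        // case-insensitive, but the captured URL keeps its original casing.
        Match m = Regex.Match(
            style,
            @"url\(\s*['""]?(?<u>[^'"")]+?)['""]?\s*\)",
            RegexOptions.IgnoreCase);

        return m.Success ? m.Groups["u"].Value.Trim() : null;
    }
}
```

Inside the loop it could be called as StyleUrl.Extract(img.GetAttributeValue("style", "")), checking the result for null before using it.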

Answer 2

Try this:

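// Assumes root is the document's root node (e.g. doc.DocumentNode)
// and anchorTags is a List<string> declared earlier
foreach (HtmlNode img in root.SelectNodes("//img"))
{
    string att = img.GetAttributeValue("src", "");
    anchorTags.Add(att);
}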
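
The loop above collects only the src values, so the link between each image and the text in its row is lost. To get the "text img" layout the question asks for, one option is to walk the table row by row and emit each cell in order. This is a rough sketch under a few assumptions: the class TableExtractor and the method ExtractRows are hypothetical names, each cell is assumed to hold either text or a single image, and the image is represented by its src value.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableExtractor
{
    // Hypothetical helper: returns one output line per <tr>,
    // with the cells joined in document order.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null) return lines;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td");
            if (cells == null) continue;

            var parts = new List<string>();
            foreach (HtmlNode td in cells)
            {
                HtmlNode img = td.SelectSingleNode(".//img");
                if (img != null)
                {
                    // Image cell: keep the src so the image can be fetched later.
                    parts.Add(img.GetAttributeValue("src", ""));
                }
                else
                {
                    // Text cell: decode entities and trim surrounding whitespace.
                    parts.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
                }
            }
            lines.Add(string.Join(" ", parts));
        }
        return lines;
    }
}
```

For example, passing the contents of file.htm to TableExtractor.ExtractRows and writing each returned string on its own line gives one "text img" pair per table row.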

Extracting text and images using HtmlAgilityPack
