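I'm learning to write a web scraper and have found some good examples to get me started, but since I'm new to this, I have some questions about how to code it.
For example, a search result can be found here: Search Result
When I look at the HTML source of the result, I can see the following: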
<HR><CENTER><H3>License Information *</H3></CENTER><HR>
<P>
<CENTER> 06/03/2014 </CENTER> <BR>
<B>Name : </B> WILLIAMS AJAYA L <BR>
<B>Address : </B> NEW YORK NY <BR>
<B>Profession : </B> ATHLETIC TRAINER <BR>
<B>License No: </B> 001475 <BR>
<B>Date of Licensure : </B> 01/12/07 <BR>
<B>Additional Qualification : </B> Not applicable in this profession <BR>
<B> <A href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> REGISTERED <BR>
<B>Registered through last day of : </B> 08/15 <BR>
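How can I use HtmlAgilityPack to scrape this data from the website?
I'm trying to adapt the example shown below, but I'm not sure what to edit so that it scrapes this page: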
private void btnCrawl_Click(object sender, EventArgs e)
{
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        filename = Path.GetFileNameWithoutExtension( ie.FullName ).ToLower();
        if ( filename.Equals( "iexplore" ) )
            txtURL.Text = "Now Crawling: " + ie.LocationURL.ToString();
    }
    string url = ie.LocationURL.ToString();
    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null && item.Attribute("class").Value == "folder-news"
              && item.Element(xmlns + "a") != null
              //select item;
              select new
              {
                  Link = item.Element(xmlns + "a").Attribute("href").Value,
                  Image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  Title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  Desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        MessageBox.Show(node.ToString());
        tb.Text = node + "\n";
    }
    //Console.ReadKey();
}
Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
namespace CrawlerWeb
{
    public class Crawler
    {
        public string Url
        {
            get;
            set;
        }
        public Crawler() { }
        public Crawler(string Url)
        {
            this.Url = Url;
        }
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}
tb is a multi-line textbox, so I would like it to display the following:
Name
WILLIAMS AJAYA L
Address
NEW YORK NY
Profession
ATHLETIC TRAINER
License No
001475
Date of Licensure
1/12/07
Additional Qualification
Not applicable in this profession
Status
REGISTERED
Registered through last day of
08/15
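I want to add the second item of each pair (the value) to an array, because the next step is to write it to a SQL database...
I can get the URL of the search results page from IE, but how do I code the rest in my script?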
Answer 0 (score: 1)
This little snippet should get you started:
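using System.Net;       // WebClient
using HtmlAgilityPack;  // HtmlDocument, HtmlNodeCollection

HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
// Download the raw HTML of the search result page
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=WIL");
doc.LoadHtml(html);
// Select every <div> element in the document
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");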
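You basically use the WebClient class to download the HTML file and then load that HTML into an HtmlDocument object. After that, you use XPath to query the DOM tree and search for nodes. In the example above, "nodes" will contain all the div elements in the document.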
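The markup shown in the question has no div elements; each label sits in a <B> tag and its value is the text that follows it. As a minimal sketch under that assumption, the same XPath approach can select the b nodes and read each one's next sibling. The class name LicenseParser and the method Parse are hypothetical names chosen for illustration, and the code has only been reasoned through against the HTML fragment in the question, not run against the live page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class LicenseParser
{
    // Returns (label, value) pairs such as ("Name", "WILLIAMS AJAYA L").
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<KeyValuePair<string, string>>();

        // HtmlAgilityPack lower-cases element names, so "//b" also matches <B>.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null)
        {
            // SelectNodes returns null (not an empty collection) when nothing matches.
            return result;
        }

        foreach (HtmlNode label in labels)
        {
            // "Name : " -> "Name"; also covers the label wrapped in an <A> tag.
            string name = HtmlEntity.DeEntitize(label.InnerText)
                                    .Trim()
                                    .TrimEnd(':', ' ');

            // Assumption: the value is the text node directly after </B>.
            HtmlNode next = label.NextSibling;
            string value = (next != null && next.NodeType == HtmlNodeType.Text)
                ? HtmlEntity.DeEntitize(next.InnerText).Trim()
                : string.Empty;

            result.Add(new KeyValuePair<string, string>(name, value));
        }

        return result;
    }
}
```

Calling LicenseParser.Parse(html) with the string returned by client.DownloadString(...) should give one pair per <B> label. If the site changes its markup, the next-sibling assumption is the first thing to re-check.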
Here is a quick reference on XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
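Once the label/value pairs have been extracted, the two outputs asked for in the question (label and value on alternating lines in the textbox, and the values alone in an array for the later SQL step) might be produced along these lines. This is only a sketch: LicenseOutput, ToDisplayText and ToValueArray are hypothetical names, and the pairs are assumed to arrive as KeyValuePair<string, string> items.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for shaping already-extracted label/value pairs.
public static class LicenseOutput
{
    // Label on one line, value on the next, as laid out in the question.
    public static string ToDisplayText(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join(Environment.NewLine,
            fields.SelectMany(f => new[] { f.Key, f.Value }));
    }

    // Only the values (the second item of each pair), in document order.
    public static string[] ToValueArray(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return fields.Select(f => f.Value).ToArray();
    }
}
```

Assigning the whole string once, for example tb.Text = LicenseOutput.ToDisplayText(fields), avoids the problem in the question's loop, where tb.Text = node + "\n" replaces the textbox content on every iteration so that only the last item remains.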