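I need to parse this HTML with HtmlAgilityPack and C#. I can get the div class="patent_bibdata" node, but I don't know how to loop through its child nodes.
There are 6 hrefs in this sample, but I need to split them into two groups: Inventors and Classification. I'm not interested in the last two. The div can contain any number of hrefs.
As you can see, each of the two groups is preceded by a text label that says what the hrefs are.
Code snippet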
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
So how do you do it?
<div class="patent_bibdata">
<b>Inventors</b>:
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a>,
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a><br>
<b>Current U.S. Classification</b>:
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a><br>
<br>
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://patft.uspto.gov/netacgi/nph-Parser%3FSect2%3DPTO1%26Sect2%3DHITOFF%26p%3D1%26u%3D/netahtml/PTO/search-bool.html%26r%3D1%26f%3DG%26l%3D50%26d%3DPALL%26RefSrch%3Dyes%26Query%3DPN/3748943&usg=AFQjCNGKUic_9BaMHWdCZtCghtG5SYog-A">
View patent at USPTO</a><br>
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://assignments.uspto.gov/assignments/q%3Fdb%3Dpat%26pat%3D3748943&usg=AFQjCNGbD7fvsJjOib3GgdU1gCXKiVjQsw">
Search USPTO Assignment Database
</a><br>
</div>
Desired result: InventorGroup =
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Ronald T. Lashley
</a>
<a href="http://www.google.com/search?tbo=p&tbm=pts&hl=en&q=ininventor:%22Ronald+T.+Lashley%22">
Thomas R. Lashley
</a>
ClassificationGroup
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a>
The page I'm trying to scrape: http://www.google.com/patents/US3748943
// Anders
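PS! I know the inventors' names are identical on this page, but on most pages they differ!
Answer 0 (score: 4)
XPath is your friend! Something like this will get you the inventors' names: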
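HtmlWeb w = new HtmlWeb();
HtmlDocument doc = w.Load("http://www.google.com/patents/US3748943");
// Every <a> that precedes the first <br> inside the div is an inventor link.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//div[@class='patent_bibdata']/br[1]/preceding-sibling::a");
if (links != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in links)
    {
        Console.WriteLine(node.InnerHtml.Trim());
    }
}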
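The same idea can probably be extended to the classification links by keying on the `<b>` label instead of the position of the first `<br>`. The sketch below is an assumption about how that could look, not part of the original answer: `PatentLinks` and `LinksAfterLabel` are hypothetical names, and the XPath relies on each group being laid out as label, links, then `<br>`, as in the sample HTML.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class PatentLinks
{
    // Hypothetical helper: returns the trimmed text of every <a> that sits
    // between the <b> label containing `label` and the next <br>.
    // Assumes `label` contains no apostrophe, because it is spliced into the XPath.
    public static List<string> LinksAfterLabel(HtmlDocument doc, string label)
    {
        string xpath =
            "//div[@class='patent_bibdata']/a" +
            // the nearest preceding <b> sibling must be the requested label
            "[preceding-sibling::b[1][contains(., '" + label + "')]]" +
            // and no <br> may lie between that label and the link
            "[count(preceding-sibling::br) = count(preceding-sibling::b[1]/preceding-sibling::br)]";

        var result = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(xpath);
        if (links == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode link in links)
        {
            result.Add(link.InnerText.Trim());
        }
        return result;
    }
}
```

Calling `PatentLinks.LinksAfterLabel(doc, "Inventors")` and `PatentLinks.LinksAfterLabel(doc, "Classification")` should then give the two groups, while the trailing USPTO links are skipped because extra `<br>` elements separate them from the last label.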
Answer 1 (score: 2)
So obviously I don't understand XPath yet, so I came up with this solution. Maybe not the smartest solution, but it works!
// Anders
List<string> inventorList = new List<string>();
List<string> classificationList = new List<string>();
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
// m_doc is the HtmlDocument already loaded from the patent page.
HtmlNode nodes = m_doc.DocumentNode.SelectSingleNode(xpath);
bool bInventors = false;
bool bClassification = false;
// SelectSingleNode returns null when the XPath matches nothing, so guard the loop.
for (int i = 0; nodes != null && i < nodes.ChildNodes.Count; i++)
{
    HtmlNode node = nodes.ChildNodes[i];
    string txt = node.InnerText;
    // A label switches the current group; any text mentioning "USPTO" ends both groups.
    if (txt.IndexOf("Inventor") > -1)
    {
        bClassification = false;
        bInventors = true;
    }
    if (txt.IndexOf("Classification") > -1)
    {
        bClassification = true;
        bInventors = false;
    }
    if (txt.IndexOf("USPTO") > -1)
    {
        bClassification = false;
        bInventors = false;
    }
    // Compare the name exactly so that only <a> elements are collected.
    if (node.Name == "a")
    {
        if (bInventors)
        {
            inventorList.Add(txt.Trim());
        }
        if (bClassification)
        {
            classificationList.Add(txt.Trim());
        }
    }
}
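A variation on the loop above avoids hard-coding the keywords: treat each `<b>` as the start of a group, each `<br>` as its end, and collect the `<a>` elements in between. This is only a sketch under the assumption that the div keeps the label, links, `<br>` layout shown in the question. `BibdataGrouper` and `GroupLinksByLabel` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BibdataGrouper
{
    // Hypothetical helper: walks the direct children of `container` once and
    // groups link texts under the text of the <b> label that precedes them.
    public static Dictionary<string, List<string>> GroupLinksByLabel(HtmlNode container)
    {
        var groups = new Dictionary<string, List<string>>();
        string currentLabel = null;

        foreach (HtmlNode child in container.ChildNodes)
        {
            if (child.Name == "b")
            {
                // A <b> element starts a new group named after its text.
                currentLabel = child.InnerText.Trim();
                if (!groups.ContainsKey(currentLabel))
                {
                    groups[currentLabel] = new List<string>();
                }
            }
            else if (child.Name == "br")
            {
                // A <br> ends the current group, so later unlabeled links are ignored.
                currentLabel = null;
            }
            else if (child.Name == "a" && currentLabel != null)
            {
                groups[currentLabel].Add(child.InnerText.Trim());
            }
        }
        return groups;
    }
}
```

With the sample div, the result would hold one entry keyed "Inventors" and one keyed "Current U.S. Classification". The two USPTO links are dropped because no label is active when they are reached.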