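The XPath call below extracts 24 elements from this website: http://www.gwblawfirm.com/contact-us/. But I only want the four city elements (Anderson, Charlotte, Columbia, and Greenville; elements 12:15). It is fine if the state appears along with the city.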
require(XML)
doc <- htmlTreeParse('http://www.gwblawfirm.com/contact-us/', useInternal = TRUE)
xpathSApply(doc, "//ul[@class='menu']/li/a", xmlValue, trim = TRUE)
[1] "Home" "About" "Staff" "Abnormal Use Blog" "Contact Us"
[6] "Attorneys" "Practice Areas" "Industries" "News" "Resources"
[11] "Career Center" "Anderson, SC" "Charlotte, NC" "Columbia, SC" "Greenville, SC"
[16] "Home" "Attorneys" "Practice Areas" "Industries" "About"
[21] "News" "Career Center" "Contact Us" "Disclaimer"
A related question, "properly express the node range from 3 to 10", suggested the following, but it returns all 24:
xpathSApply(doc, "//ul[@class='menu']/li/a[position()>=1 and position()<=16]", xmlValue, trim = TRUE)
How can I match and return only the city elements?
Answer 0 (score: 1)
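You need to wrap the path in parentheses so that position() refers to the position of each <a> within the entire XPath result. Otherwise position() is evaluated as the local position among the <a> children of the same parent <li>, and since each <li> here holds a single <a>, every link is at position 1: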
(//ul[@class='menu']/li/a)[position()>=12 and position()<=15]
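To see the difference, here is a minimal sketch in R. It runs against a small inline HTML fragment that is only a hypothetical stand-in for the real page (the class names and link texts echo the question, but the fragment has 5 links rather than 24, so the indices are 3 to 4 instead of 12 to 15).

```r
library(XML)

# Hypothetical, simplified stand-in for the page: three menus, one <a> per <li>.
html <- '<html><body><div>
  <ul class="menu">
    <li><a href="#">Home</a></li>
    <li><a href="#">About</a></li>
  </ul>
  <h2 class="widgettitle">Contact</h2>
  <ul class="menu">
    <li><a href="#">Anderson, SC</a></li>
    <li><a href="#">Charlotte, NC</a></li>
  </ul>
  <ul class="menu">
    <li><a href="#">Disclaimer</a></li>
  </ul>
</div></body></html>'
doc <- htmlParse(html, asText = TRUE)

# No parentheses: position() is counted inside each <li>, so every <a> is
# at position 1 and the predicate keeps all of them.
all_links <- xpathSApply(doc,
  "//ul[@class='menu']/li/a[position()>=1 and position()<=4]",
  xmlValue, trim = TRUE)

# Parentheses: position() is counted over the whole node set, in document order.
cities <- xpathSApply(doc,
  "(//ul[@class='menu']/li/a)[position()>=3 and position()<=4]",
  xmlValue, trim = TRUE)

length(all_links)  # 5
cities             # "Anderson, SC" "Charlotte, NC"
```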
A better option is to select the <ul> by way of the <h2 class="widgettitle">Contact</h2> heading that precedes it, rather than by position:
//h2[@class='widgettitle' and .='Contact']/following-sibling::ul[@class='menu'][1]/li/a
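The following sketch suggests why the heading-based path is more robust. It again uses a hypothetical inline fragment, not the live page: one extra link ("Staff") is added to the first menu, which shifts the global positions, while the menu after the Contact heading is still found.

```r
library(XML)

# Hypothetical fragment: the first menu has gained a third link ("Staff").
html <- '<html><body><div>
  <ul class="menu">
    <li><a href="#">Home</a></li>
    <li><a href="#">About</a></li>
    <li><a href="#">Staff</a></li>
  </ul>
  <h2 class="widgettitle">Contact</h2>
  <ul class="menu">
    <li><a href="#">Anderson, SC</a></li>
    <li><a href="#">Charlotte, NC</a></li>
  </ul>
  <ul class="menu">
    <li><a href="#">Disclaimer</a></li>
  </ul>
</div></body></html>'
doc <- htmlParse(html, asText = TRUE)

# Fixed global positions 3 to 4 now pick up "Staff" and miss "Charlotte, NC".
by_position <- xpathSApply(doc,
  "(//ul[@class='menu']/li/a)[position()>=3 and position()<=4]",
  xmlValue, trim = TRUE)

# Anchoring on the heading selects the first following-sibling menu,
# regardless of how many links come before it.
by_heading <- xpathSApply(doc,
  "//h2[@class='widgettitle' and .='Contact']/following-sibling::ul[@class='menu'][1]/li/a",
  xmlValue, trim = TRUE)

by_position  # "Staff" "Anderson, SC"
by_heading   # "Anderson, SC" "Charlotte, NC"
```

Note that this path relies on the <h2> and the <ul> being siblings under the same parent, which is what the answer's XPath assumes about the real page.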