Question

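The XPath call below extracts 24 elements from this website: http://www.gwblawfirm.com/contact-us/. But I only want the four city elements (Anderson, Charlotte, Columbia, and Greenville; elements 12:15). It is fine if the state appears along with the city.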

library(XML)
doc <- htmlTreeParse('http://www.gwblawfirm.com/contact-us/', useInternalNodes = TRUE)
xpathSApply(doc, "//ul[@class='menu']/li/a", xmlValue, trim = TRUE)
 [1] "Home"              "About"             "Staff"             "Abnormal Use Blog" "Contact Us"       
 [6] "Attorneys"         "Practice Areas"    "Industries"        "News"              "Resources"        
[11] "Career Center"     "Anderson, SC"      "Charlotte, NC"     "Columbia, SC"      "Greenville, SC"   
[16] "Home"              "Attorneys"         "Practice Areas"    "Industries"        "About"            
[21] "News"              "Career Center"     "Contact Us"        "Disclaimer"

The question "properly express the node range from 3 to 10" suggests the following, but it returns all 24:

xpathSApply(doc, "//ul[@class='menu']/li/a[position()>=1 and position()<=16]", xmlValue, trim = TRUE)

How can I match and return only the city elements?

Answer 1

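You need to wrap the path in parentheses so that the position of each <a> is taken within the entire XPath result; otherwise position() is evaluated as the local position within the same <li> parent: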

(//ul[@class='menu']/li/a)[position()>=12 and position()<=15]
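The difference can be seen in a minimal sketch on a small inline document. The HTML below is a hypothetical, simplified stand-in for the real page (two menus, one <a> per <li>), so the index range is 3 to 6 here rather than 12 to 15:

```r
library(XML)

# Hypothetical, simplified stand-in for the real page (an assumption,
# not the site's actual markup).
html <- '<html><body>
  <ul class="menu">
    <li><a href="#">Home</a></li>
    <li><a href="#">About</a></li>
  </ul>
  <h2 class="widgettitle">Contact</h2>
  <ul class="menu">
    <li><a href="#">Anderson, SC</a></li>
    <li><a href="#">Charlotte, NC</a></li>
    <li><a href="#">Columbia, SC</a></li>
    <li><a href="#">Greenville, SC</a></li>
  </ul>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Without parentheses, position() is relative to each <li> parent.
# Every <a> is the first <a> in its <li>, so the predicate filters nothing.
local_pos <- xpathSApply(
  doc,
  "//ul[@class='menu']/li/a[position()>=1 and position()<=16]",
  xmlValue, trim = TRUE
)

# With parentheses, position() indexes the whole node set in document order.
global_pos <- xpathSApply(
  doc,
  "(//ul[@class='menu']/li/a)[position()>=3 and position()<=6]",
  xmlValue, trim = TRUE
)

print(length(local_pos))
print(global_pos)
```

On the real page the same parenthesized form with 12 and 15 should select the four cities, provided the menu order stays as shown in the question's output.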

A better option is to locate the <ul> from the <h2 class="widgettitle">Contact</h2> heading that precedes it:

//h2[@class='widgettitle' and .='Contact']/following-sibling::ul[@class='menu'][1]/li/a
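As a rough illustration, this heading-based expression can be tried on an inline document. The markup below is a hypothetical simplification that assumes the <h2> and the city <ul> are siblings, which the real page may or may not match exactly:

```r
library(XML)

# Hypothetical, simplified stand-in for the real page (an assumption,
# not the site's actual markup): a header menu, the Contact widget,
# and a footer menu, all siblings under <body>.
html <- '<html><body>
  <ul class="menu">
    <li><a href="#">Home</a></li>
    <li><a href="#">About</a></li>
  </ul>
  <h2 class="widgettitle">Contact</h2>
  <ul class="menu">
    <li><a href="#">Anderson, SC</a></li>
    <li><a href="#">Charlotte, NC</a></li>
    <li><a href="#">Columbia, SC</a></li>
    <li><a href="#">Greenville, SC</a></li>
  </ul>
  <ul class="menu">
    <li><a href="#">Disclaimer</a></li>
  </ul>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Anchor on the heading text, then take only the first following <ul class="menu">.
cities <- xpathSApply(
  doc,
  "//h2[@class='widgettitle' and .='Contact']/following-sibling::ul[@class='menu'][1]/li/a",
  xmlValue, trim = TRUE
)
print(cities)
```

Anchoring on the heading does not depend on how many links the other menus contain, so it should keep working if items are added to or removed from the navigation.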

xpath expression to match a range of positions or a subset of positions
