Using R and XML, can an XPath 1.0 expression eliminate duplicates in the content returned?

Time: 2014-11-16 17:38:18

Tags: r xpath xml-parsing

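When I use XPath 1.0 to extract content from the following URL, the cities returned contain duplicates, starting again at Birmingham. (The full set of returned values is more than 140, so I truncated it.) Is there a way to avoid the duplicates using the XPath expression itself?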

require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xpathSApply(doc, "//div[@class = 'mm-location-usa']//a[position() < 12]", xmlValue, trim = TRUE)

 [1] "Birmingham"          "Mobile"              "Anchorage"           "Phoenix"             "Fayetteville"        "Fresno"             
 [7] "Irvine"              "L.A. - Century City" "L.A. - Downtown"     "Sacramento"          "San Diego"           "Birmingham"         
[13] "Mobile"              "Anchorage"           "Phoenix"             "Fayetteville"        "Fresno"              "Irvine"             
[19] "L.A. - Century City" "L.A. - Downtown"     "Sacramento"          "San Diego"

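Is there an XPath expression, or a workaround along the lines of [not-duplicate()], that would do this?

Also, the various [position() < X] permutations I tried do not return only the cities with exactly one instance of each. In fact, it is hard to work out how positions are counted.

I would appreciate any guidance, even if the finding is that the best I can do is limit the number of duplicates returned.

BTW, "XPath result with duplicates" is not the same problem, and neither are the questions about duplicate nodes, such as "How do I identify duplicate nodes in XPath 1.0 using an XPathNavigator to evaluate?"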

3 answers:

Answer 0 (score: 2)

There is a function for this, called distinct-values(), but unfortunately it is only available in XPath 2.0. In R, you are limited to XPath 1.0.

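What you can do is

//div[@class = 'mm-location-usa']//a[position() < 12 and not(. = following::a)]

What it does, in plain English:

Find div elements, but only if their class attribute equals "mm-location-usa". Find the a elements that are descendants of those div elements, but only if the position of the a element is less than 12 and its text content is not equal to the text content of any a element that follows it in the document.

The comparison is written as . = following::a because an equality test against a node-set is true if any node in the set matches. Wrapping the node-set in normalize-space() would convert it to the string value of its first node only, so just the next a element would be checked. The string values are compared as-is, so this relies on the duplicates carrying identical whitespace.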

But this is a computationally intensive approach, and not the most elegant one. I suggest you go with jlhoward's solution.
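To try a following-axis filter without depending on the live page, here is a minimal sketch on a made-up HTML fragment. The fragment and its city names are assumptions for illustration only; the real page's markup may differ, and the position() test is left out to keep the example small.

```r
library(XML)

# Made-up fragment (an assumption, not the real page): the same list of
# links appears in two blocks, mimicking the duplicated menu.
html <- "
<html><body>
  <div class='mm-location-usa'><ul>
    <li><a href='#'>Birmingham</a></li>
    <li><a href='#'>Mobile</a></li>
    <li><a href='#'>Anchorage</a></li>
  </ul></div>
  <div class='mm-location-usa'><ul>
    <li><a href='#'>Birmingham</a></li>
    <li><a href='#'>Mobile</a></li>
    <li><a href='#'>Anchorage</a></li>
  </ul></div>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Keep an <a> only if no later <a> in the document has the same string value,
# so the last occurrence of each city survives.
xp <- "//div[@class = 'mm-location-usa']//a[not(. = following::a)]"
cities <- xpathSApply(doc, xp, xmlValue, trim = TRUE)
print(cities)
```

Each link in the first block has a later twin and is dropped, so only the three links from the second block come back. Every candidate node is compared against all the nodes after it, which is where the cost comes from on a large page.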

Answer 1 (score: 1)

Can't you just do this?

require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xPath <- "//div[@class = 'mm-location-usa']//a[position() < 12]"
unique(xpathSApply(doc, xPath, xmlValue, trim = TRUE))
#  [1] "Birmingham"          "Mobile"              "Anchorage"          
#  [4] "Phoenix"             "Fayetteville"        "Fresno"             
#  [7] "Irvine"              "L.A. - Century City" "L.A. - Downtown"    
# [10] "Sacramento"          "San Diego"          
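One reason deduplicating in R can be more forgiving than doing it in XPath: the values are trimmed before they are compared. A minimal sketch on a made-up fragment (the markup and the stray spaces are assumptions for illustration, not taken from the real page):

```r
library(XML)

# Made-up fragment: the second copy of "Phoenix" differs only in whitespace.
html <- "
<div class='mm-location-usa'><a>Mobile</a><a> Phoenix </a></div>
<div class='mm-location-usa'><a>Mobile</a><a>Phoenix</a></div>"

doc <- htmlParse(html, asText = TRUE)

# trim = TRUE strips leading/trailing whitespace from each value first ...
raw <- xpathSApply(doc, "//div[@class = 'mm-location-usa']//a",
                   xmlValue, trim = TRUE)

# ... so unique() sees identical strings and keeps the first occurrence of each.
cities <- unique(raw)
print(cities)
```

unique() keeps values in order of first appearance, so the result follows the order of the first block.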

Answer 2 (score: 1)

Alternatively, you can write an XPath that goes for the li tags in the first div only (since the divs are duplicates of each other):

xpathSApply(doc, "//div[@id='lmblocks-mega-menu---locations'][1]/
            div[@class='mm-location-usa']/
            ul/
            li[@class='mm-list-item']", xmlValue, trim = TRUE)

##  [1] "Birmingham"          "Mobile"              "Anchorage"          
##  [4] "Phoenix"             "Fayetteville"        "Fresno"             
##  [7] "Irvine"              "L.A. - Century City" "L.A. - Downtown"    
## [10] "Sacramento"          "San Diego"           "San Francisco"      
## [13] "San Jose"            "Santa Maria"         "Walnut Creek"       
## [16] "Denver"              "New Haven"           "Washington, DC"     
## [19] "Miami"               "Orlando"             "Atlanta"            
## [22] "Chicago"             "Indianapolis"        "Overland Park"      
## [25] "Lexington"           "Boston"              "Detroit"            
## [28] "Minneapolis"         "Kansas City"         "St. Louis"          
## [31] "Las Vegas"           "Reno"                "Newark"             
## [34] "Albuquerque"         "Long Island"         "New York"           
## [37] "Rochester"           "Charlotte"           "Cleveland"          
## [40] "Columbus"            "Portland"            "Philadelphia"       
## [43] "Pittsburgh"          "San Juan"            "Providence"         
## [46] "Columbia"            "Memphis"             "Nashville"          
## [49] "Dallas"              "Houston"             "Tysons Corner"      
## [52] "Seattle"             "Morgantown"          "Milwaukee"        

I'm assuming here that you are after the US locations.
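A note on how [1] is counted, since it affects whether the "first block only" idea works on a given page. In //div[...][1] the index is evaluated per parent, while in (//div[...])[1] it is evaluated once over the whole result in document order. A minimal sketch on a made-up fragment (the ids 'menu' and 'footer' are hypothetical, chosen only to put the two copies under different parents):

```r
library(XML)

# Made-up fragment: the two duplicate blocks sit under different parents.
html <- "
<div id='menu'>
  <div class='mm-location-usa'><ul>
    <li class='mm-list-item'><a>Denver</a></li>
    <li class='mm-list-item'><a>Miami</a></li>
  </ul></div>
</div>
<div id='footer'>
  <div class='mm-location-usa'><ul>
    <li class='mm-list-item'><a>Denver</a></li>
    <li class='mm-list-item'><a>Miami</a></li>
  </ul></div>
</div>"

doc <- htmlParse(html, asText = TRUE)

# [1] applied per parent: each block is the first match under its own parent,
# so both blocks are selected and the duplicates remain.
per_parent <- xpathSApply(doc, "//div[@class='mm-location-usa'][1]//li",
                          xmlValue, trim = TRUE)

# [1] applied to the parenthesised node-set: only the first block in
# document order is selected.
first_only <- xpathSApply(doc, "(//div[@class='mm-location-usa'])[1]//li",
                          xmlValue, trim = TRUE)

print(per_parent)
print(first_only)
```

If the duplicate blocks on the real page share a parent, both forms select the same block; the parenthesised form is the safer choice when the structure is unknown.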