I want to select nodes from this HTML page:
doc <- htmlParse("http://eusoils.jrc.ec.europa.eu/ESDB_Archive/ESDBv3/legend/sg_attr.htm")
But I run into problems with some special characters (namely the &gt; and &lt; symbols) and get node sets of different lengths, see here:
legs <- getNodeSet(doc, "//a")
leg_names <- sapply(legs, xmlGetAttr, "name")
leg_descr <- xpathSApply(doc, "//strong", xmlValue)
# not the same length??
cbind(leg_names, leg_descr)
# different length??
getNodeSet(doc, '//text()[following-sibling::a]')
and
# why is this not working?
getNodeSet(doc, '//a[@name="AGLIM1"]/text()[following-sibling::strong')
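In the end I would like each legend (the text that follows the tag with a specific name) in a table with two columns: the first holding the value/symbol, the second holding the label.
Like this, for WRB-FULL: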
Value Label
AB Albeluvisol
ABal Alic Albeluvisol
ABap Abruptic Albeluvisol
ABar Arenic Albeluvisol
ABau Alumic Albeluvisol
ABeun Endoeutric Albeluvisol
... ... ...
Answer 0 (score: 0)
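The document is not consistently formatted: some <a> elements are not followed by a <strong> element, so there are more of the former than of the latter.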
cbind( head(leg_names,8), head(leg_descr,8) )
#      [,1] [,2]
# [1,] "AGLIM1" "AGLIM1: Code of the most important limitation to agricultural use of the STU"
# [2,] "AGLIM2" "AGLIM2: Code of a secondary limitation to agricultural use of the STU"
# [3,] "BORDER_SOIL1M" "FAO85-FULL: Full Soil Code 1974 FAO"
# [4,] "SOIL1M" "FAO85-LEV1: Soil major group code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
# [5,] "CFL" "FAO85-LEV2: Second level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
# [6,] "CL" "FAO85-LEV3: Third level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
# [7,] "COUNTRY" "FAO90-FULL:Full soil code of the STU from the 1990 FAO-UNESCO Soil Legend"
# [8,] "FAO85FU" "FAO90-LEV1: Soil major group code of the STU from the 1990 FAO-UNESCO Soil Legend"
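To see which anchors cause the mismatch, one option is to count both node sets and list the <a> elements whose next element sibling is not a <strong>. The following is only a sketch on a made-up fragment (the markup, `doc_small` and `orphans` are illustrative assumptions, not the real page), using the same XML package as above:

```r
library(XML)

# Made-up fragment, NOT the real page: anchor "B" has no <strong> of its own
html <- '<html><body><p>
<a name="A"></a><strong>A: first description</strong>
<a name="B"></a>
<a name="C"></a><strong>C: third description</strong>
</p></body></html>'
doc_small <- htmlParse(html, asText = TRUE)

# Compare the sizes of the two node sets
n_anchors <- length(getNodeSet(doc_small, "//a[@name]"))
n_strong  <- length(getNodeSet(doc_small, "//strong"))

# Names of anchors whose next element sibling is not a <strong>
orphans <- xpathSApply(
  doc_small,
  "//a[@name][not(following-sibling::*[1][self::strong])]",
  xmlGetAttr, "name"
)

n_anchors
n_strong
orphans
```

Running the same XPath against the real `doc` should show which names have to be dropped, or handled separately, before the two vectors can be bound together.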
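The following-sibling approach looks more promising, but because the <a> elements are not always immediately followed by their <strong> element, you may end up with the description of a different element.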
getNodeSet(doc, '//a[@name="AGLIM1"]/following-sibling::strong/text()')[[1]]
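The pitfall can be illustrated on a small made-up fragment (again an assumption, not the real page). The sketch compares a loose match, which takes the first <strong> anywhere among the following siblings, with a stricter one that only accepts a <strong> that is the very next element sibling; `loose`, `strict` and `doc_small` are illustrative names:

```r
library(XML)

# Made-up fragment, NOT the real page: anchor "B" has no <strong> of its own
html <- '<html><body><p>
<a name="A"></a><strong>A: first description</strong>
<a name="B"></a>
<a name="C"></a><strong>C: third description</strong>
</p></body></html>'
doc_small <- htmlParse(html, asText = TRUE)
anchors <- getNodeSet(doc_small, "//a[@name]")

# Evaluate an XPath relative to one node; NA when nothing matches
first_value <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit)) hit[[1]] else NA_character_
}

# Loose: first <strong> among all following siblings (may belong to another anchor)
loose <- sapply(anchors, first_value, path = "following-sibling::strong[1]")

# Strict: only a <strong> that is the immediately following element sibling
strict <- sapply(anchors, first_value, path = "following-sibling::*[1][self::strong]")

data.frame(name = sapply(anchors, xmlGetAttr, "name"), loose, strict)
```

In this fragment the loose variant gives anchor "B" the description that belongs to "C", while the strict variant returns NA for it, which makes the gap visible instead of hiding it.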
Another approach is to forget about the markup and treat the file as a plain text file.
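# Read the page source as plain lines of text
raw_data <- readLines("http://eusoils.jrc.ec.europa.eu/ESDB_Archive/ESDBv3/legend/sg_attr.htm")
library(stringr)
# Keep only the lines that contain an <a ...> tag followed by a <strong> tag
matches <- str_extract(raw_data, '<a .*<strong>.*')
matches <- matches[ ! is.na(matches) ]
# Capture the anchor name and the <strong> text; drop the full-match column
result <- str_match(matches, '<a name="(.*?)".*<strong>(.*)</strong>')[,-1]
head(result)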
[,1] [,2]
[1,] "AGLIM1" "AGLIM1: Code of the most important limitation to agricultural use of the STU"
[2,] "AGLIM2" "AGLIM2: Code of a secondary limitation to agricultural use of the STU"
[3,] "FAO85FU" "FAO85-FULL: Full Soil Code 1974 FAO"
[4,] "FAO85LV1" "FAO85-LEV1: Soil major group code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
[5,] "FAO85LV2" "FAO85-LEV2: Second level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
[6,] "FAO85LV3" "FAO85-LEV3: Third level soil code of the STU from the 1974 (modified CEC 1985) FAO-UNESCO Soil Legend"
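One side effect of reading the page as plain text is that HTML entities such as &gt; and &lt; are not decoded, since no HTML parser is involved. A minimal base R sketch for decoding a few common entities follows; `decode_entities` is a hypothetical helper (not part of stringr or XML), and the sample strings are made up rather than taken from the page:

```r
# Hypothetical helper (not from any package): decode a few common HTML entities
decode_entities <- function(x) {
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&quot;", "\"", x, fixed = TRUE)
  # Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<"
  gsub("&amp;", "&", x, fixed = TRUE)
}

# Made-up strings standing in for extracted legend text
sample_text <- c("&gt; 80 cm", "slope &lt; 8 %", "A &amp; B")
decoded <- decode_entities(sample_text)
decoded
```

The same function could be applied to a column of `result`, for example `decode_entities(result[, 2])`. It only covers the entities listed in the helper, so anything else would need to be added by hand.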