Question

I have debugged my program many times, trying to get the following result:

url                  研究所知识库列表
/handle/1471x/1      力学研究所
/handle/1471x/8865   半导体研究所

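However, no matter what parameters I use, the result is incorrect. The contents of this table are part of the basis for my further analysis, so being stuck here is very frustrating. I would really appreciate your help.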

    ## download community-list --- the 1st level of IR Grid
    library(XML)
    # load the webpage and parse it
    community_url <- "http://www.irgrid.ac.cn/community-list"
    com_source <- readLines(community_url, encoding = "UTF-8")
    com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
    # get table specs
    tableNodes <- getNodeSet(com_parsed, "//table")
    com_tb <- readHTMLTable(tableNodes[[8]], header = TRUE)
    # get external links
    xpath <- "//a/@href"
    getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)

Answer 1

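It's not entirely clear what you want the final result to look like, but if you modify your XPath expressions a little to take advantage of the DOM structure, you can get something like this: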

    library(XML)
    community_url <- "http://www.irgrid.ac.cn/community-list"
    com_source <- readLines(community_url, encoding = "UTF-8")
    com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
    # heading of the table that contains the <li> items
    list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
    # link targets and link text of every community entry
    hrefs <- xpathSApply(com_parsed, '//li[@class="communityLink"]//@href', unname)
    display_text <- xpathSApply(com_parsed, '//li[@class="communityLink"]//a', xmlValue)
    # combine both into a two-column matrix
    table_data <- cbind(display_text, hrefs)
    colnames(table_data) <- c(list_header, "url")
    table_data
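
If you would rather have a data frame with the `url` column first, as in the layout shown in the question, a variation along these lines might work. This is only a sketch: `sample_html` is a hypothetical stand-in for the real page (the actual markup may differ), and `community_tb` and the `name` column are names made up for illustration. It selects each `<a>` node once and reads both the `href` and the text from that same node, so the two columns stay aligned.

```r
library(XML)

# Hypothetical stand-in for the community-list page; the real markup may differ.
sample_html <- '
<html><body>
<table><tr><td>
  <h1>研究所知识库列表</h1>
  <ul>
    <li class="communityLink"><a href="/handle/1471x/1">力学研究所</a></li>
    <li class="communityLink"><a href="/handle/1471x/8865">半导体研究所</a></li>
  </ul>
</td></tr></table>
</body></html>'

doc <- htmlParse(sample_html, asText = TRUE, encoding = "UTF-8")

# Select each link node once, then read href and text from the same node
links <- getNodeSet(doc, '//li[@class="communityLink"]//a[@href]')

community_tb <- data.frame(
  url  = sapply(links, xmlGetAttr, "href"),
  name = sapply(links, xmlValue),
  stringsAsFactors = FALSE
)
community_tb
```

To try it against the live page, replace `doc` with the `com_parsed` object from the code above.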

Pasting the console output makes Stack Overflow think this answer is spam, so here is a screenshot instead:
