Question

I am reading data from this link.

> library(XML)
> url <- "http://biostat.jhsph.edu/~jleek/contact.html"
> html <- htmlTreeParse(url, useInternalNodes=T)

Then I want to extract the tenth line from it and count its characters. How can I do that?

Answer 1

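Is this what you are looking for? Find the tenth line from the top of the HTML (the div with id = main), extract its value, and count the characters in the extracted text.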

> url <- "http://biostat.jhsph.edu/~jleek/contact.html"
> html <- htmlTreeParse(url, useInternalNodes=T)
> xpathSApply(html, "//div[@id = 'main']", xmlValue, trim = TRUE)
[1] "Contact Information\n\n\t\t\t  Address \n\t\t\t  \n\t\t\t  Johns Hopkins University \n\t\t\t  Bloomberg School of Public Health \n\t\t\t  615 North Wolfe Street \n\t\t\t  Baltimore, MD 21205-2179 \n\t\t\t  Phone\n\t\t\t  410-955-1166 (I am much easier to reach by email)\n\t\t\t  Fax\n\t\t\t  410-955-0958\n\t\t\t  Email\n\t\t\t   jleek || jhsph dot edu \n\t\t\t  Twitter\n\t\t\t   @leekgroup\n\t\t\t  Blog\n\t\t\t   Simply Statistics"

Then wrap the call above in nchar() and assign the result to an object, here named characters.

> characters <- nchar(xpathSApply(html, "//div[@id = 'main']", xmlValue, trim = TRUE))
> characters
[1] 369
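
If the goal is literally the tenth line of the raw page source rather than the text of the whole div, a minimal sketch with base R's readLines() might look like the following. The sample_html vector is an invented stand-in for the page, not its real content, and the names sample_html, page_lines and tenth_line are illustrative assumptions; with network access the URL could be read in place of the text connection.

```r
# Invented stand-in for the page source (an assumption, not the real page)
sample_html <- c(
  "<!DOCTYPE html>",
  "<html>",
  "<head>",
  "<title>Contact</title>",
  "</head>",
  "<body>",
  "<div id=\"main\">",
  "<h1>Contact Information</h1>",
  "<h2>Address</h2>",
  "<p>Johns Hopkins University</p>",
  "</div>",
  "</body>",
  "</html>"
)

# Read the source line by line; readLines() returns one string per line
con <- textConnection(sample_html)
page_lines <- readLines(con)
close(con)

# Pick the tenth line and count its characters
tenth_line <- page_lines[10]
nchar(tenth_line)
```

Indexing the character vector returned by readLines() keeps the markup of that line, so the count includes tags, unlike the xmlValue approach above.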

You can use gsub() to remove the tab and newline characters.
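
As a rough sketch of that cleanup, assuming a shortened sample string in place of the real extracted value (raw_text and clean_text are illustrative names), each run of whitespace can be collapsed to a single space before counting:

```r
# Shortened sample resembling the extracted value (an assumption for illustration)
raw_text <- "Contact Information\n\n\t\t\t  Address \n\t\t\t  Phone"

# Replace each run of tabs, newlines and spaces with one space,
# then trim any whitespace left at the ends
clean_text <- trimws(gsub("[[:space:]]+", " ", raw_text))

clean_text
nchar(clean_text)
```

The character count after cleaning will be smaller than the count reported for the raw value, since the tabs and newlines no longer contribute.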
