I want to filter out the values in x that contain the word "India".
library(RCurl)  # provides getURL()
library(XML)    # provides htmlTreeParse() and xpathSApply()
webpage <- getURL("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]
Answer 0 (score: 2)
We can use rvest, which I find easier for scraping.
library(rvest)
url <- "https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm"
url %>%
read_html %>%
html_text() %>%
strsplit("\n") %>%
.[[1]] %>%
grep("India", ., value = TRUE)
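The same `grep()` call can also be applied directly to a character vector such as the `x` built in the question. Below is a minimal sketch, assuming a small made-up vector in place of the real scraped data (the values are hypothetical, not taken from the filing). It also shows that a plain substring match picks up strings like "Indiana", so a word-boundary pattern may be closer to what is wanted when filtering on the word itself.

```r
# Hypothetical stand-in for the cleaned-up vector `x` from the question
x <- c("United States", "India", "Microsoft India Private Limited",
       "Indiana", "Ireland")

# Plain substring match: keeps every element containing "India",
# which also includes "Indiana"
substring_hits <- grep("India", x, value = TRUE, fixed = TRUE)

# Whole-word match: \\b marks a word boundary, so "Indiana" is excluded
whole_word_hits <- grep("\\bIndia\\b", x, value = TRUE, perl = TRUE)

print(substring_hits)
print(whole_word_hits)
```

If a logical index is more convenient than the matched values, `x[grepl("India", x)]` gives the same result as the substring version.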