How to filter values for a specific word

Time: 2019-09-22 03:56:17

Tags: r web-scraping

I want to keep only the values in x that contain the word "India".

library(RCurl)  # getURL()
library(XML)    # htmlTreeParse(), xpathSApply(), xmlValue()

# download the page and split it into lines
webpage <- getURL("https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
# trim leading and trailing whitespace
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
# drop empty strings and lone "|" separators
x <- x[!(x %in% c("", "|"))]

1 Answer:

Answer 0: (score: 2)

We can use rvest, which I find easier for scraping.

library(rvest)
url <- "https://www.sec.gov/Archives/edgar/data/789019/000156459019027952/msft-10k_20190630.htm"

url %>%
  read_html %>%
  html_text() %>%
  strsplit("\n") %>%
  .[[1]] %>%
  grep("India", ., value = TRUE)
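
If you already have the character vector x from the question, the same grep() call can be applied to it directly. Below is a minimal sketch; the sample strings are made up for illustration and stand in for the scraped lines. Note that a plain substring match also picks up words such as "Indiana", so a whole-word pattern may be closer to what you want.

```r
# Hypothetical sample data standing in for the scraped vector x
x <- c("Revenue in India grew", "United States", "india office",
       "Indiana plant", "Ireland")

# Case-sensitive substring match: keeps every element containing "India"
substring_hits <- grep("India", x, value = TRUE, fixed = TRUE)

# Whole-word, case-insensitive match: \\b marks a word boundary
word_hits <- grep("\\bindia\\b", x, value = TRUE,
                  ignore.case = TRUE, perl = TRUE)

# Equivalent logical-index form using grepl()
logical_hits <- x[grepl("India", x, fixed = TRUE)]

print(substring_hits)
print(word_hits)
```

With value = TRUE, grep() returns the matching strings themselves instead of their positions; grepl() returns a logical vector, which is handy when you want to combine several conditions.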