How to search a column of links for string matches in R?

Time: 2017-07-26 18:41:02

Tags: r paste grepl

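I have a data table with a list of .txt links in a single column. I am looking for a way to have R search within each link and check whether the file contains either of the strings "discount rate" or "discounted cash flow". I would then like R to create 2 columns next to each link (one for discount rate, one for discounted cash flow) that hold a 1 if the string is present and a 0 if it is not.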

[Image: current table with links in column websiteURL]

[Image: what I want my table to look like]

Here is a small sample of the links I want to screen:

http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt
http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt
http://www.sec.gov/Archives/edgar/data/1063761/0001047469-04-028294.txt
http://www.sec.gov/Archives/edgar/data/1230588/0001178913-09-000260.txt
http://www.sec.gov/Archives/edgar/data/1288246/0001193125-04-155851.txt
http://www.sec.gov/Archives/edgar/data/1436866/0001172661-09-000349.txt
http://www.sec.gov/Archives/edgar/data/1089044/0001047469-04-026535.txt
http://www.sec.gov/Archives/edgar/data/1274057/0001047469-04-023386.txt
http://www.sec.gov/Archives/edgar/data/1300379/0001047469-04-026642.txt
http://www.sec.gov/Archives/edgar/data/1402440/0001225208-09-007496.txt
http://www.sec.gov/Archives/edgar/data/35527/0001193125-04-161618.txt

1 Answer:

Answer 0: (score: 2)

Maybe something like this...

# Return 1 if any line of `file` contains `text` (case-insensitive), else 0
checktext <- function(file, text) {
  filecontents <- readLines(file)
  as.numeric(any(grepl(text, filecontents, ignore.case = TRUE)))
}

# One pass over the files per search string
df$DR <- sapply(df$file_name, checktext, "discount rate")
df$DCF <- sapply(df$file_name, checktext, "discounted cash flow")

Thanks to Gregor's comment below, a faster version, which reads each file only once, would be

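# `text` may now be a vector of search strings; each file is read once
# and the result is one 0/1 value per search string
checktext <- function(file, text) {
  filecontents <- readLines(file)
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents, 
               ignore.case = TRUE))))
}

# sapply() returns one column per file, so transpose to get one row per file
df[,c("DR","DCF")] <- t(sapply(df$file_name, checktext, 
                             c("discount rate", "discounted cash flow")))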
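
To try the single-read approach without any network access, here is a minimal self-contained sketch. The two temporary files and their contents are made up purely for illustration, and `warn = FALSE` is an addition that only silences the warning `readLines` gives for a missing final newline.

```r
# Read each file once and test every search string against its lines.
checktext <- function(file, text) {
  filecontents <- readLines(file, warn = FALSE)
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents, ignore.case = TRUE))))
}

# Hypothetical sample files, created only for this demonstration.
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("We applied a Discount Rate of 8%.", "No valuation model is described."), f1)
writeLines(c("Valuation uses a discounted cash flow model.", "The discount rate was 10%."), f2)

df <- data.frame(file_name = c(f1, f2), stringsAsFactors = FALSE)

# One row per file, one 0/1 column per search string.
df[, c("DR", "DCF")] <- t(sapply(df$file_name, checktext,
                                 c("discount rate", "discounted cash flow")))
print(df[, c("DR", "DCF")])
```

The first file mentions only the discount rate, while the second mentions both phrases, so the two rows should differ in the DCF column.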

Or, if you are doing this from URLs rather than local files, replace df$file_name with df$websiteURL in the above. It worked for me on the short list you provided.
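
When reading from many URLs, a single link that cannot be opened would stop the whole `sapply` call with an error. One possible way around that, sketched here under my own assumptions, is to wrap `readLines` in `tryCatch` and return `NA` for that row. The name `checktext_safe` is hypothetical and not part of the answer above, and the demo uses a temporary file plus a path that does not exist in place of real links.

```r
# Like checktext(), but returns NA for every search string when the
# source cannot be read, instead of raising an error.
checktext_safe <- function(file, text) {
  filecontents <- tryCatch(
    suppressWarnings(readLines(file, warn = FALSE)),
    error = function(e) NULL
  )
  if (is.null(filecontents)) {
    return(setNames(rep(NA_real_, length(text)), text))
  }
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents, ignore.case = TRUE))))
}

patterns <- c("discount rate", "discounted cash flow")

# Stand-ins for links: one readable file and one path that was never created.
good <- tempfile(fileext = ".txt")
writeLines("The discount rate was 10%.", good)
absent <- tempfile(fileext = ".txt")

df <- data.frame(websiteURL = c(good, absent), stringsAsFactors = FALSE)
df[, c("DR", "DCF")] <- t(sapply(df$websiteURL, checktext_safe, patterns))
print(df[, c("DR", "DCF")])
```

Rows with `NA` can then be retried or inspected separately, which keeps "could not be read" distinct from "string not found".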