Get a specific part of a URL

Time: 2018-05-29 06:57:24

Tags: r regex

Here are a few URLs. I want to extract a specific number from each one.

https://www.sec.gov/Archives/edgar/data/1002638/000100263816000080/exhibit211subsidiarylisting.htm
http://www.sec.gov/Archives/edgar/data/1013871/000101387113000003/exhibit21110k2012.htm
http://www.sec.gov/Archives/edgar/data/1420800/000142080014000006/exhibit211subsidiariesofth.htm
http://www.sec.gov/Archives/edgar/data/1305014/000130501415000119/a9302015exhibit21.htm

I would like to get the following output:

1002638
1013871
1420800
1305014

Could you help me solve this?

2 answers:

Answer 0: (score: 1)

I would do it like this:

myurl <- c("https://www.sec.gov/Archives/edgar/data/1002638/000100263816000080/exhibit211subsidiarylisting.htm",
       "http://www.sec.gov/Archives/edgar/data/1013871/000101387113000003/exhibit21110k2012.htm", 
       "http://www.sec.gov/Archives/edgar/data/1420800/000142080014000006/exhibit211subsidiariesofth.htm", 
       "http://www.sec.gov/Archives/edgar/data/1305014/000130501415000119/a9302015exhibit21.htm")

# split each string into substrings, with the forward slashes as separators
# ("https:", "", "www.sec.gov", "Archives", "edgar", "data", "1002638", ...)
# then take the seventh element of each result
unlist(lapply(myurl, function(u) strsplit(u, "/")[[1]][7]))

# [1] "1002638" "1013871" "1420800" "1305014"
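
Since the question is tagged regex, the same extraction could also be sketched with a regular expression. This is only a minimal sketch, assuming every URL contains `/edgar/data/` followed by a run of digits and a slash; the variable name `ids` is made up for illustration.

```r
# Sample URLs copied from the question
myurl <- c("https://www.sec.gov/Archives/edgar/data/1002638/000100263816000080/exhibit211subsidiarylisting.htm",
           "http://www.sec.gov/Archives/edgar/data/1013871/000101387113000003/exhibit21110k2012.htm",
           "http://www.sec.gov/Archives/edgar/data/1420800/000142080014000006/exhibit211subsidiariesofth.htm",
           "http://www.sec.gov/Archives/edgar/data/1305014/000130501415000119/a9302015exhibit21.htm")

# Capture the digits that directly follow "/edgar/data/" and replace the
# whole string with that captured group.
# Assumption: each URL matches this shape; a URL that does not match is
# returned unchanged by sub(), so check the result if the input may vary.
ids <- sub("^.*/edgar/data/([0-9]+)/.*$", "\\1", myurl)
ids
# [1] "1002638" "1013871" "1420800" "1305014"
```

Unlike taking the seventh element, this does not depend on how many path segments come before the number, only on the `/edgar/data/` prefix being present.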

Answer 1: (score: 0)

Read the text with sep = "/", then take the relevant column:

df1 <- read.table(text = "
https://www.sec.gov/Archives/edgar/data/1002638/000100263816000080/exhibit211subsidiarylisting.htm
http://www.sec.gov/Archives/edgar/data/1013871/000101387113000003/exhibit21110k2012.htm
http://www.sec.gov/Archives/edgar/data/1420800/000142080014000006/exhibit211subsidiariesofth.htm
http://www.sec.gov/Archives/edgar/data/1305014/000130501415000119/a9302015exhibit21.htm
                  ", sep = "/")


df1$V7
# [1] 1002638 1013871 1420800 1305014
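
Note that `read.table` guesses column types, so `V7` comes back as integers here, whereas the desired values might be wanted as text. A possible variation, sketched under the assumption that every URL splits into the same number of `/`-separated fields, is to pass the URLs as a character vector and force all columns to character with `colClasses`. The name `urls` is made up for illustration.

```r
# Sample URLs copied from the question
urls <- c("https://www.sec.gov/Archives/edgar/data/1002638/000100263816000080/exhibit211subsidiarylisting.htm",
          "http://www.sec.gov/Archives/edgar/data/1013871/000101387113000003/exhibit21110k2012.htm",
          "http://www.sec.gov/Archives/edgar/data/1420800/000142080014000006/exhibit211subsidiariesofth.htm",
          "http://www.sec.gov/Archives/edgar/data/1305014/000130501415000119/a9302015exhibit21.htm")

# text = accepts a character vector (one element per input line);
# colClasses = "character" stops read.table from converting V7 to integer,
# which would also drop any leading zeros.
df1 <- read.table(text = urls, sep = "/", colClasses = "character")

df1$V7
# [1] "1002638" "1013871" "1420800" "1305014"
```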