Extracting a parameter from a URL in R

Asked: 2018-04-03 10:56:15

Tags: r regex substring gsub

I want to extract the 'destinationId' parameter from a batch of URLs.

If I have a URL like this:

https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub

How can I extract the 45? (destinationId=45)

I tried something like this, but could not get it to work:

destinationIdParameter <- sub("[^0-9].*","",sub("*?\\destinationId=","",url))

5 answers:

Answer 0 (score: 2)

With stringr, you can do this:

> library(stringr)
> address <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> str_match(address, "destinationId=(.*?)&")[,2]
[1] "45"

If (like me) you are not comfortable with regular expressions, use the qdapRegex package:

> library(qdapRegex)
> address <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> ex_between(address, "destinationId=", "&")
[[1]]
[1] "45"

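Answer 1 (score: 1)

With base R, you can extract the number in several ways. If you are sure that URLs of this kind contain only one number, you can remove every character that is not a digit: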

> url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
> gsub("[^0-9]", "", url)
[1] "45"

Or, if you want to be safer and pick out only the number that follows "destinationId=" and nothing else, you could do something like this:

# Pull out the "destinationId=<digits>" substring, then keep only the digits
destId <- regmatches(url, gregexpr("destinationId=\\d+", url))
gsub("[^0-9]", "", destId)
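
Since the question mentions a batch of URLs, here is a minimal sketch of the same idea applied to a whole character vector. The function name extract_dest_id and the two example.com URLs are made up for illustration. It uses regexpr() (first match per URL) instead of gregexpr(), and fills in NA where the parameter is missing, because regmatches() silently drops elements without a match.

```r
# Hypothetical helper: one destinationId per URL, NA when the parameter is absent
extract_dest_id <- function(urls) {
  m <- regexpr("destinationId=\\d+", urls)   # first match in each URL, -1 if none
  hit <- as.integer(m) != -1L                # TRUE where a match was found
  out <- rep(NA_character_, length(urls))
  # regmatches() returns the matched substrings for matching elements only
  out[hit] <- sub("destinationId=", "", regmatches(urls, m), fixed = TRUE)
  out
}

urls <- c(
  "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub",
  "https://example.com/?foo=1",              # made-up URL without the parameter
  "https://example.com/?destinationId=7"     # made-up URL with the parameter last
)
ids <- extract_dest_id(urls)
print(ids)
```

The result has the same length as the input, so it can be stored as a column next to the URLs.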

Answer 2 (score: 1)

If you want to extract the destinationId value from the URL, you can do the following:

gsub(".+destinationId=(\\d+).+", "\\1", url)

Here \\1 refers to whatever the parentheses () captured.
.+ matches any sequence of one or more characters.
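
One thing to watch with this pattern: because the trailing .+ needs at least one character, a URL that ends right after the number loses its last digit to it. The sketch below compares the pattern with a slightly relaxed variant (.* on both sides, plus [?&] in front of the name). The second URL and the variable names are made up for illustration.

```r
urls <- c(
  "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub",
  "https://urlaub.xxx.de/lastminute/?semcid=de.ub&destinationId=45"  # made-up: parameter comes last
)

# Pattern from the answer: the final .+ must consume something,
# so in the second URL \\d+ can only keep the "4"
original <- sub(".+destinationId=(\\d+).+", "\\1", urls)
print(original)  # "45" "4"

# Relaxed variant: .* may match nothing, and [?&] ties the name to a parameter boundary
variant <- sub(".*[?&]destinationId=(\\d+).*", "\\1", urls)
print(variant)   # "45" "45"
```

With either pattern, sub() returns the input string unchanged when nothing matches, so it is worth checking with grepl() first if some URLs may lack the parameter.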

Answer 3 (score: 0)

With base R, we can do:

url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"

extract <- function(url) {
  pattern <- "destinationId=\\K\\d+"
  (id <- regmatches(url, regexpr(pattern, url, perl = TRUE)))
}

print(extract(url))

Or (without perl = TRUE):

vanilla_extract <- function(url) {
  pattern <- "destinationId=([^&]+)"
  (regmatches(url, regexec(pattern, url))[[1]][2])
}

Both yield

[1] "45"

Answer 4 (score: 0)

I think the best way is parameters():

library(urltools)
example_url <- "http://en.wikipedia.org/wiki/Aaron_Halfaker?debug=true"
parameters(example_url)
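
Note that the example above runs on a different URL, and as far as I can tell parameters() returns the query string as a whole rather than a single value. If one named value is needed without any extra package, the query string can be split by hand. The following is only a rough base R sketch: get_query_param is a made-up helper name, and it assumes a single URL with plain key=value pairs separated by "&".

```r
# Hypothetical helper: value of one query parameter, NA if it is not present
get_query_param <- function(url, name) {
  # Drop everything up to the first "?", then drop any "#fragment"
  query <- sub("#.*$", "", sub("^[^?]*\\??", "", url))
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)          # text before the first "="
  values <- sub("^[^=]*=?", "", pairs)    # text after the first "="
  hit <- match(name, keys)                # position of the first matching key
  if (is.na(hit)) NA_character_ else utils::URLdecode(values[hit])
}

url <- "https://urlaub.xxx.de/lastminute/europa/zypern-griechenland/?destinationId=45&semcid=de.ub"
dest_id <- get_query_param(url, "destinationId")
print(dest_id)
```

Matching on the exact key name avoids picking up a parameter whose name merely ends in "destinationId", which a bare regular expression could do.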