Regular expression for currency with large amounts

Asked: 2018-07-30 05:42:36

Tags: r regex

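I am trying to write an expression that extracts amounts from a string, together with the leading currency symbol and an optional magnitude abbreviation (m or k):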

library(stringr)

text <- "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m"
str_extract(text, "\\$(\\d+)[a-z]+") # solution_1
str_extract(text, "\\$(\\d+)+")      # solution_2

Desired output:

"$10000 $10,000 $5m $50m $50.2m $50,2m"

The problem is that solution_1 extracts only "$5m", while solution_2 extracts only "$10000".
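For reference, this behavior can be reproduced in base R. `str_extract` returns only the first match in each string (stringr offers `str_extract_all` for every match), so each attempt stops at the first substring its pattern accepts. The sketch below uses `regexpr` and `gregexpr` as stand-ins for those two functions; the variable names `first_1`, `first_2`, and `all_1` are illustrative only.

```r
text <- "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m"

# regexpr() locates only the first match, mirroring str_extract()
first_1 <- regmatches(text, regexpr("\\$(\\d+)[a-z]+", text, perl = TRUE))
first_2 <- regmatches(text, regexpr("\\$(\\d+)+", text, perl = TRUE))

# gregexpr() locates every match, mirroring str_extract_all()
all_1 <- regmatches(text, gregexpr("\\$(\\d+)[a-z]+", text, perl = TRUE))[[1]]

print(first_1)
print(first_2)
print(all_1)
```

Even with all matches returned, solution_1 still skips amounts without a letter suffix and amounts containing a comma or period, so the pattern itself also needs to change.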

Update: @Tim Biegeleisen provided a great solution. I am also trying to get rid of a trailing period, e.g. for `$50m. and ...` I want to get `$50m`.

text <- "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m"
m <- gregexpr("\\$[0-9.,]+?[mbt]?(?=(?:, | |$))", text, perl=TRUE)
regmatches(text, m)

3 answers:

Answer 0 (score: 3)

Try using gregexpr together with regmatches:

text <- "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m"
m <- gregexpr("\\$[0-9.,]+[mbt]?", text)

regmatches(text, m)
[[1]]
[1] "$10000"  "$10,000" "$5m"     "$50m"    "$50.2m"  "$50,2m"


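I am assuming that an amount string consists only of digits, commas, and decimal points. I am also assuming that the amount may end in m, b, or t (million, billion, trillion).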
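Under those same assumptions, one possible way to handle the trailing punctuation mentioned in the question's update is to accept a comma or period only when a digit follows it. This is a sketch rather than a definitive pattern; the names `pattern` and `amounts` are illustrative, and `perl = TRUE` is used because of the non-capturing group.

```r
text <- c("$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m",
          "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m")

# Digits, then any number of (separator + digits) groups, then an optional
# m/b/t suffix. A comma or period not followed by a digit is left out.
pattern <- "\\$[0-9]+(?:[.,][0-9]+)*[mbt]?"

amounts <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
print(amounts)
```

Because the separator must be followed by at least one digit, `$50m.` yields `$50m` and `$5,` yields `$5`, while `$50.2m` and `$10,000` are kept whole.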

Answer 1 (score: 0)

You could also do it like this, for example:

txt = unlist(strsplit(text, split = " "))
txt[grep("\\$\\d+((,|\\.)?)(\\d*)?(m)?", txt)]

[1] "$10000"  "$10,000" "$5m"     "$50m"    "$50.2m"  "$50,2m" 
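Note that this approach returns whole space-separated tokens, so with the second example string from the question any trailing comma or period would stay attached. A minimal sketch of one way around that, assuming a simplified filter (tokens that start with `$` and a digit) instead of the pattern above; `tokens` and `money` are illustrative names:

```r
text <- "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m"

tokens <- unlist(strsplit(text, split = " ", fixed = TRUE))
# Keep tokens that start with a dollar sign followed by a digit
money <- tokens[grepl("^\\$[0-9]", tokens)]
# Drop any commas or periods left at the end of a token
money <- sub("[.,]+$", "", money)
print(money)
```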

Answer 2 (score: 0)

Perhaps we can use gsub, since the OP's expected output is shown as a single string:

gsub("\\b[A-Za-z]+,?|[,.](\\s)", "\\1", text)
#[1] "$10000  $10,000  $5m  $50m  $50.2m  $50,2m"
#[2] "$5 $10,000  $5m  $50m  $50.2m  $50,2m"     
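The result above still has double spaces where the words were removed. If single spaces are wanted, as in the desired output, a second gsub can squeeze them. This is a small sketch on top of the call above; `stripped` and `single` are illustrative names.

```r
text <- c("$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m",
          "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m")

# Remove the words and any comma or period that precedes whitespace
stripped <- gsub("\\b[A-Za-z]+,?|[,.](\\s)", "\\1", text)
# Collapse each run of spaces into a single space
single <- gsub(" +", " ", stripped)
print(single)
```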

Data

text <- c( "$10000 and $10,000 and $5m and $50m and $50.2m and $50,2m",
      "$5, $10,000, and $5m, and $50m. and $50.2m and $50,2m")