Find the number closest to the position of a specific string

Time: 2020-06-22 13:02:44

Tags: r regex stringr

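I am trying to get salary data out of a long vector of (very long) strings. My intuition is to find the position of the dollar marker ($, USD, Dollar) and then extract the number that is closest (in position) to that marker.

I cannot simply extract every number from the strings, because the strings do not follow a fixed scheme: not all numbers are salary figures, and the relative position of the dollar marker and the number varies.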

Some example data and dollar names:

dollarnames <- tolower(c("USD", "Dollar", "[$]"))

salarylist <- c("Earn USD 5 per hour with us It is a lot of fun and you only have to work for 6 hours per day. We pay more USD than our competitors.",
                "You can become rich, too. Earn 50.000 Dollar per month and enjoy 60.000 pieces of cake per day. Enjoy Dollar! ",
                "Do you want to earn a lot of $? Then come and work with us. Earn $ 120.000 per year")

I would like to get this as output:

# earnings
# 1        5
# 2    50000
# 3   120000

I guess str_locate could help in one way or another:

purrr::map(dollarnames, stringr::str_locate, string = tolower(salarylist))

Thanks a lot for your help!

2 answers:

Answer 0: (score: 3)

You can use a regex such as

(?i)(?<=(?:usd|dollar|[$])\s{0,100})\d+(?:\.\d+)?|\d+(?:\.\d+)?(?=\s*(?:usd|dollar|[$]))

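Use it with stringr::str_extract or str_extract_all. See the regex demo.

Details

  • (?i) - case-insensitive matching
  • (?<=(?:usd|dollar|[$])\s{0,100}) - a constrained-width lookbehind that matches a position immediately preceded by usd, dollar, or $, followed by 0 to 100 whitespace characters (in case there is whitespace between the currency marker and the number)
  • \d+(?:\.\d+)? - the "price" pattern: 1+ digits followed by an optional sequence of . and 1+ digits
  • | - or
  • \d+(?:\.\d+)? - the same "price" pattern
  • (?=\s*(?:usd|dollar|[$])) - a positive lookahead that matches a position immediately followed by 0+ whitespace characters and then usd, dollar, or $.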
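To see what each side of the alternation contributes, the two branches can be tried separately. This is only a minimal sketch: the names `currency`, `price`, `after_marker`, and `before_marker` and the short test strings are illustrative and not part of the original answer.

```r
library(stringr)

currency <- "(?:usd|dollar|[$])"
price    <- "\\d+(?:\\.\\d+)?"

# Branch 1: a number that comes after a currency marker (bounded lookbehind)
after_marker  <- paste0("(?i)(?<=", currency, "\\s{0,100})", price)
# Branch 2: a number that comes before a currency marker (lookahead)
before_marker <- paste0("(?i)", price, "(?=\\s*", currency, ")")

r1 <- str_extract("Earn USD 5 per hour", after_marker)            # "5"
r2 <- str_extract("Earn 50.000 Dollar per month", before_marker)  # "50.000"
# No currency marker next to the number, so neither branch matches
r3 <- str_extract("work for 6 hours per day", after_marker)       # NA
r4 <- str_extract("work for 6 hours per day", before_marker)      # NA
```

Only whitespace may sit between the marker and the number in either branch, which is why a number such as the 6 in "6 hours" is ignored.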

R demo

dollarnames <- tolower(c("USD", "Dollar", "[$]"))

salarylist <- c("Earn USD 5 per hour with us It is a lot of fun and you only have to work for 6 hours per day. We pay more USD than our competitors.",
                "You can become rich, too. Earn 50.000 Dollar per month and enjoy 60.000 pieces of cake per day. Enjoy Dollar! ",
                "Do you want to earn a lot of $? Then come and work with us. Earn $ 120.000 per year")

library(stringr)
d <- paste0("(?:",paste(dollarnames, collapse="|"), ")")
price <- "\\d+(?:\\.\\d+)?"
rx <- paste0("(?i)(?<=", d, "\\s{0,100})", price, "|", price, "(?=\\s*", d, ")")
str_match(salarylist, rx)

Output:

[1,] "5"      
[2,] "50.000" 
[3,] "120.000"
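The question asks for numeric values (5, 50000, 120000), while the result above is still character data. A minimal sketch of the conversion, assuming that the dot in these strings is a thousands separator (as the desired output suggests) and never a decimal point; the vector `extracted` simply repeats the values shown above:

```r
# Values as returned by the regex above
extracted <- c("5", "50.000", "120.000")

# Drop the thousands separator, then convert to numeric.
# Assumption: "." never marks decimals in this data.
earnings <- as.numeric(gsub(".", "", extracted, fixed = TRUE))

data.frame(earnings)
```

If some postings used real decimals (for example "12.50"), this step would need a different rule, since removing every dot would then inflate the value.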

Answer 1: (score: 1)

If we focus on the word "per", we can extract the values with the following code:

stringr::str_extract(salarylist , "\\d+.*?per \\w+")

[1] "5 per hour" "50.000 Dollar per month" "120.000 per year"
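The matches above still contain the pay period and, in one case, the currency word. As a rough sketch (the variable names are illustrative, and the dot is again assumed to be a thousands separator), the amount and the period could be split in base R like this:

```r
per_matches <- c("5 per hour", "50.000 Dollar per month", "120.000 per year")

# Leading number of each match
amount_chr <- regmatches(per_matches,
                         regexpr("^\\d+(?:\\.\\d+)?", per_matches, perl = TRUE))
# Assumption: "." is a thousands separator, not a decimal point
earnings <- as.numeric(gsub(".", "", amount_chr, fixed = TRUE))

# Word following "per"
period <- sub(".*per ", "", per_matches)

data.frame(earnings, period)
```

Note that the pattern in this answer starts at the first digit run in each string, so it depends on the salary figure appearing before any other number.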