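Struggling with regular expressions. I'm new to regex, and I've created a basic example data frame below. I'm trying to use tidyr's extract function to pull the hourly wage out of Hourly.Pay into a new column named Hourly.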
Name <- c("Client1","Client2","Client3","Client4","Client5","Client6","Client7","Client8","Client9","Client10","Client11","Client12","Client13")
Hourly.Pay <- c("$14.00","$14","$20.22","$18.00/Hour","$15","19/hourly","$40,000","$345.00","$1920/month","$11.25","12.75 hr","67K/year","15.25")
Pay<-data.frame(Name,Hourly.Pay)
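Below is my regex so far, which almost works. I haven't been able to capture the entries that have no period after the first two digits. I need to capture an optional dollar sign, then two digits, followed by either a period and at least two more digits, or no period and no further digits.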
Pay2 <- extract(Pay, Hourly.Pay, "Hourly", "^(\\$?\\d{2}\\.\\d*)",remove=FALSE)
Any help would be appreciated. If possible, an explanation of the regex characters would be great too.
Thanks!
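A quick base R check (a sketch, not part of the original question) shows which sample entries the pattern above accepts. It uses `grepl` with `perl = TRUE` instead of `tidyr::extract`, so the regex engine here may differ slightly from the one tidyr uses. The `\\.` in the pattern is mandatory, so any entry without a period right after the first two digits is rejected.

```r
# Sample values copied from the question
Hourly.Pay <- c("$14.00", "$14", "$20.22", "$18.00/Hour", "$15", "19/hourly",
                "$40,000", "$345.00", "$1920/month", "$11.25", "12.75 hr",
                "67K/year", "15.25")

# The pattern from the question: optional "$", two digits, a REQUIRED period,
# then zero or more digits
pattern <- "^(\\$?\\d{2}\\.\\d*)"

# TRUE where the pattern matches at the start of the string
matched <- grepl(pattern, Hourly.Pay, perl = TRUE)

# Entries the pattern accepts and rejects
accepted <- Hourly.Pay[matched]
rejected <- Hourly.Pay[!matched]
print(accepted)
print(rejected)
```

The rejected set includes "$14", "$15" and "19/hourly", which are hourly wages the question wants to keep.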
Answer 0 (score: 1)
Here is an answer. I took the liberty of cleaning up your data. It is ridiculous; you need better data management.
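library(rex)
library(tidyr)
library(dplyr)
library(stringi)

# Lookup table: multiplier for an optional "K" (thousands) suffix.
# tibble() replaces the deprecated data_frame().
scale = tibble(scale = c("K", ""),
               scale_value = c(1000, 1))

# Lookup table: hours per pay period; 40/7*30 reads as a 40-hour week over a 30-day month
time_unit = tibble(time_unit = c("", "hr", "hour", "hourly", "month", "year"),
                   time_value = c(1, 1, 1, 1, 40/7*30, 40/7*30*12))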
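interpretation =
  rex("$" %>% maybe,                          # optional literal dollar sign
      some_of(number, ".", ",") %>% capture,  # group 1: amount (digits, periods, commas)
      "K" %>% maybe %>% capture,              # group 2: optional thousands suffix
      one_of(" ", "/") %>% maybe,             # optional separator before the unit
      letters %>% maybe %>% capture)          # group 3: optional time unit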
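result =
  Pay %>%
  # split each entry into amount, scale suffix, and time unit
  extract(Hourly.Pay,
          c("wage_raw", "scale", "time_unit_raw"),
          interpretation) %>%
  # extract_numeric() is deprecated in tidyr; readr::parse_number() is its documented replacement
  mutate(wage = wage_raw %>% readr::parse_number(),
         # any amount above 10000 is assumed to be an annual salary
         time_unit =
           (wage > 10000) %>%
           ifelse("year", time_unit_raw) %>%
           stri_trans_tolower) %>%
  left_join(scale, by = "scale") %>%
  left_join(time_unit, by = "time_unit") %>%
  # convert everything to an estimated hourly wage
  mutate(estimated_wage = wage * scale_value / time_value)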
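For readers who only want the narrower pattern described in the question, with each regex character explained, the following base R sketch may help. It is not the answer author's method. The negative lookahead is an added assumption, meant to reject entries such as "$345.00" or "67K/year" whose first two digits are not a complete hourly figure. It uses `regexec(..., perl = TRUE)`; whether `tidyr::extract` accepts the same pattern depends on the regex engine of the installed tidyr version.

```r
# Sample values copied from the question
Hourly.Pay <- c("$14.00", "$14", "$20.22", "$18.00/Hour", "$15", "19/hourly",
                "$40,000", "$345.00", "$1920/month", "$11.25", "12.75 hr",
                "67K/year", "15.25")

# Pattern pieces (backslashes are doubled because this is an R string):
#   ^                  anchor at the start of the string
#   \\$?               an optional literal dollar sign ("$" must be escaped)
#   (                  start of the capture group
#   \\d{2}             exactly two digits
#   (?:\\.\\d{2,})?    optionally: a literal period followed by two or more digits
#   )                  end of the capture group
#   (?![\\d.,K])       assumption: the next character must not be a digit,
#                      period, comma, or "K"
pattern <- "^\\$?(\\d{2}(?:\\.\\d{2,})?)(?![\\d.,K])"

# regexec() returns the full match plus capture groups; non-matches give character(0)
groups <- regmatches(Hourly.Pay, regexec(pattern, Hourly.Pay, perl = TRUE))

# Keep capture group 1, or NA where the pattern did not match
Hourly <- vapply(groups,
                 function(g) if (length(g) >= 2) g[2] else NA_character_,
                 character(1))
Hourly <- as.numeric(Hourly)

print(data.frame(Hourly.Pay, Hourly))
```

With this pattern the monthly, annual, and three-digit entries come back as NA rather than being truncated to their first two digits.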