Using tidyr extract and regex to clean a messy data frame column containing wages and salaries

Time: 2016-05-09 03:55:56

Tags: regex r tidyr

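I'm struggling with regular expressions. I'm new to regex, and I've created a basic example data frame below. I'm trying to use tidyr's extract function to pull the hourly wage out of Hourly.Pay into a new column named Hourly.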

Name <- c("Client1","Client2","Client3","Client4","Client5","Client6","Client7","Client8","Client9","Client10","Client11","Client12","Client13")

Hourly.Pay <- c("$14.00","$14","$20.22","$18.00/Hour","$15","19/hourly","$40,000","$345.00","$1920/month","$11.25","12.75 hr","67K/year","15.25")

Pay<-data.frame(Name,Hourly.Pay)

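Below is the regular expression I have so far, and it almost works. I haven't been able to capture entries that have no period after the first two digits. I need to capture an optional dollar sign, then two digits, followed either by a period and at least two more digits, or by no period and no further digits.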

Pay2 <- extract(Pay, Hourly.Pay, "Hourly", "^(\\$?\\d{2}\\.\\d*)",remove=FALSE)
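
For reference, a quick check along these lines lists the entries that the pattern above fails to match. It is only a sketch using base R's grepl on the same vector, and the names current_pattern and matched are illustrative, not part of the original code.

```r
# Same values as the Hourly.Pay column defined above
Hourly.Pay <- c("$14.00", "$14", "$20.22", "$18.00/Hour", "$15", "19/hourly",
                "$40,000", "$345.00", "$1920/month", "$11.25", "12.75 hr",
                "67K/year", "15.25")

# The pattern from the extract() call, without the capture group
current_pattern <- "^\\$?\\d{2}\\.\\d*"

# TRUE where the pattern matches, FALSE where extract() would return NA
matched <- grepl(current_pattern, Hourly.Pay)

# Entries that are currently missed because "\\." is mandatory
missed <- Hourly.Pay[!matched]
print(missed)
```

Every entry without a period directly after the first two digits ends up in missed, which is the behavior described above.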

Any help would be greatly appreciated. If possible, an explanation of the regex characters would be great too.

Thanks!

1 Answer:

Answer 0: (score: 1)

Here is an answer. I took the liberty of cleaning up your data. This is ridiculous; you need better data management.

library(rex)
library(tidyr)
library(dplyr)
library(stringi)

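# Lookup table: multiplier for an optional "K" suffix
# (tibble() replaces the deprecated data_frame())
scale = tibble(scale = c("K", ""),
               scale_value = c(1000, 1) )

# Lookup table: hours per pay period, apparently assuming a 40-hour week and a 30-day month
time_unit = tibble(time_unit = c("", "hr", "hour", "hourly", "month", "year"),
                   time_value = c(1, 1, 1, 1, 40/7*30, 40/7*30*12) )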

interpretation = 
  rex("$" %>% maybe,
      some_of(number, ".", ",") %>% capture,
      "K" %>% maybe %>% capture,
      one_of(" ", "/") %>% maybe,
      letters %>% maybe %>% capture )

result = 
  Pay %>%
  extract(Hourly.Pay, 
          c("wage_raw", "scale", "time_unit_raw"), 
          interpretation) %>%
  mutate(wage = readr::parse_number(wage_raw),  # replaces deprecated extract_numeric(); needs readr installed
         time_unit = 
           (wage > 10000) %>%
           ifelse("year", time_unit_raw) %>%
           stri_trans_tolower) %>%
  left_join(scale) %>%
  left_join(time_unit) %>%
  mutate(estimated_wage = wage * scale_value / time_value)
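
Since the question also asked for an explanation of the regex characters, here is a minimal base R sketch of a plain pattern that makes the decimal part optional. It is not part of the rex approach above. The lookahead that rejects values such as "$40,000" or "$1920/month" is an added assumption about what should count as an hourly wage, and the names pattern, m and hourly are illustrative.

```r
Hourly.Pay <- c("$14.00", "$14", "$20.22", "$18.00/Hour", "$15", "19/hourly",
                "$40,000", "$345.00", "$1920/month", "$11.25", "12.75 hr",
                "67K/year", "15.25")

# ^              start of the string
# \\$?           an optional literal dollar sign
# (              start of the capture group (the part we keep)
#   \\d{2}       exactly two digits
#   (?:          non-capturing group for the decimal part ...
#     \\.\\d{2,} a literal period followed by two or more digits
#   )?           ... which is optional as a whole
# )              end of the capture group
# (?![\\d,])     assumption: the next character must not be a digit or a comma,
#                so "$345.00", "$1920/month" and "$40,000" are not matched
pattern <- "^\\$?(\\d{2}(?:\\.\\d{2,})?)(?![\\d,])"

# regexec() returns the full match plus capture groups; perl = TRUE enables lookahead
m <- regmatches(Hourly.Pay, regexec(pattern, Hourly.Pay, perl = TRUE))

# Take the first capture group, or NA where nothing matched
hourly <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
                 character(1))
hourly <- as.numeric(hourly)

print(data.frame(Hourly.Pay, Hourly = hourly))
```

Note that "67K/year" still yields 67 under this pattern, because nothing in it inspects the "K" or the time unit. Handling those cases is what the scale and time_unit lookup tables above are for.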