Extracting numeric values from strings in a column using R

Time: 2016-10-14 15:20:03

Tags: r string split

I have a table with the following column structure:

Name                                                Type
Urgent Care (Revenue Code: 0456)                    Per Case
IV Therapy (Revenue Codes 0260, 0269)               Per Visit
Oncology Treatment (Revenue Codes: 0280, 0289)      Per Visit

I would like to extract the numeric revenue codes from the Name column so that the table looks like this:

Name                     Rev Code      Type
Urgent Care              0456          Per Case
IV Therapy               0260, 0269    Per Visit
Oncology Treatment       0280, 0289    Per Visit

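The raw data in the Name column is inconsistent: the word "Code" can be followed by ":", whitespace, or "-". I therefore tried to use a regular expression to find the first digit and split the column there.

I tried a regular expression that matches a digit together with separate() from the tidyr package: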

library(tidyr)
separate(mydata, Name, into = c("Name", "Rev Code"), sep = "[[:digit:]]")

This splits the column at the right place, but the "Rev Code" column ends up blank. I am relatively new to R and would appreciate any help!
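A likely explanation for the blank column, offered as a sketch rather than a definitive diagnosis: `sep` describes the delimiter, so every character matching `[[:digit:]]` is consumed as a separator instead of being kept. Base R's `strsplit()` is used below as a stand-in to show the resulting pieces. The assumption is that `separate()` splits in a comparable way and keeps only the first two pieces for `into`.

```r
# One value from the question's Name column
x <- "Urgent Care (Revenue Code: 0456)"

# Each digit acts as a delimiter and is removed from the result
pieces <- strsplit(x, "[[:digit:]]")[[1]]
print(pieces)

# The second piece is the empty string between the "0" and the "4",
# which would explain an empty "Rev Code" column
second_piece <- pieces[2]
```

Because the digits themselves are discarded by the split, a capture-based approach is a better fit for this task than splitting on a delimiter.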

Data:

structure(list(
Name = c("Urgent Care (Revenue Code: 0456)", "IV Therapy (Revenue Codes 0260, 0269)", 
"Oncology Treatment (Revenue Codes: 0280, 0289)"), 
Type = c("Per Case", "Per Visit", "Per Visit")), 
.Names = c("Name", "Type"), row.names = 1:3, class = "data.frame")

3 answers:

Answer 0 (score: 2)

# no space after the separating commas, so Type does not pick up a leading blank
read.table(header=TRUE, stringsAsFactors=FALSE, sep=",", text='Name,Type
"Urgent Care (Revenue Code: 0456)","Per Case"
"IV Therapy (Revenue Codes 0260, 0269)","Per Visit"
"Oncology Treatment (Revenue Codes: 0280, 0289)","Per Visit"') -> df

library(stringi)
library(dplyr)
library(purrr)

extract_codes <- function(x) {
  stri_match_all_regex(x, "[[:digit:]]+") %>% # extract every run of digits
    map_chr(~paste0(as.vector(.), collapse=", ")) # paste them back together; map_chr() returns a character vector, not a list
}

mutate(df, `Rev Code`=extract_codes(Name))

Answer 1 (score: 1)

We can try extract:

library(tidyr)
extract(df1, Name, into = c("Name", "RevCode"), "([^(]+)\\s*[^0-9]+([0-9].*).")

#               Name    RevCode      Type
#1        Urgent Care       0456  Per Case
#2         IV Therapy 0260, 0269 Per Visit
#3 Oncology Treatment 0280, 0289 Per Visit

Since the OP commented that other patterns exist:

extract(df2, Name, into = c("Name", "RevCode"), "([^(]+)\\s*[^0-9]+([0-9].*).")
#                 Name         RevCode      Type
#1        Urgent Care             0456  Per Case
#2         IV Therapy       0260, 0269 Per Visit
#3 Oncology Treatment       0280, 0289 Per Visit
#4     Speech Therapy  0440-0444, 0449 Per Visit
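To see what each capture group in that pattern picks up, here is a small base R sketch on a single string. It uses `regexec()` with `perl = TRUE` purely for illustration; that tidyr applies the pattern with identical matching rules is an assumption.

```r
x <- "Speech Therapy (Revenue Codes: 0440-0444, 0449)"
pattern <- "([^(]+)\\s*[^0-9]+([0-9].*)."

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

name_part <- parts[2]  # everything before "(", trailing space included
code_part <- parts[3]  # from the first digit up to, but excluding, ")"

print(trimws(name_part))
print(code_part)
```

The first group is greedy, so it takes the space before the parenthesis and leaves nothing for `\\s*`. Wrapping the resulting Name column in `trimws()` removes that trailing space if it matters.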

Data

df2 <- structure(list(Name = c("Urgent Care (Revenue Code: 0456)", 
 "IV Therapy (Revenue Codes 0260, 0269)", 
"Oncology Treatment (Revenue Codes: 0280, 0289)", 
"Speech Therapy (Revenue Codes: 0440-0444, 0449)"
), Type = c("Per Case", "Per Visit", "Per Visit", "Per Visit"
)), .Names = c("Name", "Type"), class = "data.frame", row.names = c(NA, 
-4L))

Answer 2 (score: 1)

Without extra packages:

> data.frame(Name=gsub("\\(.*\\)", "", df$Name),
            RevCode=regmatches(df$Name, regexpr("[[:digit:]]+(\\,[[:space:]][[:digit:]]+)?", df$Name)),
            Type=df$Type)
                 Name    RevCode      Type
1        Urgent Care        0456  Per Case
2         IV Therapy  0260, 0269 Per Visit
3 Oncology Treatment  0280, 0289 Per Visit
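Note that the pattern above only allows an optional second code after a comma, so for a range such as "0440-0444, 0449" it would stop at "0440". A possible base R variant is sketched below. The broader pattern, the `NA` handling, and the "Lab Work" row are assumptions added for illustration and are not part of the original data.

```r
df <- data.frame(
  Name = c("Urgent Care (Revenue Code: 0456)",
           "IV Therapy (Revenue Codes 0260, 0269)",
           "Oncology Treatment (Revenue Codes: 0280, 0289)",
           "Speech Therapy (Revenue Codes: 0440-0444, 0449)",
           "Lab Work"),  # hypothetical row without any code
  Type = c("Per Case", "Per Visit", "Per Visit", "Per Visit", "Per Visit"),
  stringsAsFactors = FALSE
)

# A digit, then any run of digits, commas, spaces, or hyphens, ending on a digit
# (assumption: every code list contains at least two digits)
pattern <- "[0-9][0-9, -]*[0-9]"

m <- regexpr(pattern, df$Name)
rev_code <- rep(NA_character_, nrow(df))   # NA where nothing matched
rev_code[m > 0] <- regmatches(df$Name, m)  # regmatches() skips non-matches

result <- data.frame(
  Name = trimws(sub("\\(.*$", "", df$Name)),  # drop the parenthesised part
  RevCode = rev_code,
  Type = df$Type,
  stringsAsFactors = FALSE
)
print(result)
```

Filling a pre-sized `NA` vector keeps the columns the same length when a row has no code, because `regmatches()` with `regexpr()` returns only the elements that matched.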