I have a table with the following column structure:
Name                                              Type
Urgent Care (Revenue Code: 0456)                  Per Case
IV Therapy (Revenue Codes 0260, 0269)             Per Visit
Oncology Treatment (Revenue Codes: 0280, 0289)    Per Visit
I would like to extract the numeric revenue codes from the Name column so that the table looks like this:
Name                  Rev Code      Type
Urgent Care           0456          Per Case
IV Therapy            0260, 0269    Per Visit
Oncology Treatment    0280, 0289    Per Visit
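The raw data in the Name column is inconsistent: the word "Code" may be followed by ":", a blank space, or "-", so I tried to use a regular expression to find the first digit and split the column there.
I tried a regular expression that searches for the first digit, together with separate() from the tidyr package: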
library(tidyr)
separate(mydata, Name, into = c("Name", "Rev Code"), sep = "[[:digit:]]")
This splits the column in the right place, but the "Rev Code" column ends up blank. I am fairly new to R and would appreciate any help!
structure(list(
Name = c("Urgent Care (Revenue Code: 0456)", "IV Therapy (Revenue Codes 0260, 0269)",
"Oncology Treatment (Revenue Codes: 0280, 0289)"),
Type = c("Per Case", "Per Visit", "Per Visit")),
.Names = c("Name", "Type"), row.names = 1:3, class = "data.frame")
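A note on the blank column: the pattern "[[:digit:]]" matches every single digit, so each digit acts as its own separator and the piece between two adjacent digits is an empty string. The sketch below reproduces that with base R's strsplit(); it is an illustration of the splitting behaviour, not tidyr's internal code, and the variable names x and pieces are made up for the example. separate() presumably keeps only the first two pieces, the second of which is empty.

```r
# One value from the Name column
x <- "Urgent Care (Revenue Code: 0456)"

# Split on every single digit, as the sep pattern in the question does
pieces <- strsplit(x, "[[:digit:]]")[[1]]

# The text before the first digit, three empty strings between the
# adjacent digits, and the closing parenthesis
print(pieces)
```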
Answer 0 (score: 2)
read.table(header=TRUE, stringsAsFactors=FALSE, sep=",", text='Name,Type
"Urgent Care (Revenue Code: 0456)", "Per Case"
"IV Therapy (Revenue Codes 0260, 0269)","Per Visit"
"Oncology Treatment (Revenue Codes: 0280, 0289)", "Per Visit"') -> df
library(stringi)
library(dplyr)
library(purrr)
extract_codes <- function(x) {
  stri_extract_all_regex(x, "[[:digit:]]+") %>%   # extract every run of digits
    map_chr(~paste0(., collapse = ", "))          # paste them back together, one string per row
}
mutate(df, `Rev Code` = extract_codes(Name))
Answer 1 (score: 1)
We can try extract() from tidyr (df1 here is the data frame from the question):
library(tidyr)
extract(df1, Name, into = c("Name", "RevCode"), "([^(]+)\\s*[^0-9]+([0-9].*).")
# Name RevCode Type
#1 Urgent Care 0456 Per Case
#2 IV Therapy 0260, 0269 Per Visit
#3 Oncology Treatment 0280, 0289 Per Visit
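Since the OP commented that other patterns exist, here is the same call on df2, which has one more row (data below):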
extract(df2, Name, into = c("Name", "RevCode"), "([^(]+)\\s*[^0-9]+([0-9].*).")
# Name RevCode Type
#1 Urgent Care 0456 Per Case
#2 IV Therapy 0260, 0269 Per Visit
#3 Oncology Treatment 0280, 0289 Per Visit
#4 Speech Therapy 0440-0444, 0449 Per Visit
df2 <- structure(list(Name = c("Urgent Care (Revenue Code: 0456)",
"IV Therapy (Revenue Codes 0260, 0269)",
"Oncology Treatment (Revenue Codes: 0280, 0289)",
"Speech Therapy (Revenue Codes: 0440-0444, 0449)"
), Type = c("Per Case", "Per Visit", "Per Visit", "Per Visit"
)), .Names = c("Name", "Type"), class = "data.frame", row.names = c(NA,
-4L))
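To see what the two capture groups pick up, here is a minimal base R sketch that applies the same pattern with regexec() and regmatches(). It uses perl = TRUE as a stand-in for the regex engine behind extract(), which is an assumption; the names pattern, x, parts, name and rev_code are only for illustration.

```r
pattern <- "([^(]+)\\s*[^0-9]+([0-9].*)."
x <- c("Urgent Care (Revenue Code: 0456)",
       "Speech Therapy (Revenue Codes: 0440-0444, 0449)")

# Each list element holds the full match followed by the two groups
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
parts <- do.call(rbind, m)

# Group 1 is everything before "(" and so keeps the trailing space;
# trimws() removes it
name <- trimws(parts[, 2])

# Group 2 runs from the first digit up to, but not including, the last character
rev_code <- parts[, 3]

print(data.frame(name, rev_code))
```

In this sketch the first group still carries the space before the opening parenthesis, so a trimws() on the resulting Name column may be worth adding.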
Answer 2 (score: 1)
Without any additional packages:
> data.frame(Name=gsub("\\(.*\\)", "", df$Name),
RevCode=regmatches(df$Name, regexpr("[[:digit:]]+(\\,[[:space:]][[:digit:]]+)?", df$Name)),
Type=df$Type)
Name RevCode Type
1 Urgent Care 0456 Per Case
2 IV Therapy 0260, 0269 Per Visit
3 Oncology Treatment 0280, 0289 Per Visit
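The pattern above matches at most two comma-separated codes, so a range such as "0440-0444, 0449" would be cut short. A possible base R variation, sketched under the assumption that the codes always run from the first digit to the closing parenthesis, is shown below; rev_code, name and result are illustrative names.

```r
df2 <- data.frame(
  Name = c("Urgent Care (Revenue Code: 0456)",
           "IV Therapy (Revenue Codes 0260, 0269)",
           "Oncology Treatment (Revenue Codes: 0280, 0289)",
           "Speech Therapy (Revenue Codes: 0440-0444, 0449)"),
  Type = c("Per Case", "Per Visit", "Per Visit", "Per Visit"),
  stringsAsFactors = FALSE
)

# Keep everything from the first digit up to the closing parenthesis
rev_code <- sub("^[^0-9]*([0-9][^)]*)\\).*$", "\\1", df2$Name)

# Drop the parenthesised part and the space that precedes it
name <- trimws(sub("\\(.*$", "", df2$Name))

result <- data.frame(Name = name, RevCode = rev_code, Type = df2$Type,
                     stringsAsFactors = FALSE)
print(result)
```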