This df1 data frame looks very similar to what I work with in real life (two columns):
df1 <- data.frame(provider = c("LeBron James, MD",
"Peyton Manning, DDS",
"Mike Trout, DO"),
cpt_codes = c("This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group",
"Overutilization of visits per patient for E0781-RR-59 and J1100!",
"High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%"))
print(df1)
# provider cpt_codes
#1 LeBron James, MD This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group
#2 Peyton Manning, DDS Overutilization of visits per patient for E0781-RR-59 and J1100!
#3 Mike Trout, DO High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%
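I need to extract from the cpt_codes field every block of characters that is exactly 5 alphanumeric characters long and ends in a digit (0-9). I then need to match each block back to the provider field, with one unique row per provider / cpt_code combination. The final result should look like this: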
# provider cpt_codes
#1 LeBron James, MD 99284
#2 LeBron James, MD 99282
#3 LeBron James, MD 99285
#4 Peyton Manning, DDS E0781
#5 Peyton Manning, DDS J1100
#6 Mike Trout, DO 29581
#7 Mike Trout, DO 93990
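Through research I found some very good Stack Overflow questions and answers about text strings in R, which let me piece together the solution below. It gets me what I want, but it seems overly complicated. I would like to see whether anyone can produce the 'final' output in a more concise way.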
library(stringr)
#replace all punctuation with spaces in the text strings
df1$cpt_codes <- str_replace_all(df1$cpt_codes, "[[:punct:]]", " ")
#identifies all 5 character blocks in the text strings
t <- str_extract_all(df1$cpt_codes, "\\b[a-zA-Z0-9]{5,5}\\b")
#makes a new data frame that keeps only the 5 character blocks ending in a numeric char
fn <- c(0:9)
cpts <- function(x) {
t1 <- subset(t[[x]], grepl(paste(fn, collapse = "|"), substr(t[[x]], 5, 5)) == TRUE)
data.frame(id = rep(x, length(t1)), cpt_codes = t1)
}
t2 <- do.call("rbind", (lapply(c(1:length(t)), function(x) cpts(x))))
#creates an "id" field on the df1
df1$id <- c(1:nrow(df1))
df3 <- df1[, -2]
final <- merge(df3, t2, by = "id")
#drops the helper "id" column; the result must be assigned back to take effect
final <- final[, -1]
print(final)
# provider cpt_codes
#1 LeBron James, MD 99284
#2 LeBron James, MD 99282
#3 LeBron James, MD 99285
#4 Peyton Manning, DDS E0781
#5 Peyton Manning, DDS J1100
#6 Mike Trout, DO 29581
#7 Mike Trout, DO 93990
Answer 0 (score: 3)
You can try the regex \\b\\w{4}\\d\\b. Also, I believe [[:punct:]] characters count as word boundaries, so you do not have to replace them with spaces first.
library(dplyr); library(tidyr); library(stringr)
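# name the list column explicitly; calling unnest() with no columns is deprecated in tidyr >= 1.0
df1 %>% mutate(cpt_codes = str_extract_all(cpt_codes, "\\b\\w{4}\\d\\b")) %>% unnest(cpt_codes)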
# provider cpt_codes
# 1 LeBron James, MD 99284
# 2 LeBron James, MD 99282
# 3 LeBron James, MD 99285
# 4 Peyton Manning, DDS E0781
# 5 Peyton Manning, DDS J1100
# 6 Mike Trout, DO 29581
# 7 Mike Trout, DO 93990
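As a quick check of the word-boundary claim above, here is a minimal sketch that runs the same pattern on two of the raw strings without stripping punctuation first. The variable names (x, codes, loose, strict) and the "AB_12" sample are made up for illustration. One caveat worth knowing: \w also matches the underscore, so if underscores can appear in your text, an explicit character class may be the safer choice.

```r
library(stringr)

# Raw strings from the question, punctuation left in place
x <- c("Overutilization of visits per patient for E0781-RR-59 and J1100!",
       "High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%")

# \b treats "-", "!", ":" and "%" as boundaries, so no pre-cleaning is needed
codes <- str_extract_all(x, "\\b\\w{4}\\d\\b")
print(codes)

# Hypothetical sample: \w accepts "_", an explicit class does not
loose  <- str_extract_all("AB_12 99284", "\\b\\w{4}\\d\\b")[[1]]
strict <- str_extract_all("AB_12 99284", "\\b[A-Za-z0-9]{4}\\d\\b")[[1]]
print(loose)   # includes "AB_12"
print(strict)  # only "99284"
```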
Answer 1 (score: 2)
This can be done in base R using gregexpr() and regmatches(), as follows:
cn <- 'cpt_codes';
m <- regmatches(df1[[cn]],gregexpr('[a-zA-Z0-9]{4}[0-9]',as.character(df1[[cn]])));
res <- df1[rep(seq_along(m),lengths(m)),setdiff(names(df1),cn),drop=F];
res[[cn]] <- unlist(m);
res;
## provider cpt_codes
## 1 LeBron James, MD 99284
## 1.1 LeBron James, MD 99282
## 1.2 LeBron James, MD 99285
## 2 Peyton Manning, DDS E0781
## 2.1 Peyton Manning, DDS J1100
## 3 Mike Trout, DO 29581
## 3.1 Mike Trout, DO 93990
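Note that the pattern above has no word boundaries. It works on the sample data, but it would also pick up the first five characters of a longer token. The sketch below compares it with a boundary-anchored variant on a made-up input; the strings in x and the names unanchored and anchored are illustrative assumptions, not part of the original answer.

```r
# Hypothetical input containing a token longer than 5 characters
x <- c("code ABC12345 and 99284", "E0781-RR-59")

unanchored <- "[a-zA-Z0-9]{4}[0-9]"        # pattern used in the answer
anchored   <- "\\b[a-zA-Z0-9]{4}[0-9]\\b"  # same pattern with word boundaries

m_unanchored <- regmatches(x, gregexpr(unanchored, x))
m_anchored   <- regmatches(x, gregexpr(anchored, x, perl = TRUE))

print(m_unanchored[[1]])  # "ABC12" "99284": matches inside the longer token
print(m_anchored[[1]])    # "99284" only
```

If your real text never contains alphanumeric runs longer than five characters, the two patterns should behave the same.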
Answer 2 (score: 2)
A data.table solution:
df1 <- data.frame(provider = c("LeBron James, MD",
"Peyton Manning, DDS",
"Mike Trout, DO"),
cpt_codes = c("This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group",
"Overutilization of visits per patient for E0781-RR-59 and J1100!",
"High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%"))
library(data.table)
library(stringr)
ddt <- as.data.table(df1)
ddt[, str_extract_all(cpt_codes, "\\b\\w{4}\\d\\b"), provider]
#               provider    V1
# 1:    LeBron James, MD 99284
# 2:    LeBron James, MD 99282
# 3:    LeBron James, MD 99285
# 4: Peyton Manning, DDS E0781
# 5: Peyton Manning, DDS J1100
# 6:      Mike Trout, DO 29581
# 7:      Mike Trout, DO 93990
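The result column above comes back with the default name V1. If you want it named cpt_codes as in the desired output, one possible variation is to unlist the matches inside .() and name the column there. This is a sketch under the same assumptions as the answer (data.table and stringr installed); the object names ddt and res are only for illustration.

```r
library(data.table)
library(stringr)

df1 <- data.frame(
  provider = c("LeBron James, MD", "Peyton Manning, DDS", "Mike Trout, DO"),
  cpt_codes = c("This provider because he bills CPT codes 99284, 99282 and 99285 65% more than his peer group",
                "Overutilization of visits per patient for E0781-RR-59 and J1100!",
                "High units per patient compared to the specialty for the following:29581: 146.88% 93990: 33.71%"))

ddt <- as.data.table(df1)

# unlist() flattens the per-group list of matches; .() names the output column
res <- ddt[, .(cpt_codes = unlist(str_extract_all(cpt_codes, "\\b\\w{4}\\d\\b"))),
           by = provider]
print(res)
```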