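I have two data frames. In df1, I have a column of International Classification of Diseases (ICD) diagnosis codes (df1$PriDiag), along with other information.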
#df1
PriDiag = c("A051","A067","A161","A242","A459")
Admissions = c("106","79","67","50","41")
Pts = c("97","27","45","30","20")
df1 = data.frame(PriDiag,Admissions,Pts)
df1
  PriDiag Admissions Pts
1    A051        106  97
2    A067         79  27
3    A161         67  45
4    A242         50  30
5    A459         41  20
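In the other data frame (df2), I have the start (df2$Start) and end (df2$End) limits of each ICD subcategory, along with the corresponding description (df2$Description).
#df2
Start = c("A00","A15","A20","A30")
End = c("A09","A19","A28","A49")
Description = c("Intestinal infectious diseases","Tuberculosis","Certain zoonotic bacterial","Other bacterial diseases")
df2 = data.frame(Start,End,Description)
df2
  Start End                    Description
1   A00 A09 Intestinal infectious diseases
2   A15 A19                   Tuberculosis
3   A20 A28     Certain zoonotic bacterial
4   A30 A49       Other bacterial diseases
What I want to do is add a new column to df1 containing the subcategory description (df2$Description) for each code (df1$PriDiag). I could do this if the codes were numeric rather than character, but I am struggling to find a quick solution. Is there a way to search between character values?
The result I want is a new data frame, df3, that is df1 with the matching Description column added.
How can I do this?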
Answer 0 (score: 0)
Try this:
library(sqldf)
sqldf("select df1.*, df2.Description
from df1
left join df2
on PriDiag between Start and End"
)
which gives:
  PriDiag Admissions Pts                    Description
1    A051        106  97 Intestinal infectious diseases
2    A067         79  27 Intestinal infectious diseases
3    A161         67  45                   Tuberculosis
4    A242         50  30     Certain zoonotic bacterial
5    A459         41  20       Other bacterial diseases
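If you would rather not depend on sqldf, the same range match can be sketched in base R, since R also compares character strings in sort order. This is only an illustration under one assumption: the first three characters of PriDiag identify the category. The helper name find_block is made up for this example.

```r
# Sample data from the question
df1 <- data.frame(
  PriDiag    = c("A051", "A067", "A161", "A242", "A459"),
  Admissions = c("106", "79", "67", "50", "41"),
  Pts        = c("97", "27", "45", "30", "20"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Start = c("A00", "A15", "A20", "A30"),
  End   = c("A09", "A19", "A28", "A49"),
  Description = c("Intestinal infectious diseases", "Tuberculosis",
                  "Certain zoonotic bacterial", "Other bacterial diseases"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the row in `ranges` whose Start..End block
# contains `code`, or NA when no block matches.
find_block <- function(code, ranges) {
  # Assumption: the first three characters identify the category
  category <- substr(code, 1, 3)
  hit <- which(category >= ranges$Start & category <= ranges$End)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

idx <- vapply(df1$PriDiag, find_block, integer(1), ranges = df2,
              USE.NAMES = FALSE)
df3 <- cbind(df1, Description = df2$Description[idx],
             stringsAsFactors = FALSE)
df3
```

Truncating to three characters matters at the upper edge of a range: as a plain string, a code such as A091 sorts after A09, so an untruncated "between Start and End" comparison would leave it unmatched.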
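Answer 1 (score: 0)
This makes some assumptions about your data that may not be correct. It can be adjusted if your data is not as straightforward as it looks, but the path of least resistance is my favorite.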
library(qdap)
## Create a list key based on ranges
key <- setNames(lapply(1:nrow(df2), function(i) {
    paste0(strtrim(df2[i, 1], 1),
        pad(substring(df2[i, 1], 2):substring(df2[i, 2], 2), 2))
}), df2[, 3])
## Assuming that last digit isn't important use qdap's lookup function (%l%)
df1[, "Description"] <- strtrim(df1[, 1], 3) %l% key
##   PriDiag Admissions Pts                    Description
## 1    A051        106  97 Intestinal infectious diseases
## 2    A067         79  27 Intestinal infectious diseases
## 3    A161         67  45                   Tuberculosis
## 4    A242         50  30     Certain zoonotic bacterial
## 5    A459         41  20       Other bacterial diseases
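For comparison, here is a rough base R sketch of the same idea (expand each range into its individual three-character codes, then look each diagnosis up) without qdap. It assumes Start and End in a row share the same leading letter and that the last digit of PriDiag can be ignored. The names expand_range and lookup are invented for this example.

```r
# Sample data from the question
df1 <- data.frame(
  PriDiag    = c("A051", "A067", "A161", "A242", "A459"),
  Admissions = c("106", "79", "67", "50", "41"),
  Pts        = c("97", "27", "45", "30", "20"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Start = c("A00", "A15", "A20", "A30"),
  End   = c("A09", "A19", "A28", "A49"),
  Description = c("Intestinal infectious diseases", "Tuberculosis",
                  "Certain zoonotic bacterial", "Other bacterial diseases"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: expand one range into its three-character codes,
# e.g. "A00".."A09" becomes "A00", "A01", ..., "A09".
# Assumption: start and end share the same leading letter.
expand_range <- function(start, end) {
  letter <- substr(start, 1, 1)
  nums <- as.integer(substr(start, 2, 3)):as.integer(substr(end, 2, 3))
  sprintf("%s%02d", letter, nums)
}

# One lookup row per individual code, carrying its range's description
codes <- Map(expand_range, df2$Start, df2$End)
lookup <- data.frame(
  Code = unlist(codes, use.names = FALSE),
  Description = rep(df2$Description, lengths(codes)),
  stringsAsFactors = FALSE
)

# Match on the first three characters of each diagnosis code
df1$Description <- lookup$Description[
  match(substr(df1$PriDiag, 1, 3), lookup$Code)
]
df1
```

Codes that fall outside every range get NA in the Description column, which makes gaps in df2 easy to spot.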