My data looks like this:
HH_ID INDUSTRY FREQUENCY
102 CLERK 4
102 NURSE 2
102 NOT APPLICABLE 2
103 NURSE 3
103 NOT APPLICABLE 1
104 NOT APPLICABLE 2
104 NOT APPLICABLE 1
I want to remove the NOT APPLICABLE rows only for an HH_ID that also has other values associated with it, such as CLERK or NURSE. I want output that looks like this:
HH_ID INDUSTRY FREQUENCY
102 CLERK 4
102 NURSE 2
103 NURSE 3
104 NOT APPLICABLE 2
I want to produce this kind of output in R. I have already made an attempt with the data.
Answer 0: (score: 1)
You can split the data by HH_ID and subset each piece so that it keeps only the rows you want in the result:
d <- data.frame(
  HH_ID = c(rep(102, 3), 103, 103, 104, 104),
  INDUSTRY = factor(c('CLERK', 'NURSE', 'NOT APPLICABLE', 'NURSE', rep('NOT APPLICABLE', 3))),
  FREQUENCY = c(4, 2, 2, 3, 1, 2, 1)
)
library(plyr)
d2 <- ldply(split(d, d$HH_ID), function(d_tmp) {
  if (all(d_tmp$INDUSTRY == 'NOT APPLICABLE')) {
    # only NOT APPLICABLE for this HH_ID: keep the first row
    d_tmp[1, ]
  } else {
    # otherwise drop the NOT APPLICABLE rows
    d_tmp[d_tmp$INDUSTRY != 'NOT APPLICABLE', ]
  }
})[, -1]  # drop the .id column added by ldply
...which should give you the data you want:
> print(d2)
HH_ID INDUSTRY FREQUENCY
1 102 CLERK 4
2 102 NURSE 2
3 103 NURSE 3
4 104 NOT APPLICABLE 2
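PS: When only NOT APPLICABLE is associated with an HH_ID, you also seem to want all of its instances collapsed into a single one. If that is not the case, replace d_tmp[1,] with d_tmp inside the if(){...} above.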
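For the variant described in the PS (keeping every NOT APPLICABLE row when an HH_ID has nothing else), here is a minimal base R sketch. It is not the plyr approach from this answer: it swaps in `ave()` to flag, per HH_ID, whether all rows are NOT APPLICABLE. The names `is_na_row`, `only_na` and `d_all` are made up for illustration, and INDUSTRY is assumed to be stored as character.

```r
# Same example data as in the question (INDUSTRY kept as character, an assumption)
d <- data.frame(
  HH_ID     = c(102, 102, 102, 103, 103, 104, 104),
  INDUSTRY  = c("CLERK", "NURSE", "NOT APPLICABLE", "NURSE",
                "NOT APPLICABLE", "NOT APPLICABLE", "NOT APPLICABLE"),
  FREQUENCY = c(4, 2, 2, 3, 1, 2, 1),
  stringsAsFactors = FALSE
)

# TRUE for rows whose INDUSTRY is NOT APPLICABLE
is_na_row <- d$INDUSTRY == "NOT APPLICABLE"

# TRUE for every row whose HH_ID has nothing but NOT APPLICABLE rows
only_na <- ave(is_na_row, d$HH_ID, FUN = all)

# Keep real industries, plus NOT APPLICABLE rows of HH_IDs that have nothing else
d_all <- d[!is_na_row | only_na, ]
rownames(d_all) <- NULL
print(d_all)
```

With the example data this keeps both NOT APPLICABLE rows for HH_ID 104 and drops the ones for 102 and 103.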
Answer 1: (score: 1)
Using data.table:
library(data.table)
# df is assumed to hold the data shown in the question
setDT(df)
# get a subset of the data that is different from "NOT APPLICABLE"
df1 <- df[INDUSTRY != "NOT APPLICABLE"]
# subset only "NOT APPLICABLE" rows where HH_ID is not present in df1 and
# keep, per HH_ID, only the row with the highest FREQUENCY
df2 <- df[INDUSTRY == "NOT APPLICABLE" & !(HH_ID %in% df1$HH_ID)][, .SD[which.max(FREQUENCY)], by = HH_ID]
# bind the two data sets
output <- rbind(df1, df2)
output
#> HH_ID INDUSTRY FREQUENCY
#> 1: 102 CLERK 4
#> 2: 102 NURSE 2
#> 3: 103 NURSE 3
#> 4: 104 NOT APPLICABLE 2
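The same rule can also be written as a single grouped step, which may be easier to check when several HH_IDs have only NOT APPLICABLE rows. This is a sketch under assumptions rather than the answer's own code: the table is rebuilt inline from the question's data so the block runs on its own, and `keep` is a name chosen here for illustration.

```r
library(data.table)

# Example data from the question, rebuilt inline
df <- data.table(
  HH_ID     = c(102, 102, 102, 103, 103, 104, 104),
  INDUSTRY  = c("CLERK", "NURSE", "NOT APPLICABLE", "NURSE",
                "NOT APPLICABLE", "NOT APPLICABLE", "NOT APPLICABLE"),
  FREQUENCY = c(4, 2, 2, 3, 1, 2, 1)
)

# For each HH_ID: keep the real industries if there are any;
# otherwise keep the single NOT APPLICABLE row with the highest FREQUENCY
output <- df[, {
  keep <- INDUSTRY != "NOT APPLICABLE"
  if (any(keep)) .SD[keep] else .SD[which.max(FREQUENCY)]
}, by = HH_ID]

print(output)
```

Note that `which.max()` returns the position of the largest FREQUENCY within the group, so it is used as the row index here; ties resolve to the first such row.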