Data frame manipulation using ddply

Time: 2019-03-16 19:32:46

Tags: r dataframe plyr

I have a data frame named output.

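I want to generate the mode (the most repeated value) of code for each patientID, and also the count of unique code values for each patientID, together with the zipcode.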

I tried:

ddply(output,~zipcode,summarize,max=mode(code))

This code would generate the mode of code for each distinct zipcode, but I want the mode of code for each patientID within each zipcode.


output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO"))
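One thing worth noting about the attempt above: base R's mode() reports the storage mode of an object (such as "character" or "numeric"), not its most frequent value, so the ddply call would not return a statistical mode. A minimal sketch of a most-frequent-value helper follows. The name stat_mode is hypothetical and not part of base R or plyr, and ties are resolved in favour of the value that appears first.

```r
# Hypothetical helper (not part of base R or plyr):
# returns the most frequent value of x; ties go to the value seen first.
stat_mode <- function(x) {
  ux <- unique(x)                         # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]   # count each value, keep the most frequent
}

codes <- c("E78.5", "N08", "E78.5", "I65.29")

mode(codes)       # storage mode of the vector, not a statistical mode
stat_mode(codes)  # most frequent value in the vector
```

A helper like this could replace mode(code) inside a summarize call, assuming one mode per group is enough and the tie-breaking rule is acceptable.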

1 answer:

Answer 0: (score: 0)

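If I understand correctly, you need to find the code with the highest frequency by patientID and zipcode, in which case dplyr may help. I think you just need to use the three columns above as grouping variables and then use summarise to get the count for each group. The row with the highest count is the mode, and the new column gives that count.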

# Your reprex data
output=data.frame(code=c("E78.5","N08","E78.5","I65.29","Z68.29","D64.9"),patientID=c("34423","34423","34423","34423","34424","34425"),zipcode=c(00718,00718,00718,00718,00718,00719),city=c("NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO","NAGUABO")) 

library(dplyr)
output %>% 
  dplyr::group_by(patientID, code, zipcode) %>% 
  dplyr::summarise(mode_freq = n())

# A tibble: 5 x 4
# Groups:   patientID, code [5]
  patientID code   zipcode mode_freq
  <fct>     <fct>    <dbl>     <int>
1 34423     E78.5      718         2
2 34423     I65.29     718         1
3 34423     N08        718         1
4 34424     Z68.29     718         1
5 34425     D64.9      719         1

I included dplyr:: because I assume you already have plyr loaded, in which case the function names would conflict.

Update:

To get the mode output you suggested, which by definition should be the row with the highest frequency:

output %>% 
  group_by(patientID, code, zipcode) %>% 
  summarise(mode_freq = n()) %>%
  ungroup() %>% 
  group_by(zipcode) %>% 
  filter(mode_freq == max(mode_freq))

# A tibble: 2 x 4
# Groups:   zipcode [2]
  patientID code  zipcode mode_freq
  <fct>     <fct>   <dbl>     <int>
1 34423     E78.5     718         2
2 34425     D64.9     719         1
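
If the goal is instead the most frequent code for each patientID within each zipcode, plus the number of distinct codes per patient, one possible sketch is the following. It assumes dplyr is installed. The column names mode_code and n_codes and the object name per_patient are made up for illustration, and ties are resolved by which.max, which keeps the first entry in the table's sort order.

```r
library(dplyr)

# Same sample data as in the question, without the city column
output <- data.frame(
  code      = c("E78.5", "N08", "E78.5", "I65.29", "Z68.29", "D64.9"),
  patientID = c("34423", "34423", "34423", "34423", "34424", "34425"),
  zipcode   = c(00718, 00718, 00718, 00718, 00718, 00719)
)

per_patient <- output %>%
  dplyr::group_by(zipcode, patientID) %>%
  dplyr::summarise(
    mode_code = names(which.max(table(code))),  # most frequent code in the group
    n_codes   = dplyr::n_distinct(code)         # number of distinct codes in the group
  ) %>%
  dplyr::ungroup()

per_patient
```

For the sample data this gives one row per patient. Patient 34423 has E78.5 as the most frequent code and three distinct codes, while the other two patients have a single code each.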