"Flattening" data based on matches between two data frames in R

Time: 2015-04-25 07:42:28

Tags: r

I have two data frames, as follows.

The first is a survey table, which records when each person was surveyed.

ID = c('1000021','1000021')
SurveyDate = c('2014-05-30','2013-05-01')
dfsurvey = data.frame(ID,SurveyDate)
> dfsurvey
              ID  SurveyDate
1        1000021  2014-05-30
2        1000021  2013-05-01

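The second is a hobby table, which records the hobbies a person reported on a given day. A person's hobbies may differ from one day to another.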

ID = c('1000021','1000021','1000021','1000021','1000021','1000021','1000021')
HobbyName = c('Running','Volleyball','Pingpong','Badminton','Swimming','Running','Pingpong')
SurveyDate = c('2014-05-30','2014-05-30','2014-05-30','2014-05-30','2014-05-30','2013-05-01','2013-05-01')
dfhobby = data.frame(ID,HobbyName,SurveyDate)
> dfhobby
       ID   HobbyName  SurveyDate
1 1000021     Running  2014-05-30
2 1000021  Volleyball  2014-05-30
3 1000021    Pingpong  2014-05-30
4 1000021   Badminton  2014-05-30
5 1000021    Swimming  2014-05-30
6 1000021     Running  2013-05-01
7 1000021    Pingpong  2013-05-01

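For the survey table, which has only two rows, I want to add the expanded list of hobbies, with each hobby getting its own column. I call this "flattening". Something like this: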

#expected final output - add columns to dfsurvey
> dfsurvey
       ID  SurveyDate  Hobby_Running  Hobby_Volleyball  Hobby_Pingpong  Hobby_Badminton  Hobby_Swimming
1 1000021  2014-05-30              1                 1               1                1               1
2 1000021  2013-05-01              1                 0               1                0               0

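Here is my code. I first construct the column names, then use a nested for loop to mark a 1 for each hobby the person reported. However, this is very, very slow: one iteration of the nested for loop takes about a second.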

#making columns and setting them to 0 as default
hobbyvalues = unique(dfhobby$HobbyName)
for(i in 1:length(hobbyvalues))
{
    print(i)
    dfsurvey[paste("Hobby_",hobbyvalues[i],sep="")] = 0
}

#flattening iterative
for(i in 1:nrow(dfsurvey))
{
    print(i)

    listofhobbies = dfhobby[which(dfhobby$ID == dfsurvey[i,"ID"] & dfhobby$SurveyDate == dfsurvey[i,"SurveyDate"]),"HobbyName"]

    if(length(listofhobbies) > 0)
    {
        for(l in 1:length(listofhobbies))
        {
            dfsurvey[i,paste("Hobby_",listofhobbies[l],sep="")] = 1
        }
    }
}

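I also tried the foreach and doMC packages and was able to parallelize the code. However, that is slow as well.

Is there a better approach or a library in R that can help me do this? Thanks.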

2 answers:

Answer 0: (score: 3)

> library(reshape2)
> dcast(dfhobby, ID + SurveyDate ~ HobbyName, fun.aggregate = length, fill = 0)

       ID SurveyDate Badminton Pingpong Running Swimming Volleyball
1 1000021 2013-05-01         0        1       1        0          0
2 1000021 2014-05-30         1        1       1        1          1


> dcast(dfhobby, SurveyDate ~ HobbyName, fun.aggregate = length, fill = 0)

  SurveyDate Badminton Pingpong Running Swimming Volleyball
1 2013-05-01         0        1       1        0          0
2 2014-05-30         1        1       1        1          1
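
The question asks for columns named Hobby_* attached to dfsurvey, which the dcast output alone does not give. A minimal sketch of one way to get there, assuming the sample data from the question; the names wide, keys, is_hobby, new_cols and result are illustrative only:

```r
library(reshape2)

# Sample data from the question
dfsurvey <- data.frame(
  ID = c("1000021", "1000021"),
  SurveyDate = c("2014-05-30", "2013-05-01"),
  stringsAsFactors = FALSE
)
dfhobby <- data.frame(
  ID = rep("1000021", 7),
  HobbyName = c("Running", "Volleyball", "Pingpong", "Badminton",
                "Swimming", "Running", "Pingpong"),
  SurveyDate = c(rep("2014-05-30", 5), rep("2013-05-01", 2)),
  stringsAsFactors = FALSE
)

# One row per (ID, SurveyDate), one count column per hobby.
# dcast prints a note about the value column it guesses; with
# fun.aggregate = length the choice of value column does not matter.
wide <- dcast(dfhobby, ID + SurveyDate ~ HobbyName,
              fun.aggregate = length, fill = 0)

# Prefix only the hobby columns, leaving the key columns untouched
keys <- c("ID", "SurveyDate")
is_hobby <- !names(wide) %in% keys
names(wide)[is_hobby] <- paste0("Hobby_", names(wide)[is_hobby])

# Left join so that surveys with no recorded hobbies are kept
result <- merge(dfsurvey, wide, by = keys, all.x = TRUE, sort = FALSE)

# Surveys without any hobby rows come back as NA; turn those into 0
new_cols <- setdiff(names(result), keys)
result[new_cols][is.na(result[new_cols])] <- 0
result
```

Because a hobby listed twice for the same survey would be counted as 2, apply something like `pmin(x, 1)` to the hobby columns if a strict 0/1 flag is needed.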

Answer 1: (score: 1)

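The R packages dplyr and tidyr are designed for exactly this. They also handle large data sets very quickly. For details, see the data manipulation cheat sheet on the RStudio page.
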
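library(dplyr)
library(tidyr)
# Count each (ID, SurveyDate, HobbyName) combination once, then spread
# the hobbies into columns; summarise (rather than mutate) avoids
# duplicate rows if a hobby is listed twice for the same survey
dfhobby %>% group_by(ID, SurveyDate, HobbyName) %>%
    summarise(Count = n()) %>% ungroup() %>%
    spread(HobbyName, Count, fill = 0)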
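
In newer tidyr releases, spread() is superseded by pivot_wider(), which can also add the Hobby_ prefix asked for in the question. The following is a sketch rather than part of the original answer. It assumes tidyr 1.0.0 or later and a dplyr version whose count() accepts the name argument; the names wide and Count are illustrative:

```r
library(dplyr)
library(tidyr)

# Sample data from the question
dfhobby <- data.frame(
  ID = rep("1000021", 7),
  HobbyName = c("Running", "Volleyball", "Pingpong", "Badminton",
                "Swimming", "Running", "Pingpong"),
  SurveyDate = c(rep("2014-05-30", 5), rep("2013-05-01", 2)),
  stringsAsFactors = FALSE
)

wide <- dfhobby %>%
  # One row per (ID, SurveyDate, HobbyName) with its number of occurrences
  count(ID, SurveyDate, HobbyName, name = "Count") %>%
  # One column per hobby; combinations that never occur are filled with 0
  pivot_wider(
    names_from = HobbyName,
    values_from = Count,
    names_prefix = "Hobby_",
    values_fill = list(Count = 0L)
  )
wide
```

The result has one row per ID and SurveyDate, so it can be joined back to dfsurvey with `left_join(dfsurvey, wide, by = c("ID", "SurveyDate"))` if the survey table has other columns to keep.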