I have two data frames, shown below.
The first is a survey table that records when each person was surveyed.
ID = c('1000021','1000021')
SurveyDate = c('2014-05-30','2013-05-01')
dfsurvey = data.frame(ID,SurveyDate)
> dfsurvey
ID SurveyDate
1 1000021 2014-05-30
2 1000021 2013-05-01
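The second is a hobby table that lists the hobbies recorded for a person on a given survey date. A person's hobbies can differ from one date to another.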
ID = c('1000021','1000021','1000021','1000021','1000021','1000021','1000021')
HobbyName = c('Running','Volleyball','Pingpong','Badminton','Swimming','Running','Pingpong')
SurveyDate = c('2014-05-30','2014-05-30','2014-05-30','2014-05-30','2014-05-30','2013-05-01','2013-05-01')
dfhobby = data.frame(ID,HobbyName,SurveyDate)
> dfhobby
ID HobbyName SurveyDate
1 1000021 Running 2014-05-30
2 1000021 Volleyball 2014-05-30
3 1000021 Pingpong 2014-05-30
4 1000021 Badminton 2014-05-30
5 1000021 Swimming 2014-05-30
6 1000021 Running 2013-05-01
7 1000021 Pingpong 2013-05-01
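For the survey table, which has only two rows, I want to add the expanded list of hobbies, with each hobby getting its own column. I call this "flattening". Something like this: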
#expected final output - add columns to dfsurvey
> dfsurvey
ID SurveyDate Hobby_Running Hobby_Volleyball Hobby_Pingpong Hobby_Badminton Hobby_Swimming
1 1000021 2014-05-30 1 1 1 1 1
2 1000021 2013-05-01 1 0 1 0 0
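Here is my code. I first construct the column names, then use nested for loops to mark a 1 against each hobby the person has. However, this is very, very slow: one iteration of the nested for loop takes about a second.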
#making columns and setting them to 0 as default
hobbyvalues = unique(dfhobby$HobbyName)
for(i in 1:length(hobbyvalues))
{
  print(i)
  dfsurvey[paste("Hobby_",hobbyvalues[i],sep="")] = 0
}
#flattening iterative
for(i in 1:nrow(dfsurvey))
{
  print(i)
  listofhobbies = dfhobby[which(dfhobby$ID == dfsurvey[i,"ID"] & dfhobby$SurveyDate == dfsurvey[i,"SurveyDate"]),"HobbyName"]
  if(length(listofhobbies) > 0)
  {
    for(l in 1:length(listofhobbies))
    {
      dfsurvey[i,paste("Hobby_",listofhobbies[l],sep="")] = 1
    }
  }
}
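I also tried the foreach and doMC packages and was able to parallelize the code, but that is slow too.
Is there a better approach or a library in R that can help me do this? Thanks.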
Answer 0 (score: 3)
> library(reshape2)
> dcast(dfhobby,ID*SurveyDate~HobbyName,fill=0,length)
ID SurveyDate Badminton Pingpong Running Swimming Volleyball
1 1000021 2013-05-01 0 1 1 0 0
2 1000021 2014-05-30 1 1 1 1 1
> dcast(dfhobby,SurveyDate~HobbyName,fill=0,length)
SurveyDate Badminton Pingpong Running Swimming Volleyball
1 2013-05-01 0 1 1 0 0
2 2014-05-30 1 1 1 1 1
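Note that dcast with length returns counts in a separate table with unprefixed column names. If the exact layout from the question is wanted (0/1 columns named Hobby_* appended to dfsurvey), the nested loops can be replaced by matrix indexing in base R. The following is only a sketch under a few assumptions that are not stated in the question: each (ID, SurveyDate) pair appears at most once in dfsurvey, the character "|" never occurs in ID or SurveyDate, and the helper names (survey_key, hobby_key, flags, result) are made up for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps columns as character
dfsurvey <- data.frame(
  ID = c("1000021", "1000021"),
  SurveyDate = c("2014-05-30", "2013-05-01"),
  stringsAsFactors = FALSE
)
dfhobby <- data.frame(
  ID = rep("1000021", 7),
  HobbyName = c("Running", "Volleyball", "Pingpong", "Badminton",
                "Swimming", "Running", "Pingpong"),
  SurveyDate = c(rep("2014-05-30", 5), rep("2013-05-01", 2)),
  stringsAsFactors = FALSE
)

# Hobbies in order of first appearance; these become the new columns
hobbies <- unique(dfhobby$HobbyName)

# One composite key per (ID, SurveyDate) pair
# (assumption: "|" does not occur in either column)
survey_key <- paste(dfsurvey$ID, dfsurvey$SurveyDate, sep = "|")
hobby_key  <- paste(dfhobby$ID, dfhobby$SurveyDate, sep = "|")

# Start with an all-zero indicator matrix: one row per survey, one column per hobby
flags <- matrix(0L, nrow = nrow(dfsurvey), ncol = length(hobbies),
                dimnames = list(NULL, paste0("Hobby_", hobbies)))

# Locate the target row and column of every hobby record in one vectorized step
row_idx <- match(hobby_key, survey_key)
col_idx <- match(dfhobby$HobbyName, hobbies)
keep <- !is.na(row_idx)  # ignore hobby rows that have no matching survey

# A two-column index matrix sets all matching cells to 1 at once
flags[cbind(row_idx[keep], col_idx[keep])] <- 1L

result <- cbind(dfsurvey, flags)
print(result)
```

Because match() and matrix indexing are vectorized, no per-row subsetting of dfhobby is needed, which is where the original loops spend their time.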
Answer 1 (score: 1)
The R packages dplyr and tidyr are designed for exactly this kind of task. They also handle large data sets very quickly. For details, see the RStudio page.
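library(dplyr)
library(tidyr)
# dfhobby is the hobby table from the question
dfhobby %>% group_by(ID,SurveyDate,HobbyName) %>%
  mutate(Count = n()) %>%   # number of rows per ID/SurveyDate/HobbyName
  ungroup() %>%             # drop the grouping before reshaping
  spread(HobbyName,Count,fill=0)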
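In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a hedged sketch of the same idea with the newer function, not part of the original answer; it assumes dplyr and tidyr (>= 1.0.0) are installed, and the names Flag, wide and result are made up for illustration. It also prefixes the columns with Hobby_ and joins them back onto dfsurvey, as the question asks.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; stringsAsFactors = FALSE keeps keys as character
dfsurvey <- data.frame(
  ID = c("1000021", "1000021"),
  SurveyDate = c("2014-05-30", "2013-05-01"),
  stringsAsFactors = FALSE
)
dfhobby <- data.frame(
  ID = rep("1000021", 7),
  HobbyName = c("Running", "Volleyball", "Pingpong", "Badminton",
                "Swimming", "Running", "Pingpong"),
  SurveyDate = c(rep("2014-05-30", 5), rep("2013-05-01", 2)),
  stringsAsFactors = FALSE
)

wide <- dfhobby %>%
  distinct(ID, SurveyDate, HobbyName) %>%   # guard against duplicate hobby rows
  mutate(Flag = 1L) %>%                     # indicator value for "has this hobby"
  pivot_wider(names_from = HobbyName,
              values_from = Flag,
              values_fill = list(Flag = 0L),  # missing combinations become 0
              names_prefix = "Hobby_")

# Attach the indicator columns to the survey table
result <- left_join(dfsurvey, wide, by = c("ID", "SurveyDate"))
print(result)
```

A survey with no rows at all in dfhobby would get NA rather than 0 in the Hobby_ columns after the join, so those would need to be replaced separately if such surveys exist.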