We have a data.frame, d2_cleaned:
> dput(d2_cleaned[1:3,c(2,4,5)])
structure(list(Customer.ID = c(110531L, 110531L, 110531L), Time.Spent.Watching = c(16032,
10919, 236), Video.ID.v26 = c(3661L, 4313L, 3661L)), .Names = c("Customer.ID",
"Time.Spent.Watching", "Video.ID.v26"), row.names = c(333515L,
333516L, 333522L), class = "data.frame")
>
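We have another data frame, distinct_customers_after_cleaning, whose first column is the unique user ID (Customer.ID), so each user appears exactly once. The remaining columns (ignoring column 2) are the individual videos. The data frame looks like this: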
> dput(distinct_customers_after_cleaning[1:5,1:5])
structure(list(Customer.ID = c(110531L, 318721L, 468491L, 568071L,
1390371L), Hits.After.Cleaning = c(58L, 44L, 98L, 6L, 5L), `3661` = c(0,
0, 0, 0, 0), `4313` = c(0, 0, 0, 0, 0), `3661.1` = c(0, 0, 0,
0, 0)), .Names = c("Customer.ID", "Hits.After.Cleaning", "3661",
"4313", "3661.1"), row.names = c(NA, 5L), class = "data.frame")
>
I want to fill in the values of distinct_customers_after_cleaning. To do that, I want to take Time.Spent.Watching from each row of d2_cleaned and add it to the right cell of distinct_customers_after_cleaning. To find the right cell, I need to match both the customer ID and the video ID:
if (d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID'])
if (d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y]))
This is the for loop I am using:
# fill in rows
for (j in 1:10000) {
  print(j)
  for (i in 1:nrow(d2_cleaned)) {
    if (d2_cleaned[i, 'Customer.ID'] == distinct_customers_after_cleaning[j, 'Customer.ID']) {
      for (y in 1:ncol(distinct_customers_after_cleaning)) {
        if (d2_cleaned[i, 'Video.ID.v26'] == names(distinct_customers_after_cleaning[y])) {
          distinct_customers_after_cleaning[j, y] <- distinct_customers_after_cleaning[j, y] + d2_cleaned[i, 'Time.Spent.Watching']
        }
      }
    }
  }
}
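Although this code does what I want, it is very slow (it would take 4 days to get through all the data). Can you recommend a better solution, perhaps one involving aggregate?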
Answer 0 (score: 0)
You can use dplyr and tidyr to achieve this. Here is an example:
library(dplyr)
library(tidyr)
df <- data.frame(CUSTOMERS = c('u1','u1','u2','u1','u2','u3'),
VIDEOS = c('v1','v1', 'v1', 'v2','v3','v2'),
TIME = c(7, 12, 9, 2, 6, 4))
df_summary <- df %>% group_by(CUSTOMERS, VIDEOS) %>% summarise(TIME_SUM = sum(TIME))
df_summary_spread <- df_summary %>% spread(key = 'VIDEOS', value = 'TIME_SUM')
df:
CUSTOMERS VIDEOS TIME
1 u1 v1 7
2 u1 v1 12
3 u2 v1 9
4 u1 v2 2
5 u2 v3 6
6 u3 v2 4
df_summary:
CUSTOMERS VIDEOS TIME_SUM
<fct> <fct> <dbl>
1 u1 v1 19.
2 u1 v2 2.
3 u2 v1 9.
4 u2 v3 6.
5 u3 v2 4.
df_summary_spread:
CUSTOMERS v1 v2 v3
<fct> <dbl> <dbl> <dbl>
1 u1 19. 2. NA
2 u2 9. NA 6.
3 u3 NA 4. NA
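Since the question asks about aggregate specifically, the same two steps (sum per customer/video pair, then pivot to wide) can also be sketched in base R. This is only a minimal sketch, assuming the three sample rows from the question's dput() output stand in for the full d2_cleaned. The names totals and wide_df are made up for illustration.

```r
# Sample rows copied from the question's dput() output (stand-in for the full data)
d2_cleaned <- data.frame(
  Customer.ID         = c(110531L, 110531L, 110531L),
  Time.Spent.Watching = c(16032, 10919, 236),
  Video.ID.v26        = c(3661L, 4313L, 3661L)
)

# Step 1: total watching time per (customer, video) pair
totals <- aggregate(Time.Spent.Watching ~ Customer.ID + Video.ID.v26,
                    data = d2_cleaned, FUN = sum)

# Step 2: pivot to one row per customer and one column per video.
# xtabs() puts 0 (not NA) in customer/video combinations that never occur.
wide <- xtabs(Time.Spent.Watching ~ Customer.ID + Video.ID.v26, data = totals)

# Convert the contingency table to a plain data.frame, keeping the
# video IDs as column names and moving the customer ID into a column
wide_df <- as.data.frame.matrix(wide)
wide_df <- cbind(Customer.ID = as.integer(rownames(wide_df)), wide_df)
rownames(wide_df) <- NULL

print(totals)
print(wide_df)
```

xtabs() sums by itself, so step 1 is not strictly required; it is kept here because the long-format totals are often useful on their own. Customers with no rows in d2_cleaned will not appear in wide_df. If they are needed, one option is to merge the result onto the customer list with merge(..., all.x = TRUE) and replace the resulting NA values with 0.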