我有一个data.frame
,其中第一行是分数。 2+列的分数可以相同。如何将它们优雅地组合成一个,并为这些列添加 select 行的总和?我用3个for
循环完成了它,但效率非常低。提前谢谢!
df = structure(list(`1542917` = c(21.03, 357, 140, 0, 0.15, 0.06,
0), `1542954` = c(21.07, 353, 7, 0, 0.15, 0.06, 0), `1542904` = c(21.19,
358, 5, 0, 0.15, 0.06, 0), `1542908` = c(21.19, 358, 6, 0, 0.15,
0.06, 0), `1542894` = c(21.37, 358, 2, 0, 0.15, 0.06, 0), `1542895` = c(21.37,
358, 5, 0, 0.15, 0.06, 0), `1542901` = c(21.37, 358, 77, 0, 0.15,
0.06, 1)), .Names = c("1542917", "1542954", "1542904", "1542908",
"1542894", "1542895", "1542901"), row.names = c("Score", "item_count",
"market_count", "3M Post-It Notes 1 ct./pk. ", "7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00",
"7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00", "Charlottesville, VA"
), class = "data.frame")
我所看到的:
row.names 1542917 1542954 1542904 1542908 1542894 1542895 1542901
1 Score 21.03 21.07 21.19 21.19 21.37 21.37 21.37
2 item_count 357.00 353.00 358.00 358.00 358.00 358.00 358.00
3 market_count 140.00 7.00 5.00 6.00 2.00 5.00 77.00
4 3M Post-It Notes 1 ct./pk. 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00 0.15 0.15 0.15 0.15 0.15 0.15 0.15
6 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00 0.06 0.06 0.06 0.06 0.06 0.06 0.06
7 Charlottesville, VA 0.00 0.00 0.00 0.00 0.00 0.00 1.00
我在追求的是什么(第3行总结,所有其他人都是基于相同分数的第一个实例。保证具有相同分数的列具有相同的数字,除了我要求的market_count
):
row.names 1542917 1542954 1542904 1542894
1 Score 21.03 21.07 21.19 21.37
2 item_count 357.00 353.00 358.00 358.00
3 market_count 140.00 7.00 11.00 84.00
4 3M Post-It Notes 1 ct./pk. 0.00 0.00 0.00 0.00
5 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00 0.15 0.15 0.15 0.15
6 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00 0.06 0.06 0.06 0.06
7 Charlottesville, VA 0.00 0.00 0.00 1.00
编辑 - 我的笨拙解决方案。问题是我有10k +行,而且反正非常缓慢和不优雅
Score = c(63.69, 27.31, 31.99, 25.41, 26.61, 28.35, 83.91, 22.59, 26.61,
21.73, 27.11, 26.99, 21.55, 26.99, 22.01, 21.93, 21.99, 24.39,
24.39, 25.31, 22.05, 21.55, 22.01, 22.33, 21.37, 21.37, 21.67,
26.13, 22.55, 27.11, 21.99, 21.37, 21.81, 20.71, 21.19, 21.87,
22.59, 29.61, 21.19, 27.21, 38.91, 28.81, 65.89, 28.71, 22.99,
39.85, 21.63, 21.03, 39.85, 29.41, 38.89, 34.87, 26.83, 30.85,
22.05, 28.05, 46.75, 27.31, 21.39, 21.73, 26.79, 21.55, 21.39,
29.17, 23.19, 21.07, 23.19, 21.73, 26.07, 22.01, 22.39, 46.47,
25.41, 21.39, 27.11, 21.55, 26.79, 21.87, 21.73, 21.55, 22.03,
22.35, 26.79, 27.31, 27.49, 27.11, 27.75, 26.13, NA)
un_score = unique(sort(Score))
print(un_score)
sum_mark = matrix(0,ncol=length(un_score),nrow=nrow(df))
for (i in 1:length(un_score)) {
for (j in 1:ncol(df)) {
for (k in 1:nrow(df)) {
if (df[1,j] == un_score[i]) {
if (k<3 | (k > 3 & k <= length(prod)+3)) sum_mark[k,i] = unique(df[k,j])
else sum_mark[k,i] = sum_mark[k,i] + df[k,j]
}
}
}
}
View(sum_mark)
在此示例中考虑length(prod) = 3
答案 0 :(得分:2)
使用基数R:
df1=setNames(df,df[1,])
df2=transform(stack(df1),i=rownames(df)[c(row(df1))],ind=sub("([.]\\d+).*","\\1",ind))
df3=aggregate(values~.,df2,function(x)sum(unique(x)))
reshape(df3,timevar = "ind",idvar = "i",dir="wide")
i values.21.03 values.21.07 values.21.19 values.21.37
1 3M Post-It Notes 1 ct./pk. 0.00 0.00 0.00 0.00
5 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00 0.15 0.15 0.15 0.15
9 7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00 0.06 0.06 0.06 0.06
13 Charlottesville, VA 0.00 0.00 0.00 1.00
17 item_count 357.00 353.00 358.00 358.00
21 market_count 140.00 7.00 11.00 84.00
25 Score 21.03 21.07 21.19 21.37
或者你可以这样做:
xtabs(values~i+ind,df3)
ind
i 21.03 21.07 21.19 21.37
3M Post-It Notes 1 ct./pk. 0.00 0.00 0.00 0.00
7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00 0.15 0.15 0.15 0.15
7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00 0.06 0.06 0.06 0.06
Charlottesville, VA 0.00 0.00 0.00 1.00
item_count 357.00 353.00 358.00 358.00
market_count 140.00 7.00 11.00 84.00
Score 21.03 21.07 21.19 21.37
如果你需要以上作为data.frame:
as.data.frame.matrix(xtabs(values~i+ind,df3))
21.03 21.07 21.19 21.37
3M Post-It Notes 1 ct./pk. 0.00 0.00 0.00 0.00
7Up Soft Drinks 12 oz. 12 ct./pk. 3/$10.00 0.15 0.15 0.15 0.15
7Up Soft Drinks 12 oz. 12 ct./pk. 3/$11.00 0.06 0.06 0.06 0.06
Charlottesville, VA 0.00 0.00 0.00 1.00
item_count 357.00 353.00 358.00 358.00
market_count 140.00 7.00 11.00 84.00
Score 21.03 21.07 21.19 21.37
使用data.table:
library(data.table)
dcast(setDT(melt(df))[,ind:=as.character(value[1]),by=variable][,
c("i","variable"):=.(c(row(df)),NULL)][,sum(unique(value)),by=.(i,ind)],i~ind,value.var="V1")
No id variables; using all as measure variables
i 21.03 21.07 21.19 21.37
1: 1 21.03 21.07 21.19 21.37
2: 2 357.00 353.00 358.00 358.00
3: 3 140.00 7.00 11.00 84.00
4: 4 0.00 0.00 0.00 0.00
5: 5 0.15 0.15 0.15 0.15
6: 6 0.06 0.06 0.06 0.06
7: 7 0.00 0.00 0.00 1.00
答案 1 :(得分:1)
请注意,结合了转置,tapply
和merge
。我删除了第三行(包含市场计数)以删除数据框中的非唯一行。
df2 <- as.data.frame(t(df))
x <- factor(df2$Score)
y <- tapply(df2$market_count, x, sum)
ns <- names(y)
df3 <- data.frame(Score = ns, market_count_sum = y)
df4 <- unique(merge(df2[, -3], df3))
as.data.frame(t(df4))