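I want to convert data from long format to wide format. Overall, there should be only one row per ColA. ColB through ColE contain repeated values, which I am trying to aggregate by count. ColF is aggregated with sum().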
library(readr)  # provides read_csv()
s <- read_csv("sample.csv")
s_1 <- subset(s, select=c("ColA", "ColF"))
grp_by <- aggregate(. ~ ColA , data = s_1, FUN = sum)
head(grp_by)
I am not sure how to convert the remaining columns.
Update: following the suggestion to use the reshape2 package
library(readr)     # provides read_csv()
library(reshape2)  # provides dcast()
s <- read_csv("sample.csv")
s_1 <- subset(s, select=c("ColA", "ColF"))
grp_by <- aggregate(. ~ ColA , data = s_1, FUN = sum)
s2 <- dcast(s, ColA ~ ColB)
s3 <- dcast(s, ColA ~ ColC)
s4 <- dcast(s, ColA ~ ColD)
s5 <- dcast(s, ColA ~ ColE)
print(s2)
print(s3)
print(s4)
print(s5)
print(grp_by)
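Here is the output of those print statements.
How can I merge all of these into a single data frame? My actual dataset has 1 million records. Is this code good enough to run on it, or is there a better way to write it? Thanks for your help.
Answer 0 (score: 0)
This is the sample code I used to transform and merge the data. There may be better approaches, but this is the best I could come up with.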
# Include needed libraries
library(readr)     # provides read_csv()
library(reshape2)  # provides dcast()
# Load the sample data
s <- read_csv("sample.csv")
# Aggregate ColF by SUM for each ColA
s_1 <- subset(s, select=c("ColA", "ColF"))
grp_by <- aggregate(. ~ ColA , data = s_1, FUN = sum)
# Long to wide format; where a cell holds several rows, dcast falls back to length (a count)
s2 <- dcast(s, ColA ~ ColB)
s3 <- dcast(s, ColA ~ ColC)
s4 <- dcast(s, ColA ~ ColD)
s5 <- dcast(s, ColA ~ ColE)
# But this is the crude way of removing NA columns which I used!
# Rename the NA column into something so that it can be removed by assigning NULL!!
colnames(s2)[7] <- "RemoveMe"
colnames(s3)[5] <- "RemoveMe"
colnames(s4)[5] <- "RemoveMe"
colnames(s5)[4] <- "RemoveMe"
s2$RemoveMe <- NULL
s3$RemoveMe <- NULL
s4$RemoveMe <- NULL
s5$RemoveMe <- NULL
# Merge all pieces to form the final transformed data
s2 <- merge(x = s2, y = s3, by="ColA", all = TRUE)
s2 <- merge(x = s2, y = s4, by="ColA", all = TRUE)
s2 <- merge(x = s2, y = s5, by="ColA", all = TRUE)
s2 <- merge(x = s2, y = grp_by, by="ColA", all = TRUE)
# Removing the row with user_id = NA!!
s2 <- s2[-c(4), ]
# Final transformed data
print(s2)
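The hard-coded column positions and row index above only fit this particular sample. As a rough sketch of the same idea, the NA column can be dropped by name and the pieces merged with Reduce(). The data frame below is made up to stand in for sample.csv, and cast_counts is a hypothetical helper, not part of reshape2.

```r
library(reshape2)

# Hypothetical stand-in for sample.csv (the real file is not shown)
s <- data.frame(
  ColA = c("a1", "a1", "a2", "a2", "a3"),
  ColB = c("b1", "b1", "b2", NA, "b1"),
  ColC = c("c1", "c2", "c1", "c1", NA),
  ColF = c(10, 20, 5, 7, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count the values of `col` per ColA,
# then drop the column produced by NA values by name, not by position
cast_counts <- function(df, col) {
  wide <- dcast(df, as.formula(paste("ColA ~", col)),
                fun.aggregate = length, value.var = col)
  keep <- !(is.na(colnames(wide)) | colnames(wide) == "NA")
  wide[, keep, drop = FALSE]
}

# Drop rows whose key is NA instead of removing a fixed row number afterwards
s <- s[!is.na(s$ColA), ]

# One wide piece per categorical column
pieces <- lapply(c("ColB", "ColC"), function(col) cast_counts(s, col))

# Sum ColF per ColA
totals <- aggregate(ColF ~ ColA, data = s, FUN = sum)

# Merge every piece on ColA in a single pass
result <- Reduce(function(x, y) merge(x, y, by = "ColA", all = TRUE),
                 c(pieces, list(totals)))
print(result)
```

If two source columns share a value (say both ColB and ColC contain "x"), merge() will append .x and .y suffixes to the clashing names, so prefixing each piece's columns with its source column name may be worth doing first.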
I used these as references: