Say I have the following data.frame, which associates the name of each R package with the CRAN task view it belongs to:
dictionary <- data.frame(task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)), package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"))
#                  task.view         package
# High.Performance.Computing            Rcpp
# High.Performance.Computing HadoopStreaming
# High.Performance.Computing           rJava
#           Machine.Learning           e1071
#           Machine.Learning            nnet
#           Machine.Learning           RWeka
I then count how many times each package is called from each of four tools written by students:
package.referals <- data.frame(Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1), e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1), row.names = paste("student pkg", 1:4))
#               Rcpp HadoopStreaming rJava e1071 nnet RWeka
# student pkg 1    1               1     1     1    1     1
# student pkg 2    0               0     0     1    0     0
# student pkg 3    1               0     0     1    0     0
# student pkg 4    1               0     1     1    0     1
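How can I restructure the columns of the package.referals data.frame above according to the package-to-task-view data.frame, so that there is one column per task view holding the sum over its packages?
For example, I would like this output: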
data.frame(High.Performance.Computing = c(3, 0, 1, 2), Machine.Learning = c(3, 1, 1, 2), row.names = paste("student pkg", 1:4))
#               High.Performance.Computing Machine.Learning
# student pkg 1                          3                3
# student pkg 2                          0                1
# student pkg 3                          1                1
# student pkg 4                          2                2
I tried the following, but got stuck when trying to reshape the result into my desired output (summing and transposing):
require(data.table)
# column names of package.referals data.frame
package.referals.colnames <- names(package.referals)
# a data.table of my task view and package relations, keyed by package name
dictionary.dt <- data.table(dictionary, key = "package")
# a data.table of my package.referals data.frame, transposed, and keyed by package name
package.referals.dt <- data.table(package = package.referals.colnames, t(package.referals), key="package")
# Joining data.tables so that the package name and corresponding task view are on the same line
dt <- package.referals.dt[J(dictionary.dt)]
setkey(dt, "task.view")
#            package student pkg 1 student pkg 2 student pkg 3 student pkg 4                  task.view
# 1: HadoopStreaming             1             0             0             0 High.Performance.Computing
# 2:            Rcpp             1             0             1             1 High.Performance.Computing
# 3:           rJava             1             0             0             1 High.Performance.Computing
# 4:           e1071             1             1             1             1           Machine.Learning
# 5:            nnet             1             0             0             0           Machine.Learning
# 6:           RWeka             1             0             0             1           Machine.Learning
Answer 0 (score: 4)
Here is an approach using reshape2 and base R:
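library(reshape2)  # provides melt() with the variable.name argument
# keep the row names as an id column, then go to long format (one row per id/package pair)
package.referals$id <- rownames(package.referals)
pkgr <- melt(package.referals, id.vars = "id", variable.name = "package")
# keep only the packages that were actually called
pkgr <- pkgr[pkgr$value > 0, ]
# attach each package's task view, then count rows per tool and task view
df <- merge(pkgr, dictionary, all.x = TRUE)
table(df$id, df$task.view)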
If you really want to use data.table instead of merge, you can replace the last two lines (the merge and table calls) with the following:
pkgr <- data.table(pkgr, key="package")
dictionary <- data.table(dictionary, key="package")
df <- pkgr[dictionary]
table(df$id, df$task.view)
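One caveat with table() in this answer: it counts rows rather than summing value, so it matches the desired output only because every count in the example is 0 or 1. The following is a minimal base-R sketch, assuming counts may exceed 1, that sums the values with xtabs() instead. The names long and res are illustrative and do not come from the answer.

```r
dictionary <- data.frame(
  task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
  package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"),
  stringsAsFactors = FALSE
)
package.referals <- data.frame(
  Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1),
  e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1),
  row.names = paste("student pkg", 1:4)
)

# Long format built with base R only: unlist() walks the data.frame column by
# column, so ids cycle fastest and each package name repeats once per row.
long <- data.frame(
  id = rep(rownames(package.referals), times = ncol(package.referals)),
  package = rep(names(package.referals), each = nrow(package.referals)),
  value = unlist(package.referals, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Attach the task view of each package.
long <- merge(long, dictionary, by = "package")

# Sum the call counts per (id, task.view) cell instead of counting rows.
res <- xtabs(value ~ id + task.view, data = long)
res
```

With 0/1 data this gives the same cells as table(); the two only diverge once a package is called more than once by the same tool.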
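Answer 1 (score: 2)
You can match the columns of package.referals against the dictionary, rename them to their task views, and then run rowSums over the columns that share a name: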
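# rename each column to the task view of its package
names( package.referals ) <- as.character( dictionary$task.view[ match( names( package.referals ) , dictionary$package ) ] )
# sum the columns sharing a task view; drop = FALSE keeps a data.frame even if a task view has a single package
sapply( unique( names( package.referals ) ) , function(x) rowSums( package.referals[ , names( package.referals ) %in% x , drop = FALSE ] ) )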
#               High.Performance.Computing Machine.Learning
# student pkg 1                          3                3
# student pkg 2                          0                1
# student pkg 3                          1                1
# student pkg 4                          2                2
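The same grouping can be done without renaming the columns of package.referals in place. The following is a hedged base-R sketch using rowsum(), which adds up matrix rows that share a group label; the names groups and res are illustrative and not part of the answer.

```r
dictionary <- data.frame(
  task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
  package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"),
  stringsAsFactors = FALSE
)
package.referals <- data.frame(
  Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1),
  e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1),
  row.names = paste("student pkg", 1:4)
)

# Task view of each column, in column order.
groups <- dictionary$task.view[match(names(package.referals), dictionary$package)]

# rowsum() groups rows, so transpose to put packages in rows, sum them by
# task view, and transpose back to one row per student tool.
res <- t(rowsum(t(as.matrix(package.referals)), group = groups))
res
```

The result is a numeric matrix; wrap it in as.data.frame() if a data.frame is needed.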
Answer 2 (score: 2)
You can also put all the information into a single data.frame and then aggregate:
dictionary <- data.frame(task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)), package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"))
package.referals <- data.frame(Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1), e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1), row.names = paste("student pkg", 1:4))
pack.ref <- as.data.frame(t(package.referals)) # transpose so that each package is a row
pack.ref$task.view <- as.character(dictionary$task.view[match(rownames(pack.ref), dictionary$package)]) # task view of each package; match() compares whole names, whereas grep() would also hit partial matches
DF <- as.data.frame(t(aggregate(pack.ref[,1:4], by = list(pack.ref$task.view), sum))) # sum per task view, then transpose back
DF
#                                       V1               V2
# Group.1       High.Performance.Computing Machine.Learning
# student pkg 1                          3                3
# student pkg 2                          0                1
# student pkg 3                          1                1
# student pkg 4                          2                2
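Note that transposing the whole aggregate() result mixes the group labels with the sums, so the output above is a character table with a Group.1 row. The following is a minimal sketch of one way to keep the sums numeric and use the task views as column names; the names task.view, agg and res are illustrative and not from the answer.

```r
dictionary <- data.frame(
  task.view = c(rep("High.Performance.Computing", 3), rep("Machine.Learning", 3)),
  package = c("Rcpp", "HadoopStreaming", "rJava", "e1071", "nnet", "RWeka"),
  stringsAsFactors = FALSE
)
package.referals <- data.frame(
  Rcpp = c(1, 0, 1, 1), HadoopStreaming = c(1, 0, 0, 0), rJava = c(1, 0, 0, 1),
  e1071 = c(1, 1, 1, 1), nnet = c(1, 0, 0, 0), RWeka = c(1, 0, 0, 1),
  row.names = paste("student pkg", 1:4)
)

# One row per package, one column per student tool.
pack.ref <- as.data.frame(t(package.referals))

# Task view of each package, looked up by exact name.
task.view <- dictionary$task.view[match(rownames(pack.ref), dictionary$package)]

# Sum each student column within each task view.
agg <- aggregate(pack.ref, by = list(task.view = task.view), FUN = sum)

# Transpose only the numeric columns, then label them with the task views.
res <- as.data.frame(t(agg[, -1]))
names(res) <- as.character(agg$task.view)
res
```

Here res has one numeric column per task view and one row per student tool, which is the shape asked for in the question.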