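I want to convert a CSV, in which each row holds a user and one feature of that user, into a data table. Each user has several rows, and each row describes one aspect of that user. For example,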
+---------+---------+
| User Id | Feature |
+---------+---------+
| user_1 | male |
| user_2 | female |
| user_1 | teen |
| user_2 | adult |
+---------+---------+
The output I want looks like this:
+---------+-------+--------+-------+-------+
| User Id | male | female | teen | adult |
+---------+-------+--------+-------+-------+
| user_1 | TRUE | FALSE | TRUE | FALSE |
| user_2 | FALSE | TRUE | FALSE | TRUE |
+---------+-------+--------+-------+-------+
The code below is what I came up with first. Unfortunately, R runs out of memory while processing it.
library(data.table)
data <- fread(file = "input.csv",
              col.names = c("userId", "feature"),
              colClasses = c("character", "character"),
              showProgress = TRUE,
              key = c("userId", "feature"))
# For one user's features, return a list of TRUE/FALSE flags, one per known feature.
normalizeFunction <- function(featureForOne) {
  as.list(!is.na(match(allFeatures, featureForOne)))
}
allFeatures = data[, unique(feature)]
normalizedData = data[, c(allFeatures) := normalizeFunction(feature), keyby = .(userId)]
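In the end I had to settle for processing each user one at a time in a loop. However, I feel I am not really taking advantage of data.table. Could someone comment on my solution?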
library(foreach)
allUsers = unique(data$userId)
# Build one row of flags per user and bind the rows together.
normalizedData <- foreach(user = allUsers, .combine = rbind) %do% {
  featuresForUser = data[userId == user]
  featuresForUser[, normalizeFunction(feature), by = .(userId)]
}
setnames(normalizedData, c("userId", allFeatures))
Answer 0 (score: 1)
I think something like this should work:
x <- fread('
User_Id Feature
user_1 male
user_2 female
user_1 teen
user_2 adult ')
A temporary variable that we will use with fun=any below:
x[,a:=TRUE,]
The actual reshaping to wide format:
dcast(x, User_Id ~ Feature, fun=any, value.var="a")
# User_Id adult female male teen
# 1: user_1 FALSE FALSE TRUE TRUE
# 2: user_2 TRUE TRUE FALSE FALSE
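If you would rather not add the helper column, a possible variation is to let dcast count the rows per user/feature pair and then turn the counts into flags. This is only a sketch under the same sample data; the names wide and featureCols are made up for illustration, and I have not measured how it behaves on a large input.

```r
library(data.table)

# Same sample data as above, built directly instead of parsed from text.
x <- data.table(
  User_Id = c("user_1", "user_2", "user_1", "user_2"),
  Feature = c("male", "female", "teen", "adult")
)

# Count rows per (User_Id, Feature) pair; pairs that never occur get 0.
wide <- dcast(x, User_Id ~ Feature, fun.aggregate = length, value.var = "Feature")

# Convert the counts into TRUE/FALSE flags by reference (no copy of the table).
featureCols <- setdiff(names(wide), "User_Id")
wide[, (featureCols) := lapply(.SD, function(v) v > 0L), .SDcols = featureCols]
```

Counting first also tolerates duplicate rows in the input, since any count above zero simply becomes TRUE.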
Answer 1 (score: 1)
A tidyverse approach:
library(tidyverse)
df %>% mutate(tmp = TRUE) %>% spread(Feature, tmp) %>% replace(., is.na(.), FALSE)
Output:
User_Id adult female male teen
1 user_1 FALSE FALSE TRUE TRUE
2 user_2 TRUE TRUE FALSE FALSE
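In newer tidyr releases spread() is superseded by pivot_wider(), which can fill the missing cells in the same call. A rough equivalent, assuming a tidyr version recent enough to accept a scalar values_fill, and with df rebuilt here from the sample data (the name wide is only illustrative):

```r
library(dplyr)
library(tidyr)

# Sample data from the question, as a tibble.
df <- tibble(
  User_Id = c("user_1", "user_2", "user_1", "user_2"),
  Feature = c("male", "female", "teen", "adult")
)

wide <- df %>%
  mutate(tmp = TRUE) %>%                       # marker for "user has this feature"
  pivot_wider(names_from = Feature,
              values_from = tmp,
              values_fill = FALSE)             # absent combinations become FALSE
```

Unlike spread(), this needs no separate replace() step for the NA cells.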