Question

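I want to convert a CSV in which each row holds a user and one feature of that user into a data table. Each user has multiple rows, and each row describes one aspect of that user. For example: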

+---------+---------+
| User Id | Feature |
+---------+---------+
| user_1  | male    |
| user_2  | female  |
| user_1  | teen    |
| user_2  | adult   |
+---------+---------+

The output I want looks like this:

+---------+-------+--------+-------+-------+
| User Id | male  | female | teen  | adult |
+---------+-------+--------+-------+-------+
| user_1  | TRUE  | FALSE  | TRUE  | FALSE |
| user_2  | FALSE | TRUE   | FALSE | TRUE  |
+---------+-------+--------+-------+-------+

The code below is what I came up with first. Unfortunately, R runs out of memory during processing.

library(data.table)

# Read both columns as character and key the table on them
data <- fread( file="input.csv", 
               col.names=c("userId","feature"), 
               colClasses=c("character", "character"), 
               showProgress=TRUE,
               key=c("userId","feature")
              )


normalizeFunction <- function(featureForOne) {
      as.list(!is.na(match(allFeatures, featureForOne)))
} 

allFeatures = data[, unique(feature)]

normalizedData = data[ , c(allFeatures) := normalizeFunction(feature) , keyby=.(userId)]

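In the end, I had to fall back on processing each user one at a time in a loop. However, I feel that I am not taking advantage of data.table. Could someone comment on my solution?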

library(foreach)

allUsers = unique(data$userId)

normalizedData <- foreach (user = allUsers, .combine=rbind) %do% {
  featuresForUser = data[ userId == user ]
  featuresForUser [ , normalizeFunction(feature), by=.(userId) ]
}

setnames(normalizedData, c("userId", allFeatures))

Answer 1

I think something like this:

library(data.table)

x <- fread('
User_Id Feature
user_1  male   
user_2  female 
user_1  teen   
user_2  adult  ')

A temporary variable that we will use with fun=any below:

x[,a:=TRUE,]

The actual cast to wide format:

dcast(x, User_Id ~ Feature, fun=any, value.var="a")
#    User_Id adult female  male  teen
# 1:  user_1 FALSE  FALSE  TRUE  TRUE
# 2:  user_2  TRUE   TRUE FALSE FALSE
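
A possible variation on this cast, sketched here rather than taken from the answer: count the rows in each (User_Id, Feature) cell with length and then turn the counts into logical flags, which avoids the temporary column. The names counts and feature_cols are made up for this illustration, and the toy table is rebuilt inline so the snippet runs on its own.

```r
library(data.table)

# Same toy data as in the question, built directly (assumed column names).
x <- data.table(
  User_Id = c("user_1", "user_2", "user_1", "user_2"),
  Feature = c("male", "female", "teen", "adult")
)

# Count rows per (User_Id, Feature) cell. Cells with no rows come out as 0,
# because dcast applies the aggregate to a zero-length vector to get the fill.
counts <- dcast(x, User_Id ~ Feature, fun.aggregate = length, value.var = "Feature")

# Convert the counts to TRUE/FALSE flags by reference.
feature_cols <- setdiff(names(counts), "User_Id")
counts[, (feature_cols) := lapply(.SD, function(v) v > 0L), .SDcols = feature_cols]

print(counts)
```

Whether this helps with the memory problem in the question depends on the number of distinct features, since the wide result still has one column per feature.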

Answer 2

A tidyverse approach:

library(tidyverse)

df %>% mutate(tmp = TRUE) %>% spread(Feature, tmp) %>% replace(., is.na(.), FALSE)

Output:

  User_Id adult female  male  teen
1  user_1 FALSE  FALSE  TRUE  TRUE
2  user_2  TRUE   TRUE FALSE FALSE
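
For readers on a recent tidyr, the same idea can probably be written with pivot_wider, which tidyr offers as the newer alternative to spread. This is a sketch, not part of the original answer: the data frame df is reconstructed here from the question's example, and wide is a name chosen for illustration.

```r
library(dplyr)
library(tidyr)

# Toy input reconstructed from the question (assumed column names).
df <- tibble(
  User_Id = c("user_1", "user_2", "user_1", "user_2"),
  Feature = c("male", "female", "teen", "adult")
)

wide <- df %>%
  mutate(tmp = TRUE) %>%                        # flag every observed pair
  pivot_wider(names_from = Feature,             # one column per feature
              values_from = tmp,
              values_fill = list(tmp = FALSE))  # unobserved pairs become FALSE

print(wide)
```

Using values_fill removes the need for the separate replace step on NA values.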

Best way to convert feature rows into a table with features as columns
