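I have an R data frame of movies from IMDB.
(Here is the data file, which is tab-separated: http://had.co.nz/data/movies/movies.tab.gz)
The genres are encoded as binary indicator columns: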
$ Action (int) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,...
$ Animation (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Comedy (int) 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,...
$ Drama (int) 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,...
$ Documentary (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Romance (int) 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
$ Short (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
Is there an elegant, R-native way to turn these binary columns into a string such as "Comedy, Romance", stored in the same data frame?
Thanks in advance for your help!
Answer 0 (score: 2)
I think this is what you want.
# Create some toy data like yours
set.seed(1)
n <- 5
ds <- as.data.frame(replicate(7, sample(0:1, n, replace = TRUE)))
names(ds) <- c("Action", "Animation", "Comedy", "Drama",
"Documentary", "Romance", "Short")
print(ds)
# Action Animation Comedy Drama Documentary Romance Short
#1 0 1 0 0 1 0 0
#2 0 1 0 1 0 0 1
#3 1 1 1 1 1 0 0
#4 1 1 0 0 0 1 0
#5 0 0 1 1 0 0 1
# Use each row as indicator vector
apply(ds, 1, function(r) paste(names(ds)[as.logical(r)], collapse = ", "))
#[1] "Animation, Documentary"
#[2] "Animation, Drama, Short"
#[3] "Action, Animation, Comedy, Drama, Documentary"
#[4] "Action, Animation, Romance"
#[5] "Comedy, Drama, Short"
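The question also asks for the string to live in the same data frame. A minimal sketch of that step, assuming the genre columns can be listed by name; the `movies` toy data, the `genre_cols` vector, and the `genres` column name are made up for illustration and are not part of the original data:

```r
# Hypothetical toy data: one non-genre column plus three genre indicators
movies <- data.frame(
  title   = c("A", "B", "C"),
  Action  = c(0L, 1L, 0L),
  Comedy  = c(1L, 0L, 0L),
  Romance = c(1L, 1L, 0L)
)

# Assumed list of indicator columns; restricting apply() to these keeps
# non-genre columns such as title out of the row vectors
genre_cols <- c("Action", "Comedy", "Romance")

# Build one string per row and store it as a new column
movies$genres <- apply(
  movies[genre_cols], 1,
  function(r) paste(genre_cols[r == 1], collapse = ", ")
)

print(movies)
```

Rows with no genre set (movie "C" here) end up with an empty string rather than NA.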
Answer 1 (score: 0)
Here is an approach using data.table:
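library(data.table)
library(reshape2)
# ds is the toy data frame from the answer above.
# Melt the matrix to long form (Var1 = row, Var2 = genre, value = 0/1),
# keep the 1s, then paste the genre names together per row.
setDT(melt(as.matrix(ds)))[value != 0][, .(genres = toString(Var2)), by = Var1]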
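The same melt, filter, and paste idea can also be sketched with data.table alone, then joined back so the string sits next to the original columns. This assumes an `id` column identifies each row; the toy table and the `id`, `genre`, and `genres` names are illustrative only:

```r
library(data.table)

# Hypothetical toy table with an explicit row identifier
dt <- data.table(
  id      = 1:3,
  Action  = c(0L, 1L, 0L),
  Comedy  = c(1L, 0L, 0L),
  Romance = c(1L, 1L, 0L)
)

# Long form: one row per (id, genre) pair
long <- melt(dt, id.vars = "id", variable.name = "genre")

# Keep the genres that are set and collapse them per id
res <- long[value == 1, .(genres = toString(genre)), by = id]

# Update join: add the string to the original table by reference
dt[res, on = "id", genres := i.genres]

print(dt)
```

Rows without any genre are dropped by the filter, so after the join they hold NA in `genres`.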
Answer 2 (score: 0)
I would also go with data.table:
library(readr)
library(data.table)
dt <- read_tsv("http://had.co.nz/data/movies/movies.tab.gz")
dt <- setkey(melt(setDT(dt), id.vars=1:17)[value==1], "title")
(dt <- unique(dt[dt[, .(categories=list(variable)), by=title]][, c("variable", "value"):=NULL]))
# title year length budget rating votes r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 mpaa categories
# 1: $ 1971 121 NA 6.4 348 4.5 4.5 4.5 4.5 14.5 24.5 24.5 14.5 4.5 4.5 NA Comedy,Drama
# 2: $1000 a Touchdown 1939 71 NA 6.0 20 0.0 14.5 4.5 24.5 14.5 14.5 14.5 4.5 4.5 14.5 NA Comedy
# 3: $21 a Day Once a Month 1941 7 NA 8.2 5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 44.5 24.5 24.5 NA Animation,Short
# 4: $40,000 1996 70 NA 8.2 6 14.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 34.5 45.5 NA Comedy
# 5: $pent 2000 91 NA 4.3 45 4.5 4.5 4.5 14.5 14.5 14.5 4.5 4.5 14.5 14.5 NA Drama
# ---
# 44177: sIDney 2002 15 NA 7.0 8 14.5 0.0 0.0 14.5 0.0 0.0 24.5 14.5 14.5 24.5 NA Action,Short
# 44178: tom thumb 1958 98 NA 6.5 274 4.5 4.5 4.5 4.5 14.5 14.5 24.5 14.5 4.5 4.5 NA Animation
# 44179: www.XXX.com 2003 105 NA 1.1 12 45.5 0.0 0.0 0.0 0.0 0.0 24.5 0.0 0.0 24.5 NA Drama,Romance
# 44180: xXx 2002 132 85000000 5.5 18514 4.5 4.5 4.5 4.5 14.5 14.5 14.5 14.5 4.5 4.5 PG-13 Action
# 44181: xXx: State of the Union 2005 101 87000000 3.9 1584 24.5 4.5 4.5 4.5 4.5 14.5 4.5 4.5 4.5 14.5 PG-13 Action
You may want to keep the categories column as a vector or list so that it stays easy to work with:
head(dt$categories, 2)
# [[1]]
# [1] Comedy Drama
# Levels: Action Animation Comedy Drama Documentary Romance Short
#
# [[2]]
# [1] Comedy
# Levels: Action Animation Comedy Drama Documentary Romance Short
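As a rough illustration of why a list column is convenient, the sketch below filters on membership and only collapses to a string at the end. The `categories` list is a hand-built stand-in for `dt$categories`, and the level set is shortened for brevity:

```r
# Stand-in for a list column of factors (made-up values)
lv <- c("Action", "Comedy", "Drama")
categories <- list(
  factor(c("Comedy", "Drama"), levels = lv),
  factor("Comedy", levels = lv)
)

# Membership test per element, e.g. to select all dramas
has_drama <- vapply(categories, function(g) "Drama" %in% g, logical(1))

# Collapse each element to a single string only when needed for display
as_text <- vapply(categories, toString, character(1))

print(has_drama)
print(as_text)
```

With a plain string column, the same membership test would need pattern matching on the text instead.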