Question

I have a large dataset containing text comments and their ratings on different variables, for example:

df <- data.frame(
  comment = c("commentA","commentB","commentB","commentA","commentA","commentC"),
  sentiment=c(1,2,1,4,1,2), 
  tone=c(1,5,3,2,6,1)
)

Each comment appears 1 to 3 times, because sometimes several people are asked to rate the same comment.

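I am looking for a data frame in which the "comment" column contains only unique values and the other columns are appended, so that each text comment has as many "sentiment" and "tone" columns as it has ratings (this produces NAs for comments that were rated less often, but that is fine):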

df <- data.frame(
  comment = c("commentA","commentB","commentC"),
  sentiment.1=c(1,2,2), 
  sentiment.2=c(4,1,NA), 
  sentiment.3=c(1,NA,NA), 
  tone.1=c(1,5,1),
  tone.2=c(2,3,NA),
  tone.3=c(6,NA,NA)
)

I have been trying to solve this with reshape, going from long to wide with

reshape(df, 
  idvar = "comment",
  timevar = c("sentiment","tone"), 
  direction = "wide"
)

But this produces every possible combination of sentiment and tone, rather than simply spreading sentiment and tone independently.

I also tried gather, as in df %>% gather(key, value, -comment), but that only gets me halfway there...

Can anyone point me in the right direction?

Answer 1

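You need to create a variable that supplies the number in the column names. rowid(comment) does the trick: it numbers each occurrence of a comment in order of appearance.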
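To see what that numbering looks like, here is a minimal sketch that applies rowid() to the comment vector from the question. The variable name occurrence is only used for this illustration.

```r
library(data.table)

# Same comment vector as in the question
comment <- c("commentA", "commentB", "commentB",
             "commentA", "commentA", "commentC")

# rowid() counts how many times each value has been seen so far,
# so the first commentA gets 1, the second gets 2, and so on
occurrence <- rowid(comment)
print(occurrence)
# [1] 1 1 2 2 3 1
```

These counts are what end up as the _1, _2, _3 suffixes in the wide result.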

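In dcast, put the row identifier on the left of the ~ and the column identifier on the right. value.var is then a character vector of all the columns to include in this long-to-wide conversion.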

library(data.table)
setDT(df)

dcast(df, comment ~ rowid(comment), value.var = c('sentiment', 'tone'))

#     comment sentiment_1 sentiment_2 sentiment_3 tone_1 tone_2 tone_3
# 1: commentA           1           4           1      1      2      6
# 2: commentB           2           1          NA      5      3     NA
# 3: commentC           2          NA          NA      1     NA     NA
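If data.table is not an option, the same idea can probably be carried over to base R's reshape, which the question already tried. The sketch below is an assumption-laden alternative rather than part of the solution above: it builds the per-comment counter with ave() in a helper column named n (a name chosen only for this example) and uses that single column as timevar.

```r
# Rebuild the example data from the question as a plain data.frame
df <- data.frame(
  comment   = c("commentA", "commentB", "commentB",
                "commentA", "commentA", "commentC"),
  sentiment = c(1, 2, 1, 4, 1, 2),
  tone      = c(1, 5, 3, 2, 6, 1)
)

# Number the ratings within each comment (1, 2, 3, ...);
# this plays the same role as rowid(comment) in data.table
df$n <- ave(seq_along(df$comment), df$comment, FUN = seq_along)

# Use the counter as the single time variable, so sentiment and tone
# are each spread into sentiment.1, sentiment.2, ... and tone.1, tone.2, ...
wide <- reshape(df, idvar = "comment", timevar = "n", direction = "wide")
print(wide)
```

The key difference from the attempt in the question is that timevar is the counter, not the value columns themselves, so sentiment and tone are no longer crossed with each other.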

R: merging duplicate rows by appending columns
