Question

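I'm trying to generate network graph data from raw occurrence data. In the raw data, I have occurrences of features in various contexts. Let's say it's actors in different movies. Each row is [context, feature, weight], where weight might be the amount of screen time. Here is a toy dataset: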

df <- data.frame(context = sample(LETTERS[1:10], 500, replace=TRUE),
             feature = sample(LETTERS, 500, replace=TRUE),
             weight = sample(1:100, 500, replace=TRUE)
             )

So for movie A we might have 20 rows, each of which holds an actor's name and their screen time in that movie.

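What I'd like to generate is all pairwise combinations of actors for each movie, along with the sum of their respective weights. For example, if we start from: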

[A, A, 5]
[A, B, 2]

I would like the output in the format [context, feature1, feature2, sum.weight]. So:

[A, A, B, 7]

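I know how to do this with a combination of for loops, but I'm wondering whether there is a more "canonical R" way to go about it, particularly with something like data.table.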

Answer 1

Here is a possible solution using the data.table package:

library(data.table)

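# the data.table built under "Data" below must already exist;
# data.frame() in R >= 4.0 keeps strings as character, so make feature a factor
df[, feature := factor(feature)]

# keep a record of feature's levels
feature.levels <- levels(df$feature)

# for each context, create a data table for all pair combinations of features
# (as integer codes) & the sum of said pair's weights
df <- df[,
   as.data.table(
     cbind(t(combn(as.integer(feature), 2)),
           rowSums(t(combn(weight, 2))))
   ),
   by = context]

# map the integer codes back to factors; giving levels explicitly keeps the
# mapping correct even when a column does not contain every code
df[,
   c('V1', 'V2') := lapply(.SD,
                           function(x){factor(x, levels = seq_along(feature.levels),
                                              labels = feature.levels)}),
   .SDcols = c('V1', 'V2')]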

# rename features / sum weights
setnames(df,
         old = c("V1", "V2", "V3"),
         new = c("feature1", "feature2", "sum.weights"))

> head(df)
   context feature1 feature2 sum.weights
1:       C        j        l         373
2:       C        j        z         282
3:       C        j        v         382
4:       C        j        h         488
5:       C        j        c         280
6:       C        j        u         360

Data (I used lowercase letters for "feature" so that it is visually distinct from the uppercase "context"). Run this before the code above:

set.seed(123)
df <- data.frame(context = sample(LETTERS[1:10], 500, replace=TRUE),
                 feature = sample(letters, 500, replace=TRUE),
                 weight = sample(1:100, 500, replace=TRUE))

# convert to data table & summarize to unique combinations by context + feature
setDT(df)
df <- df[, 
         list(weight = sum(weight)), 
         by = list(context, feature)]
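
The same idea can also be sketched without the factor round-trip by combining row positions rather than values. The snippet below is only an illustration under stated assumptions: `toy` and `pairs` are hypothetical names, and the rows beyond the question's two-row example (feature C in context A, and the single-feature context B) are made up to show that contexts with fewer than two features are skipped.

```r
library(data.table)

# the question's two-row example plus hypothetical extra rows
toy <- data.table(context = c("A", "A", "A", "B"),
                  feature = c("A", "B", "C", "A"),
                  weight  = c(5, 2, 1, 4))

# pair up row positions within each context, then look up features and weights;
# a context with fewer than two rows yields NULL and is dropped from the result
pairs <- toy[, {
  if (.N >= 2L) {
    idx <- combn(.N, 2)   # 2 x choose(.N, 2) matrix of row positions
    list(feature1    = feature[idx[1, ]],
         feature2    = feature[idx[2, ]],
         sum.weights = weight[idx[1, ]] + weight[idx[2, ]])
  }
}, by = context]

print(pairs)
```

Because the pairs are built from positions, `feature` keeps whatever type it had (character here), and the `.N >= 2L` guard avoids the error `combn()` raises when asked for pairs from a single row.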

Computing pair combinations of a factor vector with summed values
