Question

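I want to apply the calculations (Occ_1+1)/(Totl_1+Unique_words), (Occ_2+1)/(Totl_2+Unique_words) and (Occ_3+1)/(Totl_3+Unique_words), and store the results in new columns named Probability_1, Probability_2 and Probability_3 respectively.

Right now I am doing each calculation separately and then combining the results.

For example, for (Occ_1+1) I am doing sapply(df$Occ_1, function(x){x+1}).

I have almost 50 Occ_ and 50 Totl_ columns, so my code becomes very long if I do every calculation separately.
Is there a way to do all the calculations at once?

Sample DF, shown only up to Occ_3 and Totl_3: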

 word        Occ_1  Occ_2  Occ_3  Totl_1 Totl_2 Totl_3 Unique_words
  <chr>      <int>  <int>  <int>  <int>  <int>  <int>        <int>
 1 car          0     1     0     11      9      7           17
 2 saturn       2     0     2     11      9      7           17
 3 survival     1     2     0     11      9      7           17
 4 baseball     1     1     0     11      9      7           17
 5 color        0     0     1     11      9      7           17
 6 muscle       0     1     0     11      9      7           17

Answer 1

I would simply collect all the Occ.. and Totl.. columns and perform the required arithmetic on them in one step:

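# Positions of the Occ_* and Totl_* columns (they must be in matching order)
occ_cols <- grep("^Occ", names(df))
tot_cols <- grep("^Totl", names(df))

# Vectorized over all column pairs at once
df[paste0("Probability_", seq_along(occ_cols))] <- 
      (df[occ_cols] + 1)/(df[tot_cols] + df$Unique_words)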

df
#      word Occ_1 Occ_2 Occ_3 Totl_1 Totl_2 Totl_3 Unique_words Probability_1
#1      car     0     1     0     11      9      7           17    0.03571429
#2   saturn     2     0     2     11      9      7           17    0.10714286
#3 survival     1     2     0     11      9      7           17    0.07142857
#4 baseball     1     1     0     11      9      7           17    0.07142857
#5    color     0     0     1     11      9      7           17    0.03571429
#6   muscle     0     1     0     11      9      7           17    0.03571429

#  Probability_2 Probability_3
#1    0.07692308    0.04166667
#2    0.03846154    0.12500000
#3    0.11538462    0.04166667
#4    0.07692308    0.04166667
#5    0.03846154    0.08333333
#6    0.07692308    0.04166667

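However, make sure that all your Occ.. and Totl.. columns are in the same order, because the columns are paired by position. In this example we have Occ_1, Occ_2, Occ_3 followed by Totl_1, Totl_2 and Totl_3.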
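One way to avoid depending on column order is to build both sets of column names from the numeric suffix. The following is only a minimal sketch on a small made-up data frame with deliberately shuffled columns; it assumes every Occ_i has a matching Totl_i and stops otherwise.

```r
# Made-up sample with the Occ_/Totl_ columns deliberately out of order
df <- data.frame(
  word = c("car", "saturn", "survival"),
  Totl_2 = c(9L, 9L, 9L),
  Occ_2 = c(1L, 0L, 2L),
  Occ_1 = c(0L, 2L, 1L),
  Totl_1 = c(11L, 11L, 11L),
  Unique_words = c(17L, 17L, 17L)
)

# Take the numeric suffixes from the Occ_ columns and sort them
occ_names <- grep("^Occ_", names(df), value = TRUE)
idx <- sort(as.integer(sub("^Occ_", "", occ_names)))

# Rebuild both name vectors from the same suffixes so the pairs line up
occ_cols <- paste0("Occ_", idx)
tot_cols <- paste0("Totl_", idx)
stopifnot(all(tot_cols %in% names(df)))

# Same vectorized arithmetic as above, now selecting columns by name
df[paste0("Probability_", idx)] <-
  (df[occ_cols] + 1) / (df[tot_cols] + df$Unique_words)
```

Selecting by name rather than by position means a reordered or partially sorted data frame still pairs Occ_1 with Totl_1.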

Answer 2

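I will propose a different approach from the other two answers. I think you are using the wrong data format here: your data is wide when it should be long. If you are not familiar with these terms, there are many explanations online. I think the best is this one.

Using the tidyr package, I would solve your problem as follows:

library(tidyverse)

The first step is to split the Occ and Totl columns into two data frames that we will merge later. With the gather function, I convert these columns into key-value pairs. We extract the number from the key so that Occ_1 can later be matched with Totl_1.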

# Long format for the Occ_* columns; "[0-9]+" keeps multi-digit suffixes such as Occ_10 intact
df_occ <- df %>%
  gather(group, occ, contains("Occ")) %>%
  select(word, group, occ) %>%
  mutate(group = str_extract(group, "[0-9]+") %>% as.integer())

# Same reshaping for the Totl_* columns
df_totl <- df %>%
  gather(group, totl, contains("Totl")) %>%
  select(word, group, totl) %>%
  mutate(group = str_extract(group, "[0-9]+") %>% as.integer())

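Once we have these two data frames, we merge them. We take the word and Unique_words columns from the original data frame, join the Occ data frame by word, and then join the Totl data frame by word and group. After that, the required calculation takes a single line of code.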

df_merge <- df %>%
  select(word, Unique_words) %>%
  left_join(df_occ, by = 'word') %>%
  left_join(df_totl, by = c('word', 'group')) %>%
  mutate(prob = (occ + 1) / (totl + Unique_words))

If you want to convert the result back to wide format, you can use spread, the inverse of gather.

df_wide <- df_merge %>%
  select(word, group, prob) %>%
  mutate(group = paste0("Prob_", group)) %>%
  spread(group, prob)

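Advantages of this approach:

Your code is clearer and easier to read: each operation is on its own line, and it avoids square brackets (which often make code hard to read).
Your code shows the intermediate steps.
The approach is more flexible and will hopefully simplify other processing steps as well.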
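In current tidyr releases, gather and spread have been superseded by pivot_longer and pivot_wider. The sketch below shows how the same wide-to-long idea might look with those functions. It runs on a small made-up data frame, and the names df_long and df_wide are only illustrative. It assumes the columns are named exactly Occ_<n> and Totl_<n>.

```r
library(dplyr)
library(tidyr)

# Made-up sample in the same shape as the question's data
df <- data.frame(
  word = c("car", "saturn", "survival"),
  Occ_1 = c(0L, 2L, 1L), Occ_2 = c(1L, 0L, 2L),
  Totl_1 = c(11L, 11L, 11L), Totl_2 = c(9L, 9L, 9L),
  Unique_words = c(17L, 17L, 17L)
)

# ".value" sends the part before "_" (Occ / Totl) to column names,
# while the part after "_" becomes the group key
df_long <- df %>%
  pivot_longer(
    cols = matches("^(Occ|Totl)_[0-9]+$"),
    names_to = c(".value", "group"),
    names_sep = "_"
  ) %>%
  mutate(
    group = as.integer(group),
    prob = (Occ + 1) / (Totl + Unique_words)
  )

# Back to wide format: one Prob_<n> column per group
df_wide <- df_long %>%
  select(word, group, prob) %>%
  pivot_wider(names_from = group, values_from = prob, names_prefix = "Prob_")
```

Because Occ and Totl end up side by side in one long data frame, no separate join step is needed in this variant.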

Answer 3

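What you are asking about is called vectorization: R's arithmetic operators already work on whole vectors, so you do not need sapply, and using them directly can significantly improve the performance of your code.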
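As a small illustration (the vector occ is made up and simply mirrors the Occ_1 column from the question), the element-wise sapply call and the plain vectorized expression produce the same values:

```r
# Made-up counts mirroring Occ_1 in the question
occ <- c(0L, 2L, 1L, 1L, 0L, 0L)

# Element by element, as in the question
via_sapply <- sapply(occ, function(x) x + 1)

# Vectorized: the operator is applied to the whole vector at once
via_vector <- occ + 1

# Check that both approaches agree
all.equal(via_sapply, via_vector)
```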

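But first, a tip for future questions: it is much easier to provide sample data using dput.

dput(df)

Anyone who wants to answer the question can then simply paste the output: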

df <- structure(list(word = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("baseball", 
"car", "color", "muscle", "saturn", "survival"), class = "factor"), 
    Occ_1 = c(0L, 2L, 1L, 1L, 0L, 0L), Occ_2 = c(1L, 0L, 2L, 
    1L, 0L, 1L), Occ_3 = c(0L, 2L, 0L, 0L, 1L, 0L), Totl_1 = c(11L, 
    11L, 11L, 11L, 11L, 11L), Totl_2 = c(9L, 9L, 9L, 9L, 9L, 
    9L), Totl_3 = c(7L, 7L, 7L, 7L, 7L, 7L), Unique_words = c(17L, 
    17L, 17L, 17L, 17L, 17L), Probability_1 = c(0.0357142857142857, 
    0.107142857142857, 0.0714285714285714, 0.0714285714285714, 
    0.0357142857142857, 0.0357142857142857), Probability_2 = c(0.0769230769230769, 
    0.0384615384615385, 0.115384615384615, 0.0769230769230769, 
    0.0384615384615385, 0.0769230769230769), Probability_3 = c(0.0416666666666667, 
    0.125, 0.0416666666666667, 0.0416666666666667, 0.0833333333333333, 
    0.0416666666666667)), row.names = c(NA, -6L), class = "data.frame")
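To show what dput produces, here is a minimal round trip on a made-up data frame. The text printed by dput is ordinary R code, so evaluating it recreates the object (dget, by contrast, reads such text from a file). The names small, txt and rebuilt are only illustrative.

```r
# Made-up data frame for the round trip
small <- data.frame(
  word = c("car", "saturn"),
  Occ_1 = c(0L, 2L),
  Totl_1 = c(11L, 11L)
)

# Capture the text that dput() would print to the console
txt <- capture.output(dput(small))

# Evaluating that text rebuilds the data frame
rebuilt <- eval(parse(text = txt))

# Compare the rebuilt object with the original
all.equal(small, rebuilt)
```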

Anyway, here is one way to do what you want:

df$Probability_1 <- (df$Occ_1 + 1) / (df$Totl_1 + df$Unique_words)
df$Probability_2 <- (df$Occ_2 + 1) / (df$Totl_2 + df$Unique_words)
df$Probability_3 <- (df$Occ_3 + 1) / (df$Totl_3 + df$Unique_words)

Or, if you prefer dplyr:

library("dplyr")
df_new <- df %>% 
  mutate(
    Probability_1 = (Occ_1 + 1) / (Totl_1 + Unique_words),
    Probability_2 = (Occ_2 + 1) / (Totl_2 + Unique_words),
    Probability_3 = (Occ_3 + 1) / (Totl_3 + Unique_words)        
  )

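Update

I missed the point of the question, which is about the large number of Occ and Totl variables. I would solve this with a for loop, which should still be quite efficient because each iteration works on whole columns: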

# Loop over the numeric suffixes of the Occ_ columns ("1", "2", ...)
for (i in gsub("^Occ_", "", grep("^Occ_", colnames(df), value = TRUE))) {
  df[[paste0("Probability_", i)]] <- 
    (df[[paste0("Occ_", i)]] + 1) / (df[[paste0("Totl_", i)]] + df$Unique_words)
}

Apply a mathematical calculation to all rows of a DF by column values
