Question

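I am trying to change the values of entries in a data frame column based on a series of conditions. I need to change the "group" value of the top (or bottom) 10 entries of a certain type.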

My data is in a data frame that looks like this:

> head(diff_df_min)
  external_gene_name   gene_biotype  Fold  p.value     group
1      RP11-431K24.1        lincRNA -4.13 4.86e-06 signif_fc
2              UBE4B protein_coding  2.42 3.91e-06 signif_fc
3             UBIAD1 protein_coding  2.74 5.58e-05 signif_fc
4             PTCHD2 protein_coding  3.37 2.68e-06 signif_fc
5             DRAXIN protein_coding  3.04 1.42e-06 signif_fc
6             VPS13D protein_coding  4.26 1.60e-07 signif_fc

> dim(diff_df_min)
[1] 1824    5

I have worked out this solution with dplyr:

diff_df_min %>%
        filter(gene_biotype == "protein_coding") %>% # subset for protein coding genes
        arrange(-Fold, p.value) %>% # Sort by Fold change, then by p value
        slice(1:10) %>% # take the top 10 entries... 
        mutate(group = "top_signif_fc") # ... and change the "group" column value to "top_signif_fc"

This gives exactly the result I want:

   external_gene_name   gene_biotype Fold  p.value         group
1               CROCC protein_coding 5.46 3.44e-14 top_signif_fc
2               KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
3             PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
4                RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
5             HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
6              SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
7               DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
8            ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
9              RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

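Except that this does not update the values in the original data frame; it only shows the result after the functions have been applied. Similarly, I tried to do the same thing with data.table and came up with this: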

setDT(diff_df_min,key = "external_gene_name")
diff_df_min[gene_biotype == "protein_coding"][order(-Fold, p.value), head(.SD, 10)][,group := "top_signif_fc"]

But this only RETURNS the result; it does not update the original data frame.

    external_gene_name   gene_biotype Fold  p.value         group
 1:              CROCC protein_coding 5.46 3.44e-14 top_signif_fc
 2:              KCNA2 protein_coding 5.43 2.08e-11 top_signif_fc
 3:            PITPNC1 protein_coding 5.32 8.16e-11 top_signif_fc
 4:               RRP8 protein_coding 5.31 1.01e-10 top_signif_fc
 5:            HEPACAM protein_coding 5.27 1.26e-10 top_signif_fc
 6:             SGK223 protein_coding 5.14 3.45e-15 top_signif_fc
 7:              DDX3Y protein_coding 5.03 1.82e-09 top_signif_fc
 8:           ARHGAP10 protein_coding 4.99 2.83e-09 top_signif_fc
 9:             RNF180 protein_coding 4.98 3.19e-09 top_signif_fc
10:              CSPG5 protein_coding 4.97 9.92e-12 top_signif_fc

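You can see this by checking the values in the data frame after running either of these commands (or after re-running a subset of the commands):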

> diff_df_min[which(diff_df_min['external_gene_name'] == "CROCC"),]
    external_gene_name   gene_biotype Fold  p.value     group
372              CROCC protein_coding 5.46 3.44e-14 signif_fc

Of course, if you try to use either of these methods with:

diff_df_min <- ...

you end up overwriting the original data frame with only the 10 rows selected by dplyr or data.table.

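I have done similar things in base R before, but I cannot get this case to work. After some experimenting I ended up with this, which is ridiculous and does not work correctly: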

diff_df_min[with(diff_df_min[which(diff_df_min['gene_biotype'] == "protein_coding"),], order(-Fold, p.value) ),"group"][1:top_gene_number] <- "top_signif_fc"

^^ The indices get mixed up along the way, so the entries that end up being changed are not the intended ones.

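So far I have read dozens and dozens of pages, including many tutorials and even this, but I have not been able to find anything that actually provides a solution. I don't want to simply print out a modified data frame; I want to update the entries of the original data frame with the new values.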

Answer 1

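Instead of using slice, which subsets the data, we can use an ifelse statement to make the change, and replace the filter (which removes rows) with an arrange that also sorts on whether gene_biotype is "protein_coding". The output can then be assigned back to the original dataset or to a new one: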

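library(dplyr)

diff_df_minNew <- diff_df_min %>%
                     # put protein-coding rows first, then sort by descending Fold, then by p.value
                     arrange(desc(gene_biotype == "protein_coding"), 
                                desc(Fold), p.value) %>% 
                     # relabel only the first 10 rows; every other row keeps its current group
                     mutate(group = ifelse(row_number() < 11, "top_signif_fc", group))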
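To see what the arrange plus ifelse combination does, here is a minimal sketch on a toy data frame. The object names `toy`, `toy_new` and `top_n_genes`, and all the values in the table, are made up for illustration and are not from the question; the top 2 rows are relabelled instead of the top 10 so the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data standing in for diff_df_min (values are invented)
toy <- data.frame(
  external_gene_name = c("A", "B", "C", "D", "E"),
  gene_biotype = c("lincRNA", "protein_coding", "protein_coding",
                   "protein_coding", "lincRNA"),
  Fold = c(9.0, 2.4, 5.5, 3.4, 1.0),
  p.value = c(1e-6, 4e-6, 3e-14, 3e-6, 1e-3),
  group = "signif_fc",
  stringsAsFactors = FALSE
)

top_n_genes <- 2  # assumed cutoff for this example (the question uses 10)

toy_new <- toy %>%
  # protein-coding rows first, then descending Fold, then ascending p.value
  arrange(desc(gene_biotype == "protein_coding"), desc(Fold), p.value) %>%
  # only the first top_n_genes rows are relabelled; all rows are kept
  mutate(group = ifelse(row_number() <= top_n_genes, "top_signif_fc", group))

print(toy_new)
```

Gene "A" has the largest Fold overall but is a lincRNA, so it sorts after the protein-coding rows and keeps its original group. Note that this relies on there being at least `top_n_genes` protein-coding rows; with fewer, rows of other biotypes would be relabelled as well.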

The corresponding option using data.table would be

library(data.table)
diff_df_minNew2 <- setDT(diff_df_min)[order(-(gene_biotype=="protein_coding"),
      -Fold, p.value)][seq_len(10), group := "top_signif_fc"][]
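The chained call above sorts into a new table, so the result comes back in sorted order. If the goal is to update the existing table by reference and leave its row order alone, one possible variant is to compute the row numbers first and then assign with `:=` on those rows. This is only a sketch under assumptions: `toy`, `top_n_genes`, `ord` and `idx` are hypothetical names, and the data values are invented.

```r
library(data.table)

# Hypothetical toy data standing in for diff_df_min (values are invented)
toy <- data.table(
  external_gene_name = c("A", "B", "C", "D", "E"),
  gene_biotype = c("lincRNA", "protein_coding", "protein_coding",
                   "protein_coding", "lincRNA"),
  Fold = c(9.0, 2.4, 5.5, 3.4, 1.0),
  p.value = c(1e-6, 4e-6, 3e-14, 3e-6, 1e-3),
  group = "signif_fc"
)

top_n_genes <- 2  # assumed cutoff for this example (the question uses 10)

# Row numbers of the original table in the desired sort order:
# protein-coding rows first, then descending Fold, then ascending p.value
ord <- toy[, order(-(gene_biotype == "protein_coding"), -Fold, p.value)]

# Keep the first top_n_genes positions, and guard against the case where
# fewer protein-coding rows exist than requested
idx <- head(ord, top_n_genes)
idx <- idx[toy$gene_biotype[idx] == "protein_coding"]

# Update those rows by reference; no reassignment and no reordering
toy[idx, group := "top_signif_fc"]

print(toy)
```

Because `:=` modifies `toy` in place, nothing has to be assigned back, and the rows stay in their original order.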
