Select the first two rows for each unique value of a variable in a data frame

Date: 2016-04-06 18:04:08

Tags: r

From the data frame below, I want to select the first two rows for each unique identifier in V5. I don't know where to start.

> head(Up,1000)
                     V1          V2                 V3   V4              V5 V6   V7   V8     V9 V10    V11 V12
1     ENSG00000124357.8        NAGK  ENST00000418807.3 9606 hsa-miR-106a-5p  3  114  121 -0.726  99 -0.726  99
2    ENSG00000131351.10       HAUS8  ENST00000253669.5 9606 hsa-miR-106a-5p  3   27   34 -0.714  99 -0.714  99
3     ENSG00000108702.3        CCL1  ENST00000225842.3 9606 hsa-miR-106a-5p  3  139  146 -0.670  99 -0.670  99
4    ENSG00000189159.11         HN1  ENST00000476258.1 9606 hsa-miR-123a-5p  3  107  114 -0.667  99 -0.666  99
5    ENSG00000154640.10        BTG3  ENST00000339775.6 9606 hsa-miR-123a-5p  3  167  174 -0.665  99 -0.665  99
6    ENSG00000087494.11       PTHLH  ENST00000395872.1 9606 hsa-miR-123a-5p  3  291  298 -0.629  99 -0.629  99
7     ENSG00000197885.6     NKIRAS1  ENST00000388759.3 9606 hsa-miR-155a-5p  3  141  148 -0.628  99 -0.628  99
8    ENSG00000146826.10     C7orf43  ENST00000394035.2 9606 hsa-miR-155a-5p  3  491  498 -0.614  99 -0.613  99
9    ENSG00000117616.13     C1orf63  ENST00000243189.7 9606 hsa-miR-155a-5p  3   37   44 -0.585  99 -0.585  99
10    ENSG00000144583.4      MARCH4  ENST00000273067.4 9606 hsa-miR-155a-5p -2 1353 1359 -0.575  99 -0.575  99
11    ENSG00000213928.4        IRF9  ENST00000396864.3 9606 hsa-miR-1323-5p  3  305  312 -0.567  99 -0.567  99
12    ENSG00000072849.6       DERL2  ENST00000572834.1 9606 hsa-miR-1323-5p  3  253  260 -0.566  99 -0.566  99
13   ENSG00000155366.12        RHOC  ENST00000339083.7 9606 hsa-miR-1323-5p  3  268  275 -0.554  99 -0.552  99
14    ENSG00000179431.5        FJX1  ENST00000317811.4 9606 hsa-miR-1323-5p  3  771  778 -0.550  99 -0.550  99
15   ENSG00000067057.12        PFKP  ENST00000381125.4 9606 hsa-miR-1323-5p  3   73   80 -0.547  99 -0.547  99
16    ENSG00000204923.3      FBXO48  ENST00000377957.3 9606 hsa-miR-1323-5p  3  159  166 -0.531  99 -0.531  99
17   ENSG00000120539.10       MASTL  ENST00000342386.6 9606 hsa-miR-1323-5p  3  246  253 -0.529  99 -0.529  99

Data

Up <- read.table(header = TRUE, stringsAsFactors = FALSE, text="V1          V2                 V3   V4              V5 V6   V7   V8     V9 V10    V11 V12
1     ENSG00000124357.8        NAGK  ENST00000418807.3 9606 hsa-miR-106a-5p  3  114  121 -0.726  99 -0.726  99
2    ENSG00000131351.10       HAUS8  ENST00000253669.5 9606 hsa-miR-106a-5p  3   27   34 -0.714  99 -0.714  99
3     ENSG00000108702.3        CCL1  ENST00000225842.3 9606 hsa-miR-106a-5p  3  139  146 -0.670  99 -0.670  99
4    ENSG00000189159.11         HN1  ENST00000476258.1 9606 hsa-miR-123a-5p  3  107  114 -0.667  99 -0.666  99
5    ENSG00000154640.10        BTG3  ENST00000339775.6 9606 hsa-miR-123a-5p  3  167  174 -0.665  99 -0.665  99
6    ENSG00000087494.11       PTHLH  ENST00000395872.1 9606 hsa-miR-123a-5p  3  291  298 -0.629  99 -0.629  99
7     ENSG00000197885.6     NKIRAS1  ENST00000388759.3 9606 hsa-miR-155a-5p  3  141  148 -0.628  99 -0.628  99
8    ENSG00000146826.10     C7orf43  ENST00000394035.2 9606 hsa-miR-155a-5p  3  491  498 -0.614  99 -0.613  99
9    ENSG00000117616.13     C1orf63  ENST00000243189.7 9606 hsa-miR-155a-5p  3   37   44 -0.585  99 -0.585  99
10    ENSG00000144583.4      MARCH4  ENST00000273067.4 9606 hsa-miR-155a-5p -2 1353 1359 -0.575  99 -0.575  99
11    ENSG00000213928.4        IRF9  ENST00000396864.3 9606 hsa-miR-1323-5p  3  305  312 -0.567  99 -0.567  99
12    ENSG00000072849.6       DERL2  ENST00000572834.1 9606 hsa-miR-1323-5p  3  253  260 -0.566  99 -0.566  99
13   ENSG00000155366.12        RHOC  ENST00000339083.7 9606 hsa-miR-1323-5p  3  268  275 -0.554  99 -0.552  99
14    ENSG00000179431.5        FJX1  ENST00000317811.4 9606 hsa-miR-1323-5p  3  771  778 -0.550  99 -0.550  99
15   ENSG00000067057.12        PFKP  ENST00000381125.4 9606 hsa-miR-1323-5p  3   73   80 -0.547  99 -0.547  99
16    ENSG00000204923.3      FBXO48  ENST00000377957.3 9606 hsa-miR-1323-5p  3  159  166 -0.531  99 -0.531  99
17   ENSG00000120539.10       MASTL  ENST00000342386.6 9606 hsa-miR-1323-5p  3  246  253 -0.529  99 -0.529  99")

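3 answers:

Answer 0 (score: 4)

If we need the first two rows for each value of the grouping variable 'V5', one option is data.table. Convert the 'data.frame' to a 'data.table' (setDT(Up), which converts Up by reference), group by 'V5', and use head to get the first 2 rows of each group: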

library(data.table)
setDT(Up)[, head(.SD, 2) , by = V5]

Or use slice from dplyr after grouping by 'V5':

library(dplyr)
Up %>%
   group_by(V5) %>%
   slice(1:2)

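As @Frank mentioned in the comments (regarding a bug), when the initial dataset is a data.table, the output shows additional NA rows if a particular 'V5' has fewer than 2 elements. With a data.frame, however, it works.

An option that works for both data.table and data.frame would be (from @Frank's comment):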

Up %>%
    group_by(V5) %>%
    slice(head(seq_len(n()),2)) 
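
The extra NA rows have a simple analogue in base R, which may help explain why truncating the index with head(seq_len(n()), 2) is the safer form. The sketch below is only an illustration of that analogue, not a reproduction of the reported bug; the one-row data frame `one_row` is made up and is not part of the question's data.

```r
# Hypothetical one-row group, standing in for a V5 value that occurs only once
one_row <- data.frame(V5 = "hsa-miR-106a-5p", V9 = -0.726)

# Positional indexing past the last row pads the result with an NA row
padded <- one_row[1:2, ]
nrow(padded)            # 2, the second row is all NA

# Truncating the index vector first keeps only rows that actually exist
idx <- head(seq_len(nrow(one_row)), 2)
trimmed <- one_row[idx, ]
nrow(trimmed)           # 1
```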

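Answer 1 (score: 0)

Using base R, the following gets you very close for your data.frame, Up. It assumes that rows with the same V5 value are adjacent, as they are in the question's data:
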
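# index of the first row of each V5 group
UpFirstTwoRows <- which(!duplicated(Up[, "V5"]))
# add the adjacent row; for a single-row group this is the next group's
# first row, which unique() then drops as a duplicate
UpFirstTwoRows <- sort(unique(c(UpFirstTwoRows, UpFirstTwoRows + 1)))
# guard against indexing past the last row when the final group has one row
UpFirstTwoRows <- UpFirstTwoRows[UpFirstTwoRows <= nrow(Up)]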

UpNew <- Up[UpFirstTwoRows,]
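
The index arithmetic above depends on rows with the same V5 value sitting next to each other. As a sketch of a base R alternative without that assumption, ave() can number the rows within each group, and the data frame can then be filtered on that counter. The toy data frame `df` and its columns `g` and `x` are made up for illustration.

```r
# Toy data: group "b" has a single row, and group "a" is not contiguous
df <- data.frame(g = c("a", "b", "a", "c", "c", "a", "c"),
                 x = 1:7,
                 stringsAsFactors = FALSE)

# Within-group row counter: 1, 2, 3, ... restarting for each value of g
pos <- ave(seq_len(nrow(df)), df$g, FUN = seq_along)

# Keep at most the first two rows of every group, in the original row order
first_two <- df[pos <= 2, ]
first_two$x   # 1 2 3 4 5
```

For the question's data, the same idea would read Up[ave(seq_len(nrow(Up)), Up$V5, FUN = seq_along) <= 2, ].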

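Answer 2 (score: 0)

Logic:

  • Slice the data.frame by the unique values of the given column.
  • Take the desired number of leading rows from each slice.

ARGUMENTS

  1. data: a data frame
  2. col: name of the column in data
  3. num.rows: numeric, the number of rows to keep from each slice after splitting the data by unique values
  4. list: logical; if TRUE, return a list, otherwise return a data.frame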

        fun <- function(data, col, num.rows, list = TRUE){
           d <- unique(data[[col]])
           # head() returns fewer rows for small groups instead of padding with NA rows
           dev.data <- lapply(d, FUN = function(i) { head(data[data[[col]] == i, ], num.rows) })
           if(!isTRUE(list)) { 
             return(do.call(rbind, dev.data)) 
           } else {return(dev.data)}
         } 
    
  5. Example

    fun(iris, "Species", 2, FALSE)
    fun(iris, "Species", 3, TRUE)
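
The slice-then-take logic described in this answer can also be sketched with split() and head() from base R. This is an alternative formulation rather than the answer's own function, and `first_n_per_group` is a hypothetical name. One difference to be aware of: split() orders the groups by sorted value (or factor level), not by first appearance.

```r
# Hypothetical helper: first n rows for each unique value of one column
first_n_per_group <- function(data, col, n) {
  pieces <- split(data, data[[col]])      # one data frame per unique value
  trimmed <- lapply(pieces, head, n)      # head() never pads short groups
  out <- do.call(rbind, trimmed)          # stack the pieces back together
  rownames(out) <- NULL                   # drop the "group.row" names rbind adds
  out
}

res <- first_n_per_group(iris, "Species", 2)
nrow(res)                     # 6
```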