Question

我有一个来自ChIP-seq实验的非常大且复杂的数据框。该数据框包含小DNA测序读数的结合位置和分数。输出来自程序spp。数据由染色体组成，名称（spp_output $ npl）给出所有染色体的列表。如果我选择染色体（比如说chr1）

的输出

> names(spp_output$npl$chr1)
 [1] "x"          "y"          "evalue"     "fdr"        "nt"        
 [6] "nbgt"       "enr"        "enr.ub"     "enr.mle"    "nbgt.5"    
[11] "enr.5"      "enr.ub.5"   "enr.mle.5"  "nbgt.10"    "enr.10"    
[16] "enr.ub.10"  "enr.mle.10"

我需要按分数（“y”）对每条染色体的数据集进行子集化。如果我

，它似乎不起作用

`spp_sub <- subset(spp_output, spp_output$npl$chr1$y > 40);`

即便如此，我还是要为每一个chr插上一条线。你会如何对这些数据进行分组？ str（spp_output）看起来像这样

 > str(bp)
List of 3890
 $ npl                 :List of 21
  ..$ chr1 :'data.frame':   5988 obs. of  17 variables:
  .. ..$ x         : num [1:5988] 3480459 3662576 3720354 4043865 4078829 ...
  .. ..$ y         : num [1:5988] 3.47 3.97 3.5 3.64 3.19 ...
  .. ..$ evalue    : num [1:5988] 162 88 159 135 283 ...
  .. ..$ fdr       : num [1:5988] 0.00296 0.00226 0.00305 0.00279 0.00445 ...
  .. ..$ nt        : int [1:5988] 6 5 5 0 5 4 3 5 1 1 ...
  .. ..$ nbgt      : int [1:5988] 1 1 1 3 2 0 4 2 0 0 ...
  .. ..$ enr       : num [1:5988] 3.033091 2.40555 2.40555 0.000391 1.656698 ...
  .. ..$ enr.ub    : num [1:5988] 207.74 176.66 176.66 3.79 47.37 ...
  .. ..$ enr.mle   : num [1:5988] 13.594 11.503 11.503 0.448 6.902 ...
  .. ..$ nbgt.5    : int [1:5988] 4 2 7 10 11 8 15 10 5 1 ...
  .. ..$ enr.5     : num [1:5988] 6.668665 8.28349 3.364963 0.000622 2.296721 ...
  .. ..$ enr.ub.5  : num [1:5988] 89.4 236.83 35.41 4.51 20.01 ...
  .. ..$ enr.mle.5 : num [1:5988] 22.657 34.509 11.503 0.747 7.502 ...
  .. ..$ nbgt.10   : int [1:5988] 7 4 16 11 20 17 30 14 19 11 ...
  .. ..$ enr.10    : num [1:5988] 8.69484 10.39636 3.29297 0.00113 2.68587 ...
  .. ..$ enr.ub.10 : num [1:5988] 81.36 154.57 25.68 8.13 19.91 ...
  .. ..$ enr.mle.10: num [1:5988] 27.19 38.34 10.46 1.36 8.42 ...
  ..$ chr10:'data.frame':   3907 obs. of  17 variables:
  .. ..$ x         : num [1:3907] 3030162 3047828 3077057 3086391 3163429 ...
  .. ..$ y         : num [1:3907] 3.26 6.06 3 4.43 3.6 ...
  .. ..$ evalue    : num [1:3907] 256 12 426 44 145 ...
  .. ..$ fdr       : num [1:3907] 0.00416 0.00103 0.00638 0.00139 0.00283 ...
  .. ..$ nt        : int [1:3907] 2 6 3 4 3 6 4 0 3 3 ...
  .. ..$ nbgt      : int [1:3907] 0 1 1 1 1 1 4 0 2 4 ...
  .. ..$ enr       : num [1:3907] 1.49 3.03 1.2 1.79 1.2 ...
  .. ..$ enr.ub    : num [1:3907] 17463 208 114 146 114 ...
  .. ..$ enr.mle   : num [1:3907] 15.69 13.59 7.32 9.41 7.32 ...
  .. ..$ nbgt.5    : int [1:3907] 11 7 12 8 10 4 12 4 8 9 ...
  .. ..$ enr.5     : num [1:3907] 0.519 4.347 0.964 2.167 1.135 ...
  .. ..$ enr.ub.5  : num [1:3907] 11.1 40.7 12.8 25.4 15.9 ...
  .. ..$ enr.mle.5 : num [1:3907] 3.41 13.59 4.39 8.3 5.23 ...
  .. ..$ nbgt.10   : int [1:3907] 21 13 17 10 18 15 12 11 19 22 ...
  .. ..$ enr.10    : num [1:3907] 0.567 5.158 1.401 3.577 1.328 ...
  .. ..$ enr.ub.10 : num [1:3907] 10.7 37.4 17.1 38.4 16.1 ...
  .. ..$ enr.mle.10: num [1:3907] 3.65 15.1 6.27 13.44 5.94 ...

......对于其余的染色体。非常感谢任何帮助或建议!!

Answer 1

这是Split-Apply-Combine工作的经典案例。

require(plyr)
llply(spp_output$npl, subset, y > 10)

Answer 2

您似乎只想从bp$npl中选择y值大于40的那些chrs。如果是这样，以下内容应该会给你想要的结果：

# find which chrs have a y value greater than 40
Indxs <- lapply(bp$npl, function(n) n$y > 40)

# subset just those
croppedList <- mapply(function(m, i) m[i, ], bp$npl, Indxs, SIMPLIFY=FALSE)

<小时/>

  ## BEFORE:

  > bp
  $npl
  $npl$chr1
         x  y evalue   fdr
  1  -0.63 77  -0.04 -0.06
  2   0.18 40  -0.02 -0.16
  3  -0.84 63   0.94 -1.47
  4   1.60 36   0.82 -0.48
  5   0.33 43   0.59  0.42
  6  -0.82 49   0.92  1.36
  7   0.49 30   0.78 -0.10
  8   0.74 49   0.07  0.39
  9   0.58 74  -1.99 -0.05
  10 -0.31 47   0.62 -1.38

  $npl$chr2
         x  y evalue   fdr
  1  -0.41 42   0.40  2.40
  2  -0.39 33  -0.61 -0.04
  3  -0.06 62   0.34  0.69
  4   1.10 74  -1.13  0.03
  5   0.76 69   1.43 -0.74
  6  -0.16 70   1.98  0.19
  7  -0.25 53  -0.37 -1.80
  8   0.70 50  -1.04  1.47
  9   0.56 71   0.57  0.15
  10 -0.69 60  -0.14  2.17

  ## AFTER: 

  > croppedList
  $chr1
         x  y evalue   fdr
  1  -0.63 77  -0.04 -0.06
  3  -0.84 63   0.94 -1.47
  5   0.33 43   0.59  0.42
  6  -0.82 49   0.92  1.36
  8   0.74 49   0.07  0.39
  9   0.58 74  -1.99 -0.05
  10 -0.31 47   0.62 -1.38

  $chr2
         x  y evalue   fdr
  1  -0.41 42   0.40  2.40
  3  -0.06 62   0.34  0.69
  4   1.10 74  -1.13  0.03
  5   0.76 69   1.43 -0.74
  6  -0.16 70   1.98  0.19
  7  -0.25 53  -0.37 -1.80
  8   0.70 50  -1.04  1.47
  9   0.56 71   0.57  0.15
  10 -0.69 60  -0.14  2.17

Answer 3

似乎问题在于您正在处理嵌套列表。这是一个类似于你的最小例子：

# Some sample data. This is a nested list
set.seed(123)
myList <- list(
  npl = list(
    chr1 = data.frame(x = rnorm(10),
                      y = sample(30, 10, replace = TRUE)),
    chr2 = data.frame(x = rnorm(10),
                      y = sample(30, 10, replace = TRUE))
  )
)

# Compare the structure
str(myList)
# List of 1
#  $ npl:List of 2
#   ..$ chr1:'data.frame':  10 obs. of  2 variables:
#   .. ..$ x: num [1:10] -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
#   .. ..$ y: int [1:10] 27 21 20 30 20 22 17 18 9 5
#   ..$ chr2:'data.frame':  10 obs. of  2 variables:
#   .. ..$ x: num [1:10] 1.787 0.498 -1.967 0.701 -0.473 ...
#   .. ..$ y: int [1:10] 2 14 24 4 17 7 4 23 27 12

那么，我们如何从所有这些子列表中的“y”列中提取某些值？将lapply与unlist一起使用recursive = FALSE，如下所示：

lapply(unlist(myList, recursive=FALSE), function(x) x[x$y > 5, ])
# $npl.chr1
#             x  y
# 1 -0.56047565 27
# 2 -0.23017749 21
# 3  1.55870831 20
# 4  0.07050839 30
# 5  0.12928774 20
# 6  1.71506499 22
# 7  0.46091621 17
# 8 -1.26506123 18
# 9 -0.68685285  9
# 
# $npl.chr2
#             x  y
# 2   0.4978505 14
# 3  -1.9666172 24
# 5  -0.4727914 17
# 6  -1.0678237  7
# 8  -1.0260044 23
# 9  -0.7288912 27
# 10 -0.6250393 12

要查看正在发生的情况，请尝试str(unlist(myList, recursive = FALSE))。您会看到该列表现已在某种程度上“扁平化”。这样，您就可以使用lapply直接转到单个data.frame。

在嵌套列表中使用R和复杂数据框进行子集

3 个答案: