How can I efficiently find a pattern in one column and assign a corresponding value to another column across a list of data frames?

Time: 2015-04-23 08:07:21

Tags: r

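I have a list of 15 data frames, each with 13 columns (time + 6 stations with 3 layers each) and 172 rows. I want to collapse those columns (the observations at the stations) into basically two columns (one for the station, one for the observation) by applying a function over the whole list. For this I use gather from tidyr. In addition, I want to find a pattern (upper, middle or lower layer) in one of the columns and assign a new value (depth) in a new column. For that I use ddply from plyr together with grepl. My problem is that it is very slow. I suspect I have created a bottleneck with my limited knowledge of R. So where is the bottleneck, and how can I improve it?

An example: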

  data <- list(a = data.frame(time = (1:180), alpha.upper = sample(1:180),
               beta.middle = sample(1:180), gamma.lower = sample(1:180)), 
               b = data.frame(time = (1:180), alpha.upper = sample(1:180), 
               beta.middle = sample(1:180), gamma.lower = sample(1:180)))

> data
$a
    time alpha.upper beta.middle gamma.lower
 1      1         133         179          99
 2      2         175         147          56
 3      3         169           9          24
 4      4         116         129          75
 5      5          92          65          65
 6      6         141          73          49
$b
    time alpha.upper beta.middle gamma.lower
1      1         111           2          89
2      2          84          81         159
3      3          93          82          84
4      4          44          58         125
5      5          31          33         131
6      6           1         120          63

My code is:

library(tidyr)  # gather
library(plyr)   # ddply and arrange
data2<-lapply(data, function(x) {
               x<-gather(x,stn,value,-time)
               x<-arrange(x,time)
               x<-ddply(x,c("time","stn","value"), function(x) {
                 if (grepl(".upper",x$stn) == TRUE)
                 {
                   x$depth<-1
                   return(x)
                 }
                 if (grepl(".lower",x$stn) == TRUE)
                 {
                   x$depth<-3
                   return(x)
                 }
                 if (grepl(".middle",x$stn) == TRUE)
                 {
                   x$depth<-2
                   return(x)
                 }
               })
               return(x)
             })

The result should be:

> data2
$a
    time         stn value depth
1      1 alpha.upper   111     1
2      1 beta.middle     2     2
3      1 gamma.lower    89     3
4      2 alpha.upper    84     1
5      2 beta.middle    81     2
6      2 gamma.lower   159     3
$b
    time         stn value depth
1      1 alpha.upper    38     1
2      1 beta.middle   151     2
3      1 gamma.lower    93     3
4      2 alpha.upper    61     1
5      2 beta.middle    56     2
6      2 gamma.lower    66     3  

1 Answer:

Answer 0 (score: 0)

First, let's reproduce your data.

dataa <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 133 179 99
2 175 147 56
3 169 9 24
4 116 129 75
5 92 65 65
6 141 73 49", header = TRUE, sep = " ")

datab <- read.table(text =
"time alpha.upper beta.middle gamma.lower
1 111 2 89
2 84 81 159
3 93 82 84
4 44 58 125
5 31 33 131
6 1 120 63", header = TRUE, sep = " ")

mydata <- list(a = dataa, b = datab)
# $a
#   time alpha.upper beta.middle gamma.lower
# 1    1         133         179          99
# 2    2         175         147          56
# 3    3         169           9          24
# 4    4         116         129          75
# 5    5          92          65          65
# 6    6         141          73          49
# $b
#   time alpha.upper beta.middle gamma.lower
# 1    1         111           2          89
# 2    2          84          81         159
# 3    3          93          82          84
# 4    4          44          58         125
# 5    5          31          33         131
# 6    6           1         120          63

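Here I named the variable mydata, because the standard package utils contains a function called data, so it is better not to use that name for a variable.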
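A small sketch of why the name is worth avoiding. It only uses base R; the variable names is_fun, home and masked are made up for this illustration.

```r
# data is a function that lives in the utils package
is_fun <- is.function(utils::data)
home   <- environmentName(environment(utils::data))

# A variable with the same name hides the function for ordinary value lookups
data   <- list(a = 1)
masked <- is.function(data)   # this now refers to the list, not the function
rm(data)                      # remove the variable again

print(c(is_fun = is_fun, masked = masked))
print(home)
```

R can still find the function when data is used in call position, so the clash is more a source of confusion than a hard error.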

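As far as I understand, you need to convert each data.frame in the list from "wide" to "long" form. You can use gather from the tidyr package, which in my opinion is a smart choice, but here I show how to get the same result with base R tools.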

rebuilddf <- function(df)
{ # first of all see the difference between rep(1:3, each = 3) and rep(1:3, times = 3)

  res_df <- data.frame(
    time = rep(df$time, each = 3),# first column of new data.frame -
                                  # we repeat each time mark 3 times
                                  # as we know that there are exactly 3
                                  # observations for single time: upper, middle, lower

    stn = rep(colnames(df)[-1], times = nrow(df)), # second column
                                  # fill it with words "alpha.upper",
                              # "beta.middle", "gamma.lower" which are colnames(df)[-1]
                                  # repeated nrow(df) times 

    value = as.vector(t(as.matrix(df[,-1]))) ) # 
                                  # numbers of 2:4 columns of our data.frame are
                                  # transposed and then arranged in a vector
                                  # the result is like reading it row by row
  # to understand what's happening with the matrix you can try this code
  # m <- matrix(1:20, nrow = 4)
  #      [,1] [,2] [,3] [,4] [,5]
  # [1,]    1    5    9   13   17
  # [2,]    2    6   10   14   18
  # [3,]    3    7   11   15   19
  # [4,]    4    8   12   16   20
  # as.vector(t(m))
  # 1  5  9 13 17  2  6 10 14 18  3  7 11 15 19  4  8 12 16 20

  # after that we add column "depth"
  # as I got it, we need 1 for "upper", 2 for "middle" and 3 for "lower"
  # we make it with the help of two nested ifelse functions
  res_df <- transform(res_df, depth = ifelse(stn == "alpha.upper", 1, 
                                             ifelse(stn == "beta.middle", 2, 3)) )
  return(res_df)
}
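To make the reshaping step easier to follow, here is a minimal sketch on a made-up data frame (toy and its values are assumptions for illustration, not data from the question). It also replaces the hard-coded 3 with the number of station columns, which may help when a data frame has more than three of them.

```r
# Toy wide data frame with two time steps (values are invented)
toy <- data.frame(time = 1:2, alpha.upper = c(10, 20), beta.middle = c(30, 40))

# each repeats every element in place, times repeats the whole vector
each_rep  <- rep(1:3, each = 3)    # 1 1 1 2 2 2 3 3 3
times_rep <- rep(1:3, times = 3)   # 1 2 3 1 2 3 1 2 3

n_stn <- ncol(toy) - 1             # number of station columns (all but time)

long_time  <- rep(toy$time, each = n_stn)                # 1 1 2 2
long_stn   <- rep(colnames(toy)[-1], times = nrow(toy))  # names cycle per time step
long_value <- as.vector(t(as.matrix(toy[, -1])))         # read row by row: 10 30 20 40

long <- data.frame(time = long_time, stn = long_stn, value = long_value)
print(long)
```

The three vectors line up because time is repeated in place while the station names cycle, which matches the row-by-row order produced by transposing the value matrix.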

If the column names are not always the same and only the ending of each name is fixed, we can change the condition for depth as follows:

# last dot-separated part of each station name: "upper", "middle" or "lower"
# strsplit splits strings such as "some.name1.upper" or
# "some.other.colname.lower" at every dot; rev reverses the parts,
# so [1] picks the last one. sapply does this for every row, because
# strsplit(stn, "[.]")[[1]] alone would only look at the first station name
suffix <- sapply(strsplit(as.character(res_df$stn), "[.]"),
                 function(parts) rev(parts)[1])
res_df <- 
  transform(res_df, 
            depth = ifelse(suffix == "upper", 
                           1, 
                           ifelse(suffix == "middle", 2, 3)
                           ) )

You could also change the condition to use grepl, but I believe strsplit will be faster.
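For comparison, here is a hedged sketch of two other vectorized ways to get depth from the end of the name. The station names and the lookup table depth_by_layer are made up for illustration; the depth values 1, 2, 3 are the ones from the question. I have not benchmarked these against strsplit.

```r
# Hypothetical station names; only the part after the last dot matters
stn <- c("alpha.upper", "beta.middle", "gamma.lower", "some.other.colname.lower")

# Lookup table from layer name to depth
depth_by_layer <- c(upper = 1, middle = 2, lower = 3)

# Variant 1: drop everything up to the last dot, then index the lookup table
layer        <- sub(".*[.]", "", stn)
depth_lookup <- unname(depth_by_layer[layer])

# Variant 2: vectorized grepl; "[.]" matches a literal dot, "$" anchors at the end
depth_grepl <- ifelse(grepl("[.]upper$", stn), 1,
               ifelse(grepl("[.]middle$", stn), 2, 3))

print(depth_lookup)
print(depth_grepl)
```

Note that a bare "." in a pattern such as ".upper" matches any character, so escaping the dot (or passing fixed = TRUE) is the safer choice.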

Now that the rebuilddf function is finished, let's see what it does.

lapply(mydata, rebuilddf)

# $a
#    time         stn value depth
# 1     1 alpha.upper   133     1
# 2     1 beta.middle   179     2
# 3     1 gamma.lower    99     3
# 4     2 alpha.upper   175     1
# 5     2 beta.middle   147     2
# 6     2 gamma.lower    56     3
# 7     3 alpha.upper   169     1
# 8     3 beta.middle     9     2
# 9     3 gamma.lower    24     3
# 10    4 alpha.upper   116     1
# 11    4 beta.middle   129     2
# 12    4 gamma.lower    75     3
# 13    5 alpha.upper    92     1
# 14    5 beta.middle    65     2
# 15    5 gamma.lower    65     3
# 16    6 alpha.upper   141     1
# 17    6 beta.middle    73     2
# 18    6 gamma.lower    49     3
# 
# $b
#    time         stn value depth
# 1     1 alpha.upper   111     1
# 2     1 beta.middle     2     2
# 3     1 gamma.lower    89     3
# 4     2 alpha.upper    84     1
# 5     2 beta.middle    81     2
# 6     2 gamma.lower   159     3
# 7     3 alpha.upper    93     1
# 8     3 beta.middle    82     2
# 9     3 gamma.lower    84     3
# 10    4 alpha.upper    44     1
# 11    4 beta.middle    58     2
# 12    4 gamma.lower   125     3
# 13    5 alpha.upper    31     1
# 14    5 beta.middle    33     2
# 15    5 gamma.lower   131     3
# 16    6 alpha.upper     1     1
# 17    6 beta.middle   120     2
# 18    6 gamma.lower    63     3

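I would like to believe this is the output you wanted, although the expected result in your question seems to show the values of data frame b under a, and vice versa.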
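As for the bottleneck itself, a likely cause (my assumption, not something measured on your data) is that ddply is asked to split by time, stn and value, so the inner function ends up being called roughly once per row. The sketch below imitates that in base R, with vapply standing in for the per-row calls, and compares it with a single vectorized call. The helper depth_one and the sample size are made up for illustration, and the timings will differ from machine to machine.

```r
set.seed(1)
# Made-up station column with 20000 entries
stn <- sample(c("alpha.upper", "beta.middle", "gamma.lower"), 20000, replace = TRUE)

# One grepl call chain per element, similar in spirit to one call per group
depth_one <- function(s) {
  if (grepl("[.]upper$", s)) 1 else if (grepl("[.]middle$", s)) 2 else 3
}

t_loop <- system.time(
  depth_loop <- vapply(stn, depth_one, numeric(1), USE.NAMES = FALSE)
)

# The same logic applied to the whole vector at once
t_vec <- system.time(
  depth_vec <- ifelse(grepl("[.]upper$", stn), 1,
               ifelse(grepl("[.]middle$", stn), 2, 3))
)

# Both approaches must agree before the timings mean anything
stopifnot(identical(depth_loop, depth_vec))
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

If the vectorized version turns out clearly faster on your machine, the same idea applies to your code: compute depth once on the whole stn column after gather instead of inside ddply.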