How do I define a calculation on a data frame based on values in another data frame?

Time: 2017-11-14 11:29:58

Tags: r dataframe filtering

I have to calculate a coefficient from a data set stored in a data frame (A) of size 4936 obs. x 1025 var.

The first row [1] holds the time in seconds, and each following row is a sample collected at a different location. A sample of data frame A:

#        V1   V2   V3   V4
# [1,] 26.4 26.5 26.6 26.7
# [2,]  -15   -5    2    3
# [3,]    6   -7    5    8
# [4,]    9    4    4   -2

In another data frame (B), I store the time at which the calculation should start for each row of A. An example of data frame B:

#      time
# [1,] 26.4
# [2,] 26.6
# [3,] 26.5

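To simplify, let the coefficient be the sum of the data collected at one location (data frame A), restricted to samples taken at or after that location's start time (data frame B). For the example above, the calculation should be as follows: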

sum1=-15+(-5)+2+3
sum2=5+8
sum3=4+4+(-2)

I want to store the results of the calculation in a new data frame, like this:

#       Sum
# [1,]  -15
# [2,]   13
# [3,]    6

How can I link the calculation between the two data frames based on the values stored in the second one?

3 Answers:

Answer 0: (score: 4)

A solution that uses sapply to iterate over the rows and select columns by collection time:

# Time from original table
foo <- df1[1, ]
# Time from table B
time <- c(26.4, 26.6, 26.5)

# Remove time row from original table
df1 <- df1[-1, ]

# Iterate over the rows and sum the columns where foo >= time
sapply(seq_along(time), function(x)
    sum(df1[x, which(foo >= time[x])])
)

# [1] -15  13   6
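
The snippet above assumes that df1 already holds the question's data. A minimal self-contained sketch, with the sample values typed in by hand and the names start, obs_time, values and result chosen here purely for illustration, might look like this:

```r
# Sample data from the question; df1 follows the naming used above
df1 <- data.frame(
  V1 = c(26.4, -15, 6, 9),
  V2 = c(26.5, -5, -7, 4),
  V3 = c(26.6, 2, 5, 4),
  V4 = c(26.7, 3, 8, -2)
)
start <- c(26.4, 26.6, 26.5)   # start times from data frame B (assumed name)

obs_time <- unlist(df1[1, ])   # first row holds the time stamps
values <- df1[-1, ]            # remaining rows hold the samples

# For each row, sum the samples taken at or after that row's start time
sums <- sapply(seq_along(start), function(i) {
  sum(values[i, obs_time >= start[i]])
})

# Store the sums in a new data frame, as the question asks
result <- data.frame(Sum = sums)
```

Keeping the time row in its own vector avoids overwriting df1 in place, so the block can be rerun without losing the first row.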

Answer 1: (score: 2)

I came across this already answered question and felt urged to propose an alternative solution.

  • Reading the title immediately made me think of join or merge.
  • The OP claims to use data frames but the printed output seems to originate from matrices.
  • The data is stored transposed: The time series are stored row-wise horizontally where the first row contains no observations but the time in seconds. This is considered untidy.

None of the other answers bothered to question these oddities although they made the proposed solutions more complex.

Reshaping the data

As a wild guess, the data seem to have been collected in an Excel sheet. However, for efficient processing we need the data to be stored column-wise and preferably in long format:

library(data.table)
long <- as.data.table(t(A))[
  , setnames(.SD, "V1", "time")][
    , melt(.SD, id.vars = "time", variable.name = "site_id")][
      , site_id := as.integer(site_id)][]

long
    time site_id value
 1: 26.4       1   -15
 2: 26.5       1    -5
 3: 26.6       1     2
 4: 26.7       1     3
 5: 26.4       2     6
 6: 26.5       2    -7
 7: 26.6       2     5
 8: 26.7       2     8
 9: 26.4       3     9
10: 26.5       3     4
11: 26.6       3     4
12: 26.7       3    -2
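
For readers without data.table, the same long layout can probably be built in base R. This is only a sketch under the assumption that A is a numeric matrix whose first row holds the times; the names obs and long_df are made up for this example.

```r
# Sample matrix from the question: first row = times, other rows = sites
A <- rbind(
  c(26.4, 26.5, 26.6, 26.7),
  c(-15, -5, 2, 3),
  c(6, -7, 5, 8),
  c(9, 4, 4, -2)
)

times <- A[1, ]               # time stamps in seconds
obs <- A[-1, , drop = FALSE]  # one row of observations per site

# One row per (site, time) pair; t() makes the values run site by site
long_df <- data.frame(
  time    = rep(times, times = nrow(obs)),
  site_id = rep(seq_len(nrow(obs)), each = length(times)),
  value   = as.vector(t(obs))
)
```

The row order matches the long table printed above: all four times for site 1, then site 2, then site 3.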

Aggregating in a non-equi join

Now, the OP has requested to aggregate the observations for each site, but only observations taken at or after a specific time are to be included. A data frame B with the starting times for each site is supplied.

The observations in long can be combined with the starting times in B as follows:

B <- data.table(
  site_id = 1:3,
  time = c(26.4, 26.6, 26.5))

B
   site_id time
1:       1 26.4
2:       2 26.6
3:       3 26.5

# aggregating in a non-equi join grouped by the join conditions
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value)] 
   site_id time  V1
1:       1 26.4 -15
2:       2 26.6  13
3:       3 26.5   6
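
As a cross-check, the same logic can be sketched in base R as an ordinary merge followed by a filter and an aggregate. This is not a non-equi join, only an equivalent three-step version for small data; the names long_df, B_df and the column name start are assumptions made for this example.

```r
# Long-format observations, typed in from the long table above
long_df <- data.frame(
  site_id = rep(1:3, each = 4),
  time    = rep(c(26.4, 26.5, 26.6, 26.7), times = 3),
  value   = c(-15, -5, 2, 3, 6, -7, 5, 8, 9, 4, 4, -2)
)

# Start time per site; the column is called start to avoid a name clash
B_df <- data.frame(site_id = 1:3, start = c(26.4, 26.6, 26.5))

# Step 1: equi-join on site_id so every observation carries its start time
merged <- merge(long_df, B_df, by = "site_id")

# Step 2: keep only observations at or after the start time
kept <- merged[merged$time >= merged$start, ]

# Step 3: sum the remaining values per site
sums_by_site <- aggregate(value ~ site_id, data = kept, FUN = sum)
```

The merge step materialises every observation with its start time before filtering, which is fine here but may use noticeably more memory than the data.table join on a table of the size mentioned in the question.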

Edit: Limit the number of observations in the aggregation

The OP has asked in a comment and in another question how to limit the number of observations to be aggregated after the starting time. This can be achieved by a slight modification:

max_values <- 2L
long[B, on = .(site_id, time >= time), by = .EACHI, sum(value[1:max_values])]  
   site_id time  V1
1:       1 26.4 -20
2:       2 26.6  13
3:       3 26.5   8

Note that max_values is set to 2L here for illustration.
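
One caveat that may matter in practice: indexing with 1:max_values pads with NA when fewer than max_values observations remain after the starting time, so the sum becomes NA. A small sketch with made-up values shows the effect and one possible workaround using head():

```r
value <- c(5, 8)    # only two observations remain after the start time
max_values <- 3L    # hypothetical limit larger than what is available

# Indexing past the end of a vector pads the result with NA
padded <- value[1:max_values]
sum_indexed <- sum(padded)

# head() stops at the end of the vector, so no NA is introduced
sum_head <- sum(head(value, max_values))
```

Here sum_indexed is NA while sum_head is 13. Whether a truncated window should count at all is a modelling decision, so head() is only one option.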

Answer 2: (score: 0)

A solution using a simple for loop:

# recreate your data
V1 <- c(26.4, -15, 6, 9)
V2 <- c(26.5, -5, -7, 4)
V3 <- c(26.6, 2, 5, 4)
V4 <- c(26.7, 3, 8, -2)

A <- data.frame(V1, V2, V3, V4)
B <- data.frame(time = c(26.4, 26.6, 26.5))

# initialize an empty vector to store the sums in
sum_frame <- numeric()

# for each row of B, sum row i + 1 of A from the matching time column onwards
for (i in seq_len(nrow(B))) {
  sum_frame[i] <- sum(A[(i + 1), (which(A[1, ] == B$time[i])):ncol(A)])
}

# turning sum-vector into a dataframe
sum_frame <- data.frame(sums = sum_frame)

Output:

> sum_frame
  sums
1  -15
2   13
3    6
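
The loop locates the starting column with an exact == comparison on decimal times. That works for the sample because both tables were typed from the same literals, but computed or imported times can differ in the last bits. A hedged variant that matches within a tolerance is sketched below; the tolerance tol and the helper names are assumptions, and it presumes every start time in B occurs in the first row of A.

```r
A <- data.frame(
  V1 = c(26.4, -15, 6, 9),
  V2 = c(26.5, -5, -7, 4),
  V3 = c(26.6, 2, 5, 4),
  V4 = c(26.7, 3, 8, -2)
)
B <- data.frame(time = c(26.4, 26.6, 26.5))

tol <- 1e-8                   # hypothetical tolerance for matching times
obs_time <- unlist(A[1, ])    # time stamps from the first row of A

sums <- vapply(seq_len(nrow(B)), function(i) {
  # first column whose time equals the start time up to the tolerance
  first_col <- which(abs(obs_time - B$time[i]) < tol)[1]
  sum(A[i + 1, first_col:ncol(A)])
}, numeric(1))

sum_frame <- data.frame(sums = sums)
```

vapply also fixes the return type to one number per row, which makes a missing match fail loudly instead of silently producing a result of the wrong shape.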