Question

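I have a data set with several variables, and I want to sum the values of two columns row by row. If the sum is below a set threshold, I want to replace the value in the second column (the one being summed) with that sum. I also need to do this by group.

My data frame has 18 columns, including "Closed_Grid", "Closed_Sets", "BestAvail", "Best_Sets", and "Best_Distance". The "BestAvail", "Best_Sets", and "Best_Distance" columns are repeated for "Second Best", "Third", "Fourth", and "Fifth". I am using this information to determine a final destination (column 18, "Dest_Grid"), which is filled with the grid index from "BestAvail", "2nd_Best", etc., based on a conditional sum of "Closed_Sets" and the destination's sets (Best, Second, etc.). In the end, if the sum of the two columns is <= 150, that grid cell ("BestAvail") becomes the "Dest_Grid". If the sum is > 150, it moves on to the next block and computes a new sum between "Closed_Sets" and "2nd_Best", and so on until every "Closed_Sets" has a "Dest_Grid".

To keep things simple, a sample (and subset) of my data set looks like this: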

Closed_Grid Closed_Sets BestAvail Best_Sets
GY38         72.875     GX38       91.75
GY37         87.125     GX38       91.75
GY36         39.875     GX38       91.75
GZ38         29         GX38       91.75
GZ37         80         GX38       91.75
GY35         2.375      GX38       91.75
GZ36         125.25     GX38       91.75
GZ35         29.875     GX38       91.75
GY39         17.5       GX39       54.125
HA35         34.375     GZ33       30.5
GZ41         109.625    GZ42       76.75
GY41         82.28571   GZ42       76.75
HA41         87.5       GZ42       76.75
GZ40         104.75     GZ42       76.75
GY40         60.625     GZ42       76.75
HA40         79.875     GZ42       76.75
GZ39         51.57143   GZ42       76.75
HA39         71         GZ42       76.75

I first arranged the data by "BestAvail" and distance (smallest to largest):

Destination <- Destination %>% arrange(BestAvail, BestDistance)

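The order matters because the Closed_Grid with the shortest distance to a BestAvail has priority to move into that grid.

So now I want to sum "Closed_Sets" and "Best_Sets" row by row within a group (i.e., where "BestAvail" is the same). Whenever a row's sum is at or below the threshold (150), the "Best_Sets" value for the rows after it is replaced with that sum. My desired output looks like this: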

Closed_Grid Closed_Sets BestAvail Best_Sets BestSum
GY38         72.875     GX38       91.75    164.6250
GY37         87.125     GX38       91.75    178.8750
GY36         39.875     GX38       91.75    131.625
GZ38         29         GX38       131.625  160.625  
GZ37         80         GX38       131.625  211.625
GY35         2.375      GX38       131.625  134.00
GZ36         125.25     GX38       134.00   259.250
GZ35         29.875     GX38       134.00   163.8750
GY39         17.5       GX39       54.125   71.625
HA35         34.375     GZ33       30.5     64.875
GZ41         109.625    GZ42       76.75    186.375
GY41         82.28571   GZ42       76.75    159.03571
HA41         87.5       GZ42       76.75    164.25
GZ40         104.75     GZ42       76.75    181.5
GY40         60.625     GZ42       76.75    137.375
HA40         79.875     GZ42       137.375  217.25
GZ39         51.57143   GZ42       137.375  188.94643
HA39         71         GZ42       137.375  208.375

I can partially achieve this with the following loop:

for (i in 1:nrow(Destination)){
    Destination$BestSum[i] <- sum(Destination$Closed_Sets[i], Destination$Best_Sets[i])
    if (Destination$BestSum[i] <= 150){
      Destination [i:length(Destination),"Best_Sets"] <- Destination$BestSum[i]
    }
  }

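However, this code sets every "Best_Sets" value to 134 and does not restart when the "BestAvail" value changes, which in turn throws off all the following sums. Ultimately, I am trying to do a conditional cumulative sum for each "Closed_Set" within a group that stays under 150.

This is part of a model I am working on that will run over 150 separate data sets, all of different lengths and values. This particular bit of code will also need to iterate over the second, third, etc. options, so it has to be repeatable, with variables that are easy to change.

I have tried using the unique() function in a loop, tried writing my own function to use in dplyr (which would be ideal!), tried various cumulative sums with a reset function, and searched hundreds of threads here.

I am fairly new to R and programming and am having a hard time figuring out how to do this. I have gone through every related question I could find many times but cannot seem to make any of them work for my data.

I hope what I am trying to achieve makes sense.

Thanks!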

Answer 1

Note: the R code below is not very idiomatic and may be slow. I would not recommend this style for common tasks.

# build the data frame
Closed_Grid = c(
  "GY38",
  "GY37",
  "GY36",
  "GZ38",
  "GZ37",
  "GY35",
  "GZ36",
  "GZ35",
  "GY39",
  "HA35",
  "GZ41",
  "GY41",
  "HA41",
  "GZ40",
  "GY40",
  "HA40",
  "GZ39",
  "HA39"
)

Closed_Sets = c(
  72.875,
  87.125,
  39.875,
  29,
  80,
  2.375,
  125.25,
  29.875,
  17.5,
  34.375,
  109.625,
  82.28571,
  87.5,
  104.75,
  60.625,
  79.875,
  51.57143,
  71
)

BestAvail = c(
  "GX38",
  "GX38",
  "GX38",
  "GX38",
  "GX38",
  "GX38",
  "GX38",
  "GX38",
  "GX39",
  "GZ33",
  "GZ42",
  "GZ42",
  "GZ42",
  "GZ42",
  "GZ42",
  "GZ42",
  "GZ42",
  "GZ42"
)

Best_Sets = c(
  91.75,
  91.75,
  91.75,
  91.75,
  91.75,
  91.75,
  91.75,
  91.75,
  54.125,
  30.5,
  76.75,
  76.75,
  76.75,
  76.75,
  76.75,
  76.75,
  76.75,
  76.75
)

dat <- data.frame(
  Closed_Grid, Closed_Sets, BestAvail, Best_Sets,
  stringsAsFactors = FALSE
)

# allocate a vector; this makes the for() loop use significantly
# less memory, see https://adv-r.hadley.nz/perf-improve.html#avoid-copies

dat$BestSum <- NA_real_

# split the data frame to work on one group of BestAvail at a time
Destination <- split(dat, factor(dat[["BestAvail"]]))

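Destination <- lapply(Destination, function(dat) {
  n <- nrow(dat)

  for (i in seq_len(n)) {

    # row-wise sum of the two columns for row i
    BestSum <- dat[i, "Closed_Sets"] + dat[i, "Best_Sets"]
    dat[i, "BestSum"] <- BestSum

    # a sum within the threshold becomes the new Best_Sets for the
    # remaining rows of this group; skip the last row so that no
    # index past the end of the data frame is touched
    if (BestSum <= 150 && i < n) {
      dat[(i + 1):n, "Best_Sets"] <- BestSum
    }

  }

  dat

})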

# recombine
Destination <- do.call(rbind, Destination)

Destination

This code can be quite slow. If you need to run it on large data sets, it may be worth writing it in C++.
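Short of moving to C++, one option is to loop over plain numeric vectors instead of indexing the data frame row by row. The sketch below is only an illustration: `carry_forward_sums` is a hypothetical helper name, and it assumes the rows of one group are already in priority order and that "within the threshold" means `<= 150`.

```r
# Hypothetical helper: plain-vector version of the per-group loop.
# closed: Closed_Sets values of one group, already in priority order
# base:   starting Best_Sets value of that group
carry_forward_sums <- function(closed, base, threshold = 150) {
  n <- length(closed)
  best_sets <- numeric(n)
  best_sum <- numeric(n)
  for (i in seq_len(n)) {
    best_sets[i] <- base
    best_sum[i] <- closed[i] + base
    # a sum within the threshold becomes the base for the rows after it
    if (best_sum[i] <= threshold) base <- best_sum[i]
  }
  data.frame(Best_Sets = best_sets, BestSum = best_sum)
}

# GX38 group from the question
gx38 <- carry_forward_sums(
  c(72.875, 87.125, 39.875, 29, 80, 2.375, 125.25, 29.875),
  base = 91.75
)
gx38
```

The two returned columns can be bound back onto the group's rows, for example with `cbind()`, inside the same `split()` / `lapply()` structure.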

Answer 2

Destination is your data frame.

Try this:

  library(tidyverse)
  Destination %>%
   arrange(BestAvail, BestDistance) %>%
   mutate(BestSum = Closed_Sets + Best_Sets) %>%
   group_by(BestAvail) %>%
   mutate(Best_Sets2 = case_when(BestSum < 150 ~ lag(BestSum),
                                 TRUE ~ Best_Sets))
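Because each row's base depends on every earlier row in its group that stayed within the threshold, a single `lag()` may not carry a sum far enough forward. A possible alternative that stays in dplyr is to compute the running base with a small helper inside a grouped `mutate()`. This is a sketch under assumptions: `running_base` is a hypothetical helper, the data is a six-row subset of the question's sample, the rows are assumed to be pre-sorted, and the threshold test is `<= 150`.

```r
library(dplyr)

# Hypothetical helper: the Best_Sets value in effect for each row of one group.
running_base <- function(closed, base, threshold = 150) {
  out <- numeric(length(closed))
  for (i in seq_along(closed)) {
    out[i] <- base
    s <- closed[i] + base
    # a sum within the threshold becomes the base for later rows
    if (s <= threshold) base <- s
  }
  out
}

# Small subset of the question's sample data (two groups)
Destination <- data.frame(
  Closed_Grid = c("GY36", "GZ38", "GY35", "GZ36", "GY40", "HA40"),
  Closed_Sets = c(39.875, 29, 2.375, 125.25, 60.625, 79.875),
  BestAvail   = c("GX38", "GX38", "GX38", "GX38", "GZ42", "GZ42"),
  Best_Sets   = c(91.75, 91.75, 91.75, 91.75, 76.75, 76.75)
)

result <- Destination %>%
  group_by(BestAvail) %>%
  mutate(
    # the helper restarts for every BestAvail group
    Best_Sets = running_base(Closed_Sets, first(Best_Sets)),
    BestSum   = Closed_Sets + Best_Sets
  ) %>%
  ungroup()

result
```

`mutate()` evaluates its arguments in order, so `BestSum` is computed from the updated `Best_Sets` column.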

Answer 3

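So, I tried several of the solutions offered, but unfortunately I could not get them to work for me. I did, however, come up with a solution (with help from my supervisor). It is ugly and long, but it gets me the result I want.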

##Rearrange data from ascending order of Best Avail grids and Distance
Destination <- Destination %>% arrange(BestAvail, BestDistance)

####Set Levels and change to a factor in order to iterate through the different groups
Destination$Closed_Grid <- droplevels(Destination$Closed_Grid)
Destination$BestAvail <- as.factor(Destination$BestAvail)

###create a working file
WorkingDest = Destination[FALSE,]

##Loop that conditionally sums row by group, gives final dest grid, and pastes into working file
for (f in 1: nlevels (Destination$BestAvail)) {
  work <- subset(Destination, BestAvail == levels(Destination$BestAvail)[f])
  for (i in 1:nrow(work)){
    for (j in 1:length(levels(work$BestAvail))){
      if (as.character(work$BestAvail)[i] == as.character(levels(work$BestAvail)[j])){
        work$BestSum[i] <- sum(work$Closed_Sets[i], work$Best_Sets[i])
        if (work$BestSum[i] <= 150){
          work [i:nrow(work),"Best_Sets"] <- work$BestSum[i]
          work$Dest_Grid [i] <- as.character(work$BestAvail)[i]
        }
      }
    }
  }
  WorkingDest <- rbind(WorkingDest, work)
}
###Create Results DataFrame for Closed Sets that have Moved
FinalDestination <- WorkingDest[WorkingDest$Dest_Grid != 0,]

##Create a working df that only have the new base sets for matching purposes
MaxSetsBest <- WorkingDest %>%
  group_by(BestAvail) %>% top_n(1, Best_Sets)
MaxSetsBest <- MaxSetsBest[!duplicated(MaxSetsBest$BestAvail), ]

####change basesets for SecondBest based on previous iterations
for(id in 1:nrow(MaxSetsBest)){
  WorkingDest$Second_Sets[WorkingDest$SecondBest %in% MaxSetsBest$BestAvail[id]] <- MaxSetsBest$Best_Sets[id]
}
rm(id)

##Reset Destination with new basesets
Destination <- WorkingDest

##Remove Closed Grids that have moved from working file
Destination <- WorkingDest[WorkingDest$Dest_Grid == 0,]

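I then rerun the same code for the SecondBest options, and so on. I know this is not great coding, but it works, and the data frames I am running are very small (50 rows at most), so speed is not a big factor. If anyone knows how to make it better, great; if not, it works for me!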
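One way the rerun for each option might be made less manual is to pass the column names in as strings, so the same function serves every round. The following is a rough sketch rather than a drop-in replacement: `assign_round` and the `RoundSum` column are hypothetical names, unassigned rows are marked with `NA` instead of `0`, and the input is assumed to contain only rows that still need a destination, already sorted by priority.

```r
# Hypothetical wrapper: same rule, with the column names passed as strings
# so one function can serve the Best, Second, Third, ... rounds.
assign_round <- function(df, group_col, sets_col,
                         closed_col = "Closed_Sets", threshold = 150) {
  pieces <- lapply(split(df, df[[group_col]], drop = TRUE), function(g) {
    base <- g[[sets_col]][1]
    g$RoundSum <- NA_real_
    g$Dest_Grid <- NA_character_
    for (i in seq_len(nrow(g))) {
      g[[sets_col]][i] <- base
      g$RoundSum[i] <- g[[closed_col]][i] + base
      if (g$RoundSum[i] <= threshold) {
        # the row fits: it takes this grid and raises the base for later rows
        base <- g$RoundSum[i]
        g$Dest_Grid[i] <- as.character(g[[group_col]][i])
      }
    }
    g
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Four rows taken from the question's sample data
demo <- data.frame(
  Closed_Grid = c("GY39", "HA35", "GY40", "HA40"),
  Closed_Sets = c(17.5, 34.375, 60.625, 79.875),
  BestAvail   = c("GX39", "GZ33", "GZ42", "GZ42"),
  Best_Sets   = c(54.125, 30.5, 76.75, 76.75),
  stringsAsFactors = FALSE
)

round1 <- assign_round(demo, group_col = "BestAvail", sets_col = "Best_Sets")
round1
```

A later round would then be a call such as `assign_round(remaining, "SecondBest", "Second_Sets")` on the rows whose `Dest_Grid` is still `NA`, after the base sets have been updated as in the code above.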
