I have a data frame (not necessarily sorted like this) that looks as follows:
Group Value
A 1
A 5
A 6
A 11
B 3
B 4
B 5
B 10
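Now I want a new column that counts, for each row, how many rows in the same group have a value within a fixed range of that row's value. For this example, the range runs from 2 less than the current row's value up to the current value itself, inclusive. So the result would be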
Group Value New Count
A 1 1 (because there is only 1 row in Group A between -1 and 1, this row)
A 5 1 (because there is only 1 row in Group A between 3 and 5, this row)
A 6 2 (because there are 2 rows in Group A between 4 and 6)..and so on
A 11 1
B 3 1
B 4 2
B 5 3
B 10 1
I have seen some answers about running total counters within a group, but I have not come across this particular case while searching SO...
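To make the rule concrete, here is a minimal sketch that checks a single row by hand: the row with Value 6 in Group A. The names `dat`, `delta` and `in_window` are illustrative and not part of the question.

```r
# The question's data, rebuilt by hand.
dat <- data.frame(Group = rep(c("A", "B"), each = 4),
                  Value = c(1, 5, 6, 11, 3, 4, 5, 10))
delta <- 2  # window width used in the question's example

# Rows of group A whose Value lies in [6 - delta, 6], i.e. in [4, 6].
in_window <- dat$Group == "A" & dat$Value >= 6 - delta & dat$Value <= 6
n_in_window <- sum(in_window)
n_in_window  # 2: the rows with Value 5 and Value 6
```

Repeating this check for every row, with that row's own group and value, gives the New Count column shown above.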
Answer 0: (score: 1)
Another approach is to use a non-equi join and to group by the join conditions:
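library(data.table)
delta <- 2  # window width; not defined in the original snippet, 2 is the value from the question
# Join DF to itself on Group and Value in [Value - delta, Value], then count matches per row (.EACHI)
setDT(DF)[, New.Count := .SD[.(Group = Group, V1 = Value, V2 = Value - delta),
on = .(Group, Value <= V1, Value >= V2), .N, by = .EACHI]$N][]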
   Group Value New.Count
1:     A     1         1
2:     A     5         1
3:     A     6         2
4:     A    11         1
5:     B     3         1
6:     B     4         2
7:     B     5         3
8:     B    10         1
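To illustrate what the non-equi join computes, the following base R sketch builds every within-group pair of rows explicitly, keeps the pairs that satisfy the two inequalities, and counts them per row. It is only an illustration of the semantics, not how data.table implements the join, and it materialises all pairs, so it will not scale to large groups. The names `dat`, `pairs`, `keep` and the helper column `id` are assumptions made for this example.

```r
# The question's data, with a row id so matches can be counted per original row.
dat <- data.frame(Group = rep(c("A", "B"), each = 4),
                  Value = c(1, 5, 6, 11, 3, 4, 5, 10))
delta <- 2
dat$id <- seq_len(nrow(dat))

# Self-join on Group: one output row per pair of rows in the same group.
# Columns from the left copy get the suffix .x, from the right copy .y.
pairs <- merge(dat, dat, by = "Group")

# Keep pairs where the right-hand value lies in [left value - delta, left value].
keep <- pairs$Value.y <= pairs$Value.x & pairs$Value.y >= pairs$Value.x - delta

# Count surviving pairs per left-hand row.
dat$New.Count <- tabulate(pairs$id.x[keep], nbins = nrow(dat))
dat[, c("Group", "Value", "New.Count")]
```

Every row pairs with itself, so each count is at least 1, which matches the output above.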
library(data.table)
DF <- fread(
" Group Value
A 1
A 5
A 6
A 11
B 3
B 4
B 5
B 10"
)
Answer 1: (score: 0)
I found a way to do it with a loop; I don't know how to do it otherwise:
Df <- data.frame(list(Value = c(1,5,8,11,3,4,5,10), Group = c("A","A","A","A","B","B","B","B")))
for (i in seq_len(nrow(Df))) {
  # count rows in the same group whose Value lies in [Value[i] - 2, Value[i]]
  Df$newcount[i] <- sum(Df$Value <= Df$Value[i] & Df$Value >= Df$Value[i] - 2 & Df$Group == Df$Group[i])
}
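It loops over every row and counts the rows that meet your condition: a value between Value - 2 and Value, within the same group. I was looking for a data.table way but did not manage to find one. Output: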
Value Group newcount
1 1 A 1
2 5 A 1
3 8 A 1
4 11 A 1
5 3 B 1
6 4 B 2
7 5 B 3
8 10 B 1
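Answer 2: (score: 0)
You can achieve this with purrr, though there may be a more concise way. We first create a new variable holding the range we will search in. Next, we collect all values for the given group. For the result, we sum up the number of values that fall within the search range. We can wrap this in a function and reuse it conveniently.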
library(tidyverse)
find_counts <- function(x, range = 2) {
search_range <- map(x, ~seq(.x-range, .x, 1))
unique_vals <- list(x)
map2_int(unique_vals, search_range, ~sum(.x %in% .y))
}
Df %>%
group_by(Group) %>%
mutate(result = find_counts(Value))
#> # A tibble: 8 x 3
#> # Groups: Group [2]
#> Group Value result
#> <fctr> <int> <dbl>
#> 1 A 1 1
#> 2 A 5 1
#> 3 A 8 1
#> 4 A 11 1
#> 5 B 3 1
#> 6 B 4 2
#> 7 B 5 3
#> 8 B 10 1
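One caveat worth checking before reusing `find_counts`: `seq(.x - range, .x, 1)` combined with `%in%` only matches values that sit a whole number of steps away from the current value. The short sketch below, with a made-up vector `x`, shows the difference from a comparison-based test. It is an illustration of that behaviour under the assumption of non-integer data, which the example data in this thread does not contain.

```r
# Hypothetical values including a non-integer one.
x <- c(1, 1.5, 2)

# Membership test against the integer grid 0, 1, 2: 1.5 is not on the grid.
n_grid <- sum(x %in% seq(2 - 2, 2, 1))

# Comparison test against the closed interval [0, 2]: all three values qualify.
n_interval <- sum(x >= 2 - 2 & x <= 2)

c(grid = n_grid, interval = n_interval)
```

For integer-valued columns such as the one in the question, both tests give the same counts.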
Results of microbenchmark::microbenchmark with the following data:
set.seed(928374)
DF <- data.frame(Group = sample(letters[1:15], 500, replace = T),
Value = sample(1:10, 500, replace = T))
Unit: milliseconds
expr min lq mean median uq max neval cld
ANG 1607.59370 1645.93364 1776.582546 1709.976584 1822.011283 2603.61574 30 c
ThomasK 15.30110 16.11919 19.040010 17.238959 19.550713 54.30369 30 a
denis 155.92567 165.73500 182.563020 171.147209 204.508171 253.26394 30 b
uwe 2.15669 2.46198 3.207837 2.570449 3.114574 13.28832 30 a
Df <- read.table(text = " Group Value
A 1
A 5
A 8
A 11
B 3
B 4
B 5
B 10", header = T)
Answer 3: (score: 0)
Building on your starting point (mentioned in your comment), here is a loop that does it
df <- data.frame(Group = c(rep("A", 4), rep("B", 4)),
Value = c(1, 5, 6, 11, 3, 4, 5, 10))
library(dplyr)  # library() stops with an error if dplyr is missing; require() only warns
for(i in seq_along(df$Value)){
df$NewCount[i] <- nrow(df %>% filter(Group == Group[i] &
Value <= Value[i] &
Value >= Value[i]-2))
}
Answer 4: (score: 0)
Base R only:
count_in_range = function(x){
delta = 2
vapply(x,
FUN = function(value) sum(x>=(value - delta) & x<=value, na.rm = TRUE),
FUN.VALUE = numeric(1)
)
}
# dfs is not defined in this snippet; judging by the output below, it holds the question's data
dfs$newcount = ave(dfs$Value, dfs$Group, FUN = count_in_range)
dfs
# Group Value newcount
# 1 A 1 1
# 2 A 5 1
# 3 A 6 2
# 4 A 11 1
# 5 B 3 1
# 6 B 4 2
# 7 B 5 3
# 8 B 10 1
Benchmark against data.table:
set.seed(928374)
DF <- data.frame(Group = sample(letters[1:15], 500, replace = T),
Value = sample(1:10, 500, replace = T))
library(data.table)
library(microbenchmark)
DT = as.data.table(DF)
delta = 2
microbenchmark(
datatable = {
DT[, New.Count := .SD[.(Group = Group, V1 = Value, V2 = Value - delta),
on = .(Group, Value <= V1, Value >= V2), .N, by = .EACHI]$N][]
},
ave = {
DF$newcount = ave(DF$Value, DF$Group, FUN = count_in_range)
}
)
# Unit: microseconds
# expr min lq mean median uq max neval
# datatable 1424.814 1438.3355 1492.9422 1459.2175 1512.100 1914.575 100
# ave 712.708 737.1955 849.0507 756.7265 789.327 3583.369 100
all.equal(DF$newcount, DT$New.Count) # TRUE
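For large groups, the pairwise comparison inside `count_in_range` grows quadratically with group size. One possible alternative, sketched here and not benchmarked, is to sort each group once and use base R's `findInterval` to count how many values are at most the current value and how many are strictly below the lower bound. The helper name `count_in_range_sorted` is hypothetical, and the sketch assumes the values contain no `NA`.

```r
# Hypothetical helper: for each element of x, count the elements of x
# that lie in the closed interval [value - delta, value].
count_in_range_sorted <- function(x, delta = 2) {
  xs <- sort(x)
  n_le <- findInterval(x, xs)                            # elements <= value
  n_lt <- findInterval(x - delta, xs, left.open = TRUE)  # elements <  value - delta
  n_le - n_lt
}

# The question's data, rebuilt by hand.
df <- data.frame(Group = rep(c("A", "B"), each = 4),
                 Value = c(1, 5, 6, 11, 3, 4, 5, 10))
df$newcount <- ave(df$Value, df$Group, FUN = count_in_range_sorted)
df$newcount
```

On the question's data this reproduces the counts from the other answers; whether it is actually faster on a given data set would need to be measured.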