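I'm having a hard time with this one. I'm trying to find the points that are close to each other within each group and then group them together. Let me explain using the following sample data: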
Group X Y Z
1 110 3762 431 10
2 112 4950 880 10
3 113 5062 873 20
4 113 5225 874 30
5 113 5262 875 10
6 113 5300 874 20
structure(list(Group = c(110, 112, 113, 113, 113, 113), X = c(3762,
4950, 5062, 5225, 5262, 5300), Y = c(431, 880, 873, 874, 875,
874), Z = c(10, 10, 20, 30, 10, 20)), row.names = c(NA, -6L), class = "data.frame")
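As you can see, Group is the grouping column, X and Y are the coordinates, and Z is the column that should be aggregated whenever points are defined as "close" (Euclidean distance < 100).
What I have tried:
I managed to compute the Euclidean distance between consecutive points with this loop: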
for(i in 1:nrow(test)) {
if(i > 1 && test$Group[i] == test$Group[i-1]) {
test$Distance[i] <- sqrt(((test$X[i] - test$X[i-1]) ^ 2) + ((test$Y[i] - test$Y[i-1]) ^ 2))
} else {
test$Distance[i] <- NA
}
}
This gives me:
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316
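For reference, the same column can be produced without explicit row indexing. The following is only a sketch in base R, assuming the rows of each group are already in the order in which neighbours should be compared; `dx` and `dy` are helper names introduced here for illustration.

```r
# Sample data from the question
test <- data.frame(
  Group = c(110, 112, 113, 113, 113, 113),
  X = c(3762, 4950, 5062, 5225, 5262, 5300),
  Y = c(431, 880, 873, 874, 875, 874),
  Z = c(10, 10, 20, 30, 10, 20)
)

# Difference to the previous row within each group;
# the first row of every group has no predecessor and gets NA.
dx <- ave(test$X, test$Group, FUN = function(v) c(NA, diff(v)))
dy <- ave(test$Y, test$Group, FUN = function(v) c(NA, diff(v)))

# Euclidean distance between consecutive points of the same group
test$Distance <- sqrt(dx^2 + dy^2)
test
```

The NA in the first row of each group is kept on purpose: it marks the place where a new group starts.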
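This is where things get complicated, because the first row of each group is NA. ...
What I want to achieve:
Within each group, I want to find the points that are less than 100 apart (distance < 100) and summarise the Z column on that basis. Done by hand, the result would be: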
Group Z Grouped
1 110 10 no
2 112 10 no
3 113 20 no
4 113 60 yes
Thanks for the help!
Answer 0 (score: 2)
That was a tough one. I'm not sure I have fully figured it out.
#get data and libraries
library(tidyverse)
df <- read.table(text = "
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316", header = T, stringsAsFactors = F)
df %>%
group_by(Group) %>%
do(reshape2::melt(outer(.$Distance, .$Distance, `-`))) %>% # melt() is from reshape2, which tidyverse does not load
filter(between(value, -100, 0) | between(value, 0, 100)) %>%
distinct(Var1) %>%
mutate(grouped = 1) %>%
rename(row = Var1) -> rows
df %>%
group_by(Group) %>%
mutate(row = row_number()) %>%
left_join(rows, by = c("row", "Group")) %>%
mutate(grouped = ifelse(is.na(grouped), "no", "yes")) %>%
group_by(Group, grouped) %>%
mutate(Z = ifelse(!is.na(grouped), sum(Z), Z)) %>%
distinct(Group, Z, grouped)
# A tibble: 4 x 3
# Groups: Group, grouped [4]
Group Z grouped
<int> <int> <chr>
1 110 10 no
2 112 10 no
3 113 20 no
4 113 60 yes
Hopefully this is what you were looking for; if not, maybe it gives you some new ideas.
Update: here is a version that I hope is more helpful:
df %>%
group_by(Group) %>%
mutate(int1 = lead(Distance) < 100 | Distance < 100,
int1 = replace(int1, is.na(int1), FALSE),
int2 = data.table::rleid(int1), # rleid() is from data.table, which tidyverse does not load
int2 = replace(int2, !int1 | is.na(int1), NA)) -> df2
df2 %>%
filter(int1) %>%
group_by(Group, int2) %>%
summarise(Z = sum(Z),
Grouped = "yes") %>%
select(Group, Z, Grouped) %>%
bind_rows(df2 %>%
filter(!int1) %>%
mutate(Grouped = "no") %>%
select(Group, Z, Grouped)) %>%
arrange(Group)
# A tibble: 4 x 3
# Groups: Group [3]
Group Z Grouped
<int> <int> <chr>
1 110 10 no
2 112 10 no
3 113 60 yes
4 113 20 no
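The run-labelling idea behind the update can also be sketched in base R. This is not the code above, only an illustration under two assumptions: the rows are sorted by Group, and Distance holds the distance to the previous row of the same group (NA for the first row of a group). The names `starts_run`, `run` and `sizes` are introduced here for illustration.

```r
df <- data.frame(
  Group = c(110, 112, 113, 113, 113, 113),
  Z = c(10, 10, 20, 30, 10, 20),
  Distance = c(NA, NA, NA, 163.00307, 37.01351, 38.01316)
)

# A row starts a new run when it has no predecessor in its group (NA)
# or when it is 100 or more away from the previous row.
starts_run <- is.na(df$Distance) | df$Distance >= 100

# Running count of run starts gives every row a run id
df$run <- cumsum(starts_run)

# Sum Z per run; a run with more than one row counts as grouped
out <- aggregate(Z ~ Group + run, data = df, FUN = sum)
sizes <- tabulate(df$run)
out$Grouped <- ifelse(sizes[out$run] > 1, "yes", "no")
out[, c("Group", "Z", "Grouped")]
```

With this rule, row 3 stays on its own (Z = 20) and rows 4 to 6 are merged (Z = 60), which matches the hand-made result in the question.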
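Answer 1 (score: 1)
I put together a small use case to get you started. It is a basic approach built on a for loop and aggregation over column vectors; you can pass a vector of functions that is applied pairwise to the columns you want to aggregate.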
df <- read.table(text = "
Group X Y Z Distance
1 110 3762 431 10 NA
2 112 4950 880 10 NA
3 113 5062 873 20 NA
4 113 5225 874 30 163.00307
5 113 5262 875 10 37.01351
6 113 5300 874 20 38.01316
7 114 5300 874 30 NA
8 114 5300 874 20 38.01316", header = T, stringsAsFactors = F)
aggregateIt <- function(df = data, #data.frame
returnRaw = F, #to get the raw unaggregated df (only first case from column `grouped` by `subgroup` usable in this application)
colsToAgg = c("Z1", "Z2", "Z3"), #cols to aggregate
how = c("sum", "sum", "max")) #how to aggregate the columns, `Z1` by sum, `Z2` by sum and `Z3` by max
{
count <- 1L
result <- vector("integer", nrow(df))
grouped <- vector("character", nrow(df))
for(i in seq_len(length(result)-1L)){
if(df$Group[i] != df$Group[i+1L]) {
result[i] <- count
grouped[i] <- "no"
count <- count + 1L
if((i+1L) == length(result)) {
result[i+1L] <- count
grouped[i+1L] <- "no"
}
} else {
if(df$Distance[i+1L] > 100L) {
result[i] <- count
grouped[i] <- "no"
count <- count + 1L
if((i+1L) == length(result)) {
result[i+1L] <- count
grouped[i+1L] <- "no"
}
} else {
result[i] <- count
grouped[i] <- "yes"
if((i+1L) == length(result)) {
result[i+1L] <- count
grouped[i+1L] <- "yes"
}
}
}
}
df <- within(df, {subgroup <- result; grouped <- grouped})
if(returnRaw) return(df)
A <- Reduce(function(a, b) merge(a, b, by = "subgroup"),
lapply(seq_along(how), function(x) aggregate(.~subgroup, df[, c(colsToAgg[x], "subgroup")], how[x])))
B <- df[!duplicated(df$subgroup, fromLast = F), c("Group", "subgroup", "grouped")]
out <- merge(A, B, by = "subgroup")
return(out[, c("Group", colsToAgg, "grouped")])
}
aggregateIt(df = df, colsToAgg = "Z", how = "sum")
# Group Z grouped
#1 110 10 no
#2 112 10 no
#3 113 20 no
#4 113 60 yes
#5 114 50 yes
I don't claim this is the most efficient solution, but it points toward one. Hope this helps!
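One caveat about approaches that rely on the Distance column: they only compare each row with the row directly before it. If "close" should instead mean any two points of a group that are linked by a chain of short distances, one possible alternative (a different technique, not what either answer does) is single-linkage clustering per group with `hclust()` and `cutree()` from base R's stats package. The sketch below assumes a cut height of 100, which merges points at distance <= 100, so the boundary case differs slightly from the question's "< 100"; `cluster_close` is a hypothetical helper name.

```r
df <- data.frame(
  Group = c(110, 112, 113, 113, 113, 113),
  X = c(3762, 4950, 5062, 5225, 5262, 5300),
  Y = c(431, 880, 873, 874, 875, 874),
  Z = c(10, 10, 20, 30, 10, 20)
)

# Hypothetical helper: cluster ids for the points of one group.
# Single linkage joins points connected by a chain of distances <= h.
cluster_close <- function(x, y, h = 100) {
  if (length(x) < 2) return(rep(1L, length(x)))  # hclust needs at least 2 points
  cutree(hclust(dist(cbind(x, y)), method = "single"), h = h)
}

# Apply the helper within each Group via row indices
df$cluster <- ave(seq_len(nrow(df)), df$Group,
                  FUN = function(i) cluster_close(df$X[i], df$Y[i]))

# Sum Z per Group and cluster
aggregate(Z ~ Group + cluster, data = df, FUN = sum)
```

On the sample data this yields the same split as the hand-made result: the first point of group 113 stays alone and the other three are merged.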