I want to summarize relocations (between cities) per unique ID number. Example data frame with two unique IDs:
year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4
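I have provided sample code below. The summary is correct except for one thing. If a relocation is found between city B and city A, I want the output to report it as a relocation from city B to city A (with a count of 1, meaning it is seen once in the data frame). However, because of how summarize works (it tends to return its output in alphabetical order of the grouping columns), I get the following output: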
tmp <- df %>% group_by(ID, city, adress) %>% summarize(numberofyears = n())
tmp <- tmp %>%
group_by(ID) %>%
#filter(n() >1) %>%
mutate(from = city[1], from_adres = adress[1], from_years = numberofyears[1], to = city[2],
to_adres = adress[2], to_years = numberofyears[2]) %>%
distinct(ID, .keep_all = TRUE) %>% select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<dbl> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 A adress_2 2 B adress_1 2
2 2 2 B adress_3 2 C adress_4 2
The first row is wrong, because we know that adress_1 comes before adress_2. For the relocation from city B to city C, I get the correct result.
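The reordering can be seen in isolation with a minimal sketch. It rebuilds the example data by hand (the `data.frame()` call is my own reconstruction of the table above, with strings kept as characters) and shows that the grouped summary comes back sorted by the grouping keys rather than by year:

```r
library(dplyr)

# Reconstruction of the example data from the question
df <- data.frame(
  year   = rep(2013:2016, times = 2),
  ID     = rep(1:2, each = 4),
  city   = c("B", "B", "A", "A", "B", "B", "C", "C"),
  adress = paste0("adress_", rep(1:4, each = 2)),
  stringsAsFactors = FALSE
)

tmp <- df %>%
  group_by(ID, city, adress) %>%
  summarize(numberofyears = n(), .groups = "drop")

# Rows are sorted by ID, then city, then adress, so for ID 1
# city "A" precedes city "B" although B was lived in first.
tmp$city[tmp$ID == 1]
# [1] "A" "B"
```

This is why `city[1]` picks up A as the "from" city for ID 1: the year information is already gone by the time the rows are indexed.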
It is a small detail, but, as I have tried to show, an important one. Any suggestions would be much appreciated!
Answer 0 (score: 1)
Like this?
library(tidyverse)
df<-read.table(text=" year ID city adress
1 2013 1 B adress_1
2 2014 1 B adress_1
3 2015 1 A adress_2
4 2016 1 A adress_2
5 2013 2 B adress_3
6 2014 2 B adress_3
7 2015 2 C adress_4
8 2016 2 C adress_4",header=T)
df%>%
group_by(ID, city, adress)%>%
summarize(numberofyears = n())%>%
mutate(id=parse_number(adress))%>%
group_by(ID,id)%>%
arrange(id)%>%
ungroup()%>%
select(-id)%>%
group_by(ID)%>%
mutate(from=first(city), from_adres = first(adress),
from_years = first(numberofyears),to=last(city),
to_adres = last(adress),to_years=last(numberofyears))%>%
distinct(ID, .keep_all = TRUE)%>%
select(-c(2:3))
# A tibble: 2 x 8
# Groups: ID [2]
ID numberofyears from from_adres from_years to to_adres to_years
<int> <int> <fct> <fct> <int> <fct> <fct> <int>
1 1 2 B adress_1 2 A adress_2 2
2 2 2 B adress_3 2 C adress_4 2
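Note that the code above orders the rows by the number inside `adress`, so it depends on the address labels happening to follow chronological order. A possible variant, sketched here under the assumption that `year` is the real source of ordering, keeps the first year of each stay and sorts on that instead. The `first_year` column and the hand-built `df` are introduced only for this illustration:

```r
library(dplyr)

# Reconstruction of the example data, strings kept as characters
df <- data.frame(
  year   = rep(2013:2016, times = 2),
  ID     = rep(1:2, each = 4),
  city   = c("B", "B", "A", "A", "B", "B", "C", "C"),
  adress = paste0("adress_", rep(1:4, each = 2)),
  stringsAsFactors = FALSE
)

moves <- df %>%
  group_by(ID, city, adress) %>%
  # remember when each stay started, alongside its length
  summarize(first_year = min(year), numberofyears = n(), .groups = "drop") %>%
  # chronological order within each ID
  arrange(ID, first_year) %>%
  group_by(ID) %>%
  summarize(
    from = first(city), from_adres = first(adress),
    from_years = first(numberofyears),
    to = last(city), to_adres = last(adress),
    to_years = last(numberofyears)
  )

moves$from  # "B" "B"
moves$to    # "A" "C"
```

Like the original, this only reports the first and last location per ID, so it assumes at most one move per person.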
Answer 1 (score: 1)
Similar to @jyjek's answer, but this allows for the possibility of more than one move per ID.
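One way such an approach might look is sketched below; it is an illustration of the idea, not necessarily the code this answer used. It numbers each consecutive stay at an address with `cumsum()` over address changes, collapses every stay to one row, and then pairs each stay with the previous one via `lag()`. The data frame `df2`, the `stay` column, and the extra move to `adress_5` in city C are hypothetical additions to show an ID with two moves:

```r
library(dplyr)

# Hypothetical data: ID 1 moves twice (B -> A -> C), ID 2 moves once
df2 <- data.frame(
  year   = c(2013, 2014, 2015, 2016, 2017, 2013, 2014, 2015, 2016),
  ID     = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  city   = c("B", "B", "A", "A", "C", "B", "B", "C", "C"),
  adress = c("adress_1", "adress_1", "adress_2", "adress_2", "adress_5",
             "adress_3", "adress_3", "adress_4", "adress_4"),
  stringsAsFactors = FALSE
)

moves <- df2 %>%
  arrange(ID, year) %>%
  group_by(ID) %>%
  # stay counter: increases by one every time the address changes
  mutate(stay = cumsum(adress != lag(adress, default = first(adress)))) %>%
  group_by(ID, stay) %>%
  # one row per stay, still in chronological order within each ID
  summarize(city = first(city), adress = first(adress), years = n(),
            .groups = "drop_last") %>%
  # pair each stay with the one before it
  mutate(from = lag(city), from_adres = lag(adress), from_years = lag(years)) %>%
  filter(!is.na(from)) %>%
  ungroup() %>%
  select(ID, from, from_adres, from_years,
         to = city, to_adres = adress, to_years = years)

nrow(moves)  # 3: B -> A and A -> C for ID 1, B -> C for ID 2
```

Because stays are identified by position in time rather than by grouping on `city` and `adress`, a person who later returns to an earlier address still produces a separate row for that move.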