Here is a reproducible example of my dplyr problem:
library(tidyverse)
df <- tibble(State = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
             District_code = c(1:9),
             District = c("North", "West", "North West", "South", "East", "South East",
                          "XYZ", "ZYX", "AGS"),
             Population = c(1000000, 2000000, 3000000, 4000000, 5000000, 6000000,
                            7000000, 8000000, 9000000))
df
#> # A tibble: 9 x 4
#> State District_code District Population
#> <chr> <int> <chr> <dbl>
#> 1 A 1 North 1000000
#> 2 A 2 West 2000000
#> 3 A 3 North West 3000000
#> 4 A 4 South 4000000
#> 5 A 5 East 5000000
#> 6 A 6 South East 6000000
#> 7 B 7 XYZ 7000000
#> 8 B 8 ZYX 8000000
#> 9 B 9 AGS 9000000
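For some states, I need to merge districts, identified by name, into fewer geographic categories. In particular, State A should have only "North - West - North West" and "South - East - South East". Some variables, such as Population, must be summed, while others, such as District_code, should become NA. I found this example of operations between rows, but it is not quite the same. Grouping does not seem to apply.
The final result should look like this: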
new_df
#> # A tibble: 5 x 4
#> State District_code District Population
#> <chr> <int> <chr> <dbl>
#> 1 A NA North - West - North West 6000000
#> 2 A NA South - East - South East 15000000
#> 3 B 7 XYZ 7000000
#> 4 B 8 ZYX 8000000
#> 5 B 9 AGS 9000000
In the actual data frame, several variables must be summed (like Population), and many others (like District_code) must take NA values.
Thanks for your help!
Answer 0 (score: 4)
You can use fct_collapse to specify the new factor levels, then use summarise on the new groups.
df %>%
  # Collapse the six State A districts into two combined levels
  mutate(District =
           fct_collapse(District,
                        "North - West - North West" = c("North", "West", "North West"),
                        "South - East - South East" = c("South", "East", "South East"))) %>%
  group_by(State, District) %>%
  summarise(Population = sum(Population),
            # District_code is integer, so merged groups get NA_integer_
            District_code = ifelse(n() > 1, NA_integer_, District_code))
# A tibble: 5 x 3
# Groups: State [?]
# State District Population
# <chr> <fct> <dbl>
# 1 A South - East - South East 15000000
# 2 A North - West - North West 6000000
# 3 B AGS 9000000
# 4 B XYZ 7000000
# 5 B ZYX 8000000
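To see what fct_collapse does on its own, here is a minimal sketch on a toy vector. The vector `districts` and the combined label "North - West" are made up for illustration; only the forcats package is assumed to be installed.

```r
library(forcats)

# Toy input (hypothetical, not the question's data)
districts <- c("North", "West", "South", "XYZ")

# Map "North" and "West" onto one new level; other values are left alone
collapsed <- fct_collapse(districts, "North - West" = c("North", "West"))

as.character(collapsed)
# "North - West" "North - West" "South" "XYZ"
```

Because the result is a factor, grouping on it afterwards treats the collapsed districts as a single group.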
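If you only want to change the districts of certain states, you can add a case_when or if_else like this, and choose the summary function based on the column type (here, double for Population as opposed to integer for District_code):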
df %>%
  mutate(District =
           case_when(State == "A" ~
                       fct_collapse(District,
                                    "North - West - North West" = c("North", "West", "North West"),
                                    "South - East - South East" = c("South", "East", "South East")),
                     TRUE ~ factor(District))) %>%
  group_by(State, District) %>%
  # Note: funs() is deprecated as of dplyr 0.8.0
  summarise_all(funs({
    if (is.double(.)) {
      sum(.)        # double columns (Population) are summed
    } else if (length(unique(.)) > 1) {
      NA            # merged rows have no single value to keep
    } else {
      unique(.)     # unmerged rows keep their value
    }
  }))
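For readers on dplyr 1.0 or later, the same idea can be sketched with across() instead of summarise_all() and funs(). This is only one possible rewrite, not the answerer's code: the helper name `collapse_column` is hypothetical, and the District column is kept as character (via as.character) to sidestep combining factors with different levels.

```r
library(dplyr)
library(forcats)
library(tibble)

df <- tibble(
  State = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  District_code = 1:9,
  District = c("North", "West", "North West", "South", "East", "South East",
               "XYZ", "ZYX", "AGS"),
  Population = c(1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6, 9e6)
)

# Hypothetical helper: sum doubles; otherwise keep the value only if it is
# unique within the group, and return a type-matched NA if it is not
collapse_column <- function(x) {
  if (is.double(x)) {
    sum(x)
  } else if (length(unique(x)) > 1) {
    x[NA_integer_]
  } else {
    unique(x)
  }
}

new_df <- df %>%
  mutate(District = if_else(
    State == "A",
    as.character(fct_collapse(
      District,
      "North - West - North West" = c("North", "West", "North West"),
      "South - East - South East" = c("South", "East", "South East")
    )),
    District
  )) %>%
  group_by(State, District) %>%
  summarise(across(everything(), collapse_column), .groups = "drop")
```

As in the original, the type test is a heuristic: an integer column that should be summed would be turned into NA instead, so the rule may need adjusting for the real data.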
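Answer 1 (score: 2)
For some states, I need to merge districts, identified by name, into fewer geographic categories. In particular, State A should have only "North - West - North West" and "South - East - South East".
You need to write down the grouping rules, for example...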
merge_rules = list(
  list(State = "A", District = c("North", "West", "North West")),
  list(State = "A", District = c("South", "East", "South East"))
)
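Some variables, such as Population, must be summed, while others, such as District_code, should become NA.
I would do this by putting the merge rules in a table, computing the aggregates after joining on it, and then rbind-ing the unmerged rows back on. Here is the data.table way...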
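library(data.table)
DT = data.table(df)
# One row per (rule, State, District); g identifies the merge rule
mDT = rbindlist(lapply(merge_rules, as.data.table), idcol = "g")
# Join to the rows covered by a rule, then aggregate per rule
gDT = DT[mDT, on=.(State, District)][, .(
  District_code = District_code[NA_integer_],
  District      = paste(District, collapse = " - "),
  Population    = sum(Population)
), by=.(g, State)]
# Stack the untouched rows with the merged ones
rbind(
  DT[!mDT, on=.(State, District)],
  gDT[, !"g"]
)[order(State, District)]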
State District_code District Population
1: A NA North - West - North West 6.0e+06
2: A NA South - East - South East 1.5e+07
3: B 9 AGS 9.0e+06
4: B 7 XYZ 7.0e+06
5: B 8 ZYX 8.0e+06
And, I suppose, the tidyverse way is similar:
# as_tibble() replaces the deprecated as.tibble()
mtib = bind_rows(lapply(merge_rules, as_tibble), .id = "g")
# Keep only the rows covered by a rule, then aggregate per rule
gtib = right_join(df, mtib, by=c("State", "District")) %>%
  group_by(g, State) %>% summarise(
    District_code = District_code[NA_integer_],
    District      = paste(District, collapse = " - "),
    Population    = sum(Population)
  )
# Stack the untouched rows with the merged ones
bind_rows(
  anti_join(df, mtib, by=c("State", "District")),
  gtib %>% ungroup %>% select(-g)
) %>% arrange(State, District)
# A tibble: 5 x 4
State District_code District Population
<chr> <int> <chr> <dbl>
1 A NA North - West - North West 6000000
2 A NA South - East - South East 15000000
3 B 9 AGS 9000000
4 B 7 XYZ 7000000
5 B 8 ZYX 8000000
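Both versions above rely on the idiom District_code[NA_integer_] to get a missing value of the right type. A short base R sketch of why that works (the vectors `codes` and `labels` are made up for illustration):

```r
# Indexing with a single integer NA returns one NA of the vector's own type
codes  <- 1:3
labels <- c("a", "b")

na_code  <- codes[NA_integer_]
na_label <- labels[NA_integer_]

typeof(na_code)   # "integer"
typeof(na_label)  # "character"
length(na_code)   # 1
```

This keeps the aggregated column the same type as the original without having to hard-code NA_integer_ or NA_character_ for every column.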
Answer 2 (score: 0)
Here is one way to get the population totals for State A:
df %>%
  filter(State == "A") %>%
  mutate(`North - West - North West` = (District == "North"|District == "West"|District == "North West"),
         `South - East - South East` = (District == "South"|District == "East"|District == "South East")) %>%
  gather(key = Districts, value = present, 5:6) %>%
  filter(present != FALSE) %>%
  group_by(Districts) %>%
  summarise(Population = sum(Population))
Which gives the output:
Districts Population
<chr> <dbl>
1 North - West - No… 6000000
2 South - East - So… 15000000
Someone should be able to help fold the above back into the original df.
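One possible way to put the State A totals back alongside the other rows is sketched below. It is not this answer's code: it swaps the logical-column and gather steps for a case_when relabelling, assumes dplyr 1.0 or later (for the .groups argument), and the names `north`, `south`, `state_a` and `new_df` are chosen here for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  State = c("A", "A", "A", "A", "A", "A", "B", "B", "B"),
  District_code = 1:9,
  District = c("North", "West", "North West", "South", "East", "South East",
               "XYZ", "ZYX", "AGS"),
  Population = c(1e6, 2e6, 3e6, 4e6, 5e6, 6e6, 7e6, 8e6, 9e6)
)

north <- c("North", "West", "North West")
south <- c("South", "East", "South East")

# Relabel and aggregate the State A rows only
state_a <- df %>%
  filter(State == "A") %>%
  mutate(District = case_when(
    District %in% north ~ "North - West - North West",
    District %in% south ~ "South - East - South East"
  )) %>%
  group_by(State, District) %>%
  summarise(Population = sum(Population), .groups = "drop") %>%
  mutate(District_code = NA_integer_)  # merged rows have no single code

# Stack the aggregated State A rows on top of the untouched rows
new_df <- bind_rows(state_a, filter(df, State != "A")) %>%
  select(State, District_code, District, Population)
```

With more columns, each one to be summed would need its own entry in summarise, and each one to be blanked would need a type-matched NA in the final mutate.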