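I need to merge two data frames. A sample of each is shown below. Both are keyed by school district: the first holds revenue, the second holds grades.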
School district     revenue
Richland 1          8702
Richland 2          3749
Richland Board      892
Charleston          6324
Greenville          1245
Greenville Board    371

School district     grade
Richland 1          A
Richland 2          A+
Charleston          B
Greenville          D
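The goal is to merge the two data frames and aggregate the final result to the same level as the second (grade) data frame. I have to build a data dictionary to merge them, because every name is different (although I have simplified that here), and the dictionary also has to handle the aggregation. What I planned to do was set up my dictionary like this: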
School_dist1    School_dist2
Richland 1      Richland 1
Richland 2      Richland 2
?????           Richland Board
Charleston      Charleston
Greenville      Greenville
Greenville      Greenville Board
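Then I would simply aggregate on the School_dist1 column. As you can see, the problem is that while Greenville Board can simply be rolled up into Greenville, the Richland Board revenue needs to be split (evenly) between the two Richland districts.
I have searched for a solution with every keyword I could think of, but because of the odd nature of the problem I could not find anything. The gist is that I need to aggregate data, but some observations have to be split up and shared among other observations.
Is there a way to do this? Am I making sense? I am completely stuck on this problem.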
Answer 0 (score: 2)
This is the long way around, but it will get you there...
# Your data. Don't use spaces in column names.
df1 <- read.table(text = "School_district revenue
Richland_1 8702
Richland_2 3749
Richland_Board 892
Charleston 6324
Greenville 1245
Greenville_Board 371", header = TRUE)
df2 <- read.table(text = "School_district grade
Richland_1 A
Richland_2 A+
Charleston B
Greenville D", header = TRUE)

library(tidyverse)

# Split df1 into board rows and non-board rows. Stripping "_Board" leaves
# each board row with the name of the district group it belongs to.
boards <- df1 %>%
  filter(grepl("Board", School_district)) %>%
  mutate(School_district = gsub("_Board", "", School_district))
df1 <- filter(df1, !grepl("Board", School_district))

# Look up how many districts in df1 match each board name
boards$num_splits <- map_int(boards$School_district,
                             ~ length(grep(.x, df1$School_district)))

# Add a new column: board revenue divided by the number of matching districts
boards <- transmute(boards,
                    match_name = School_district,
                    add_value = revenue / num_splits)

# If I knew how to use fuzzy_join, you could probably drop this part:
# build the join key by removing the "_1" / "_2" suffix from each district
df1$match_name <- gsub("_.*", "", df1$School_district)

# Attach each district's share of board revenue, add it to its own revenue,
# then join the grades
full_join(df1, boards, by = "match_name") %>%
  rowwise() %>%
  mutate(new_revenue = sum(revenue, add_value, na.rm = TRUE)) %>%
  select(-match_name) %>%
  full_join(df2, by = "School_district")
# A tibble: 4 × 5
  School_district revenue add_value new_revenue  grade
            <chr>   <int>     <dbl>       <dbl> <fctr>
1      Richland_1    8702       446        9148      A
2      Richland_2    3749       446        4195     A+
3      Charleston    6324        NA        6324      B
4      Greenville    1245       371        1616      D
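
If the real district names are too irregular for pattern matching, an approach closer to the dictionary idea in the question may be easier to maintain. Below is a minimal base R sketch, not part of the original answer. The `crosswalk` table, its `source`, `target` and `weight` columns, and the names `rev_df`, `grade_df`, `shares`, `totals` and `result` are all hypothetical, introduced only for illustration. The idea is to list a source row once per target it feeds, with weights that sum to 1 per source, so the Richland Board row appears twice with a weight of 0.5 each.

```r
# Sample data from the question, with underscores instead of spaces.
rev_df <- data.frame(
  School_district = c("Richland_1", "Richland_2", "Richland_Board",
                      "Charleston", "Greenville", "Greenville_Board"),
  revenue = c(8702, 3749, 892, 6324, 1245, 371),
  stringsAsFactors = FALSE
)
grade_df <- data.frame(
  School_district = c("Richland_1", "Richland_2", "Charleston", "Greenville"),
  grade = c("A", "A+", "B", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: one row per (source, target) pair.
# A source that feeds several targets is listed several times.
crosswalk <- data.frame(
  source = c("Richland_1", "Richland_2", "Richland_Board", "Richland_Board",
             "Charleston", "Greenville", "Greenville_Board"),
  target = c("Richland_1", "Richland_2", "Richland_1", "Richland_2",
             "Charleston", "Greenville", "Greenville"),
  weight = c(1, 1, 0.5, 0.5, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Guard: the weights of each source must sum to 1, otherwise revenue
# would be lost or double counted.
stopifnot(all(abs(tapply(crosswalk$weight, crosswalk$source, sum) - 1) < 1e-8))

# Attach every target to its source row, then scale revenue by the weight.
shares <- merge(rev_df, crosswalk, by.x = "School_district", by.y = "source")
shares$revenue <- shares$revenue * shares$weight

# Sum the shares per target district and attach the grades.
totals <- aggregate(revenue ~ target, data = shares, FUN = sum)
names(totals)[1] <- "School_district"
result <- merge(totals, grade_df, by = "School_district")
print(result)
```

On the sample data this gives the same totals as the output above (9148 for Richland_1 and 4195 for Richland_2), and total revenue is unchanged by the split. An uneven split would only require changing the weights.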