Suppose I have a table with a single column:
Varieties
Cara Cara
Mandarin
Seville
Juice Orange
Tangerine
(... 100+ varieties)
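I also have a table in which one column lists the varieties grown at each orchard, separated by commas:
Orchard Name | City Name | State Name | Varieties grown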
Orchard 1 | City | State | Cara Cara, Mandarin, Juice, Tangerine
Orchard 2 | City | State | Cara Cara
Orchard 3 | City | State | Seville
(... 1,000+ orchards)
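What is the most efficient way to create a new table containing a summary count of the number of orchards that grow each variety:
Varieties | Count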
Cara Cara | 521
Seville | 470
(and so on, for each of the 100+ varieties)
Thanks in advance!
Answer 0 (score: 0)
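We can use separate_rows from tidyr to split Varieties.grown into multiple rows, then right join with the varieties table to keep only the Varieties we're interested in. Finally, group_by(Varieties.grown) and count all non-NA Orchard.Name values to get the Count for each variety: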
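library(dplyr)
library(tidyr)

df %>%
  # one row per orchard/variety pair; the regex also drops a space around each comma
  separate_rows(Varieties.grown, sep = "\\s?,\\s?") %>%
  # keep every variety in the lookup table, even those no orchard grows
  right_join(varieties, by = c("Varieties.grown" = "Varieties")) %>%
  group_by(Varieties.grown) %>%
  # varieties without a matching orchard have NA in Orchard.Name, so they count as 0
  summarize(Count = sum(!is.na(Orchard.Name))) %>%
  rename(Varieties = Varieties.grown)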
Result:
# A tibble: 5 x 2
Varieties Count
<chr> <int>
1 Cara Cara 2
2 Juice Orange 0
3 Mandarin 1
4 Seville 1
5 Tangerine 1
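Juice Orange has a count of 0 in the sample output because Orchard 1 lists it as "Juice", which does not match "Juice Orange" in the varieties table. If the real data contains similar spelling differences, it may help to list the entries that fail to match before trusting the counts. The following is only a sketch under that assumption: it rebuilds the sample data inline, and the name unmatched is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt inline so the snippet runs on its own
df <- data.frame(
  Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"),
  Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine", "Cara Cara", "Seville"),
  stringsAsFactors = FALSE
)
varieties <- data.frame(
  Varieties = c("Cara Cara", "Mandarin", "Seville", "Juice Orange", "Tangerine"),
  stringsAsFactors = FALSE
)

# Orchard/variety pairs whose variety name is absent from the lookup table
unmatched <- df %>%
  separate_rows(Varieties.grown, sep = "\\s?,\\s?") %>%
  anti_join(varieties, by = c("Varieties.grown" = "Varieties")) %>%
  distinct(Orchard.Name, Varieties.grown)

print(unmatched)
```

Any rows returned here are silently dropped by the right join, so they are worth correcting in the source data or the lookup table first.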
Data:
df = structure(list(Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"
), City.Name = c("City", "City", "City"), State.Name = c("State",
"State", "State"), Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine",
"Cara Cara", "Seville")), class = "data.frame", .Names = c("Orchard.Name",
"City.Name", "State.Name", "Varieties.grown"), row.names = c(NA,
-3L))
varieties = structure(list(Varieties = c("Cara Cara", "Mandarin", "Seville",
"Juice Orange", "Tangerine")), .Names = "Varieties", row.names = c(NA,
-5L), class = "data.frame")
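If adding dplyr and tidyr is not an option, roughly the same counts can be produced in base R. This is a sketch rather than a drop-in replacement: it rebuilds the sample data with data.frame(), the names grown, counts and result are introduced here, and it additionally assumes that a variety listed twice for the same orchard should count only once.

```r
# Sample data rebuilt inline so the snippet runs on its own
df <- data.frame(
  Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"),
  Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine", "Cara Cara", "Seville"),
  stringsAsFactors = FALSE
)
varieties <- data.frame(
  Varieties = c("Cara Cara", "Mandarin", "Seville", "Juice Orange", "Tangerine"),
  stringsAsFactors = FALSE
)

# Split each orchard's comma-separated string into a character vector
grown <- strsplit(trimws(df$Varieties.grown), "\\s*,\\s*")

# De-duplicate within each orchard so a repeated variety counts once
grown <- lapply(grown, unique)

# Using the lookup table as factor levels keeps zero-count varieties
# and discards names that are not in the table (they become NA)
counts <- table(factor(unlist(grown), levels = varieties$Varieties))

result <- data.frame(
  Varieties = names(counts),
  Count = as.integer(counts),
  stringsAsFactors = FALSE
)
print(result)
```

Unlike the dplyr version, the rows come back in the order of the varieties table rather than alphabetically.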