Counting grep matches across a table?

Time: 2018-05-29 17:17:21

Tags: r dplyr

Suppose I have a table with a single column:

Varieties

Cara Cara  
Mandarin  
Seville  
Juice Orange  
Tangerine  
(... 100+ varieties)  

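I have a second table in which one column lists the varieties grown at each orchard, separated by commas:

Orchard Name | City Name | State Name | Varieties grown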

Orchard 1 | City | State | Cara Cara, Mandarin, Juice, Tangerine  
Orchard 2 | City | State | Cara Cara  
Orchard 3 | City | State | Seville  
(... 1,000+ orchards)

What is the most efficient way to create a new table containing a summary count of how many orchards grow each variety:

Variety | Count

Cara Cara | 521  
Seville | 470  

(and so on, for each of the 100+ varieties)

Thanks in advance!

1 answer:

Answer 0: (score: 0)

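We can use separate_rows from tidyr to split Varieties.grown into multiple rows, then right-join with the varieties table to keep only the Varieties we're interested in. Finally, group_by(Varieties.grown) and count the non-NA Orchard.Name values to get the Count for each variety. A variety that no orchard grows survives the right join with an NA Orchard.Name, so it is reported with a count of 0: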

library(dplyr)
library(tidyr)

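df %>%
  # one row per (orchard, variety) pair; the regex also drops a space around each comma
  separate_rows(Varieties.grown, sep = "\\s?,\\s?") %>%
  # keep every row of varieties; unmatched varieties get NA in the orchard columns
  right_join(varieties, by = c("Varieties.grown"="Varieties")) %>%
  group_by(Varieties.grown) %>%
  # count orchards per variety, ignoring the NA rows of unmatched varieties
  summarize(Count = sum(!is.na(Orchard.Name))) %>%
  rename(Varieties = Varieties.grown)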

Result:

# A tibble: 5 x 2
     Varieties Count
         <chr> <int>
1    Cara Cara     2
2 Juice Orange     0
3     Mandarin     1
4      Seville     1
5    Tangerine     1
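
For comparison, the same split, filter and count idea can be sketched in base R without dplyr or tidyr. This is only an illustrative alternative, not part of the original answer: it rebuilds a trimmed copy of the toy data (the city and state columns are left out), and the names grown, counts and result are made up for the example. Using the known varieties as factor levels is what keeps zero-count varieties in the output and drops entries such as "Juice" that are not in the lookup table.

```r
# Trimmed copy of the answer's toy data (assumption: only the columns needed here).
df <- data.frame(
  Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"),
  Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine", "Cara Cara", "Seville"),
  stringsAsFactors = FALSE
)
varieties <- data.frame(
  Varieties = c("Cara Cara", "Mandarin", "Seville", "Juice Orange", "Tangerine"),
  stringsAsFactors = FALSE
)

# Split each orchard's comma-separated string and trim surrounding whitespace.
grown <- lapply(strsplit(df$Varieties.grown, ",", fixed = TRUE), trimws)

# De-duplicate within each orchard so an orchard counts at most once per variety.
grown <- lapply(grown, unique)

# Tabulate against the known varieties: values outside the levels become NA
# and are ignored by table(), while unused levels are kept with a count of 0.
counts <- table(factor(unlist(grown), levels = varieties$Varieties))

result <- data.frame(
  Varieties = names(counts),
  Count = as.integer(counts),
  stringsAsFactors = FALSE
)
print(result)
```

The rows come out in the order of the varieties table rather than alphabetically, which may or may not matter for the final report.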

Data:

df = structure(list(Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"
), City.Name = c("City", "City", "City"), State.Name = c("State", 
"State", "State"), Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine", 
"Cara Cara", "Seville")), class = "data.frame", .Names = c("Orchard.Name", 
"City.Name", "State.Name", "Varieties.grown"), row.names = c(NA, 
-3L))

varieties = structure(list(Varieties = c("Cara Cara", "Mandarin", "Seville", 
"Juice Orange", "Tangerine")), .Names = "Varieties", row.names = c(NA, 
-5L), class = "data.frame")
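
Since the question's title asks about grep, here is a rough sketch of what a direct grepl-based count could look like on the same toy data. It is not the answer's method, and the helper count_orchards and the grep_counts table are hypothetical names introduced for illustration. The main caveat is that this matches substrings: a variety whose name is contained in another variety's name would be over-counted, which is one reason splitting the string first is usually the safer choice.

```r
# Trimmed copy of the toy data (assumption: only the columns needed here).
df <- data.frame(
  Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"),
  Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine", "Cara Cara", "Seville"),
  stringsAsFactors = FALSE
)
varieties <- data.frame(
  Varieties = c("Cara Cara", "Mandarin", "Seville", "Juice Orange", "Tangerine"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of orchards whose list contains the variety name.
# fixed = TRUE treats the name as a literal string, not a regular expression.
count_orchards <- function(variety, grown) {
  sum(grepl(variety, grown, fixed = TRUE))
}

grep_counts <- data.frame(
  Varieties = varieties$Varieties,
  Count = vapply(varieties$Varieties, count_orchards, integer(1),
                 grown = df$Varieties.grown, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(grep_counts)
```

With 100+ varieties and 1,000+ orchards this runs one pass over the orchard column per variety, so it should still be fast, but the substring caveat above remains.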