How to group_by and then summarise the rows that are NA in every column

Asked: 2019-04-25 09:38:10

Tags: r dplyr

I'm not sure whether this is possible. I want to be able to use summarise to count all the rows that have NA in every column except the group_by column. It can be done by putting all 5 conditions on the line where NO_OL_Percent = is, joining the check for each column with &. If you can do it in SQL, I'd think you can do it with dplyr or purrr, but it seems no one on the internet has tried this.

The data must be downloaded here.

The code is below. It works, but is there really no way to use an all-type function on that last line of code? I need to be able to group_by first, so I can't use dplyr's filter_all.

library(dplyr)

# read the export, treating "NA", "NaN" and empty strings as missing values
farmers_market = read.csv("Export.csv", stringsAsFactors = F, na.strings=c("NA","NaN", ""))

farmers_market %>% 
        select(c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State")) %>%
        group_by(State) %>%
        summarise(Num_Markets = n(),
                  FB_Percent = 100 - 100*sum(is.na(Facebook))/n(), 
                  TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
                  #fb=sum(is.na(Facebook)),
                  OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
                  NO_OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter) & is.na(Website) & is.na(Youtube) & is.na(OtherMedia))/n()
                  )

2 Answers:

Answer 0 (score: 1)

Since we are summarising, I removed the select statement; the summarise keeps only the relevant columns anyway. I created a vector, cols, of the columns in which we want to count the NAs.

We first check each row for whether it has NA in all of the cols columns and assign the resulting TRUE/FALSE to a new column, all_NA. We then group_by State and carry out the calculations for the other columns as before, but for NO_OL_Percent we sum all_NA to get the number of all-NA rows in each group and divide that by the total number of rows in the group.

library(dplyr)

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>% 
   # TRUE for a row only when every one of the cols columns is NA
   mutate(all_NA = rowSums(is.na(.[cols])) == length(cols)) %>%
   group_by(State) %>%
   summarise(Num_Markets = n(),
             FB_Percent = 100 - 100*sum(is.na(Facebook))/n(), 
             TW_Percent = 100 - 100*sum(is.na(Twitter))/n(),
             OL_Percent = 100 - 100*sum(is.na(Facebook) & is.na(Twitter))/n(),
             NO_OL_Percent = 100 - 100*sum(all_NA)/n())


#    State                Num_Markets FB_Percent TW_Percent OL_Percent NO_OL_Percent
#    <chr>                      <int>      <dbl>      <dbl>      <dbl>         <dbl>
# 1 Alabama                      139       25.9       5.76       25.9          37.4
# 2 Alaska                        38       42.1      10.5        42.1          65.8
# 3 Arizona                       92       57.6      27.2        57.6          80.4
# 4 Arkansas                     111       52.3       4.50       52.3          61.3
# 5 California                   759       41.5      14.5        43.2          70.1
# 6 Colorado                     161       44.1       9.94       44.1          82.6
# 7 Connecticut                  157       33.8      12.1        33.8          53.5
# 8 Delaware                      36       61.1      11.1        61.1          83.3
# 9 District of Columbia          57       50.9      43.9        50.9          87.7
#10 Florida                      262       43.1       8.78       43.1          83.2
# … with 43 more rows

This gives the same output as your current approach, but without having to write out all the column names by hand.
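
As a side note, later dplyr releases (1.0.4 and up) added if_all(), which is essentially the all-style function the question asks for. A minimal sketch, assuming one of those newer versions is installed, that drops the mutate() step entirely:

cols <- c("Website", "Facebook", "Twitter", "Youtube", "OtherMedia")

farmers_market %>%
    group_by(State) %>%
    summarise(Num_Markets = n(),
              # if_all() is TRUE for a row only when is.na() holds in every listed column
              NO_OL_Percent = 100 - 100*sum(if_all(all_of(cols), is.na))/n())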

Answer 1 (score: 0)

A direct way to get the Percent columns is:

farmers_market %>% 
    select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
    group_by(State) %>% 
    summarise_all(funs("Percent" = sum(is.na(.))/n()))

# A tibble: 53 x 6
#  State   Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
#  <chr>             <dbl>            <dbl>           <dbl>           <dbl>              <dbl>
#1 Alabama           0.727            0.741           0.942           0.993              0.964
#2 Alaska            0.447            0.579           0.895           1                  0.974
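
Note that funs() has since been deprecated in dplyr; from version 1.0.0 onward the idiomatic spelling uses across(). A rough equivalent, as a sketch rather than a tested drop-in:

farmers_market %>%
    select(Website, Facebook, Twitter, Youtube, OtherMedia, State) %>%
    group_by(State) %>%
    # mean(is.na(.x)) equals sum(is.na(.))/n(); .names appends the _Percent suffix
    summarise(across(everything(), ~ mean(is.na(.x)), .names = "{.col}_Percent"))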

To add the num_markets column, you can do the following:

farmers_market %>% 
    select("Website", "Facebook", "Twitter", "Youtube", "OtherMedia", "State") %>%
    group_by(State) %>% 
    mutate(num_markets = n()) %>% 
    group_by(State, num_markets) %>% 
    summarise_all(funs("Percent" = sum(is.na(.))/n()))

# A tibble: 53 x 7
# Groups:   State [2]
#  State   num_markets Website_Percent Facebook_Percent Twitter_Percent Youtube_Percent OtherMedia_Percent
#  <chr>         <int>           <dbl>            <dbl>           <dbl>           <dbl>              <dbl>
#1 Alabama         139           0.727            0.741           0.942           0.993              0.964
#2 Alaska           38           0.447            0.579           0.895           1                  0.974
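
On those newer dplyr versions, the mutate()/double group_by() workaround can also be folded into a single summarise(), since across() can sit next to ordinary summary expressions. A sketch under the same assumptions as above:

farmers_market %>%
    group_by(State) %>%
    summarise(num_markets = n(),
              across(c(Website, Facebook, Twitter, Youtube, OtherMedia),
                     ~ mean(is.na(.x)), .names = "{.col}_Percent"))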