如何在dplyr语句中排除非数字列

时间:2017-03-06 22:06:23

标签: r dplyr tidyr

我正在尝试解决以下问题:查找Pop_Size_Group的每个值的每个数字列的平均值。我需要找出一种排除任何非数字变量的有效方法。

这是我到目前为止所拥有的:

library(dplyr)
    df <- tbl_df(Demographics)

df %>%
  group_by(Pop_Size_Group) %>%
  summarise_each(funs(mean(., na.rm = TRUE)))

代码生成:

> df <- tbl_df(Demographics)
> df %>%
+   group_by(Pop_Size_Group) %>%
+   summarise_each(funs(mean(., na.rm = TRUE)))

# A tibble: 3 × 18
  Pop_Size_Group County_name State Region_num Location Square_miles Population Pct_Age18_to_34 Pct_65_or_over
           <chr>       <lgl> <lgl>      <dbl>    <lgl>        <dbl>      <dbl>           <dbl>          <dbl>
1          Large          NA    NA   2.492958       NA    1239.3099   847193.0        28.96338       12.06197
2         Medium          NA    NA   2.465409       NA     861.3711   224348.6        28.30252       12.31572
3          Small          NA    NA   2.424460       NA    1045.1871   121956.6        28.46906       12.11295
# ... with 9 more variables: Num_physicians <dbl>, Num_hospital_beds <dbl>, Num_serious_crimes <dbl>,
#   Pct_High_Sch_grads <dbl>, Pct_Bachelors <dbl>, Pct_below_poverty <dbl>, Pct_unemployed <dbl>,
#   Per_cap_income <dbl>, Total_personal_income <dbl>


Warning messages:
1: In mean.default(c("Los_Angeles", "Cook", "Harris", "San_Diego",  :
  argument is not numeric or logical: returning NA
2: In mean.default(c("Pulaski", "Guilford", "Solano", "York", "Berks",  :
  argument is not numeric or logical: returning NA
3: In mean.default(c("Bibb", "Onslow", "Jackson", "Schenectady", "Rock_Island",  :
  argument is not numeric or logical: returning NA
4: In mean.default(c("CA", "IL", "TX", "CA", "CA", "NY", "AZ", "MI",  :
  argument is not numeric or logical: returning NA
5: In mean.default(c("AR", "NC", "CA", "PA", "PA", "NH", "TN", "FL",  :
  argument is not numeric or logical: returning NA
6: In mean.default(c("GA", "NC", "MI", "NY", "IL", "OH", "CA", "ME",  :
  argument is not numeric or logical: returning NA
7: In mean.default(c("West", "East", "West", "West", "West", "East",  :
  argument is not numeric or logical: returning NA
8: In mean.default(c("West", "East", "West", "East", "East", "East",  :
  argument is not numeric or logical: returning NA
9: In mean.default(c("East", "East", "East", "East", "East", "East",  :
  argument is not numeric or logical: returning NA

以下是glimpse(df)的输出供参考:

> glimpse(df)
Observations: 440
Variables: 18
$ County_name           <chr> "Los_Angeles", "Cook", "Harris", "San_Diego", "Orange", "Kings", "Maricopa", "W...
$ State                 <chr> "CA", "IL", "TX", "CA", "CA", "NY", "AZ", "MI", "FL", "TX", "PA", "WA", "CA", "...
$ Region_num            <int> 4, 2, 3, 4, 4, 1, 4, 2, 3, 3, 1, 4, 4, 4, 2, 1, 1, 1, 1, 4, 3, 3, 4, 3, 2, 4, 2...
$ Location              <chr> "West", "East", "West", "West", "West", "East", "West", "East", "East", "West",...
$ Square_miles          <int> 4060, 946, 1729, 4205, 790, 71, 9204, 614, 1945, 880, 135, 2126, 1291, 20062, 4...
$ Population            <int> 8863164, 5105067, 2818199, 2498016, 2410556, 2300664, 2122101, 2111687, 1937094...
$ Pop_Size_Group        <chr> "Large", "Large", "Large", "Large", "Large", "Large", "Large", "Large", "Large"...
$ Pct_Age18_to_34       <dbl> 32.1, 29.2, 31.3, 33.5, 32.6, 28.3, 29.2, 27.4, 27.1, 32.6, 29.1, 30.1, 32.6, 3...
$ Pct_65_or_over        <dbl> 9.7, 12.4, 7.1, 10.9, 9.2, 12.4, 12.5, 12.5, 13.9, 8.2, 15.2, 11.1, 8.7, 8.8, 1...
$ Num_physicians        <int> 23677, 15153, 7553, 5905, 6062, 4861, 4320, 3823, 6274, 4718, 6641, 5280, 4101,...
$ Num_hospital_beds     <int> 27700, 21550, 12449, 6179, 6369, 8942, 6104, 9490, 8840, 6934, 10494, 4009, 334...
$ Num_serious_crimes    <int> 688936, 436936, 253526, 173821, 144524, 680966, 177593, 193978, 244725, 214258,...
$ Pct_High_Sch_grads    <dbl> 70.0, 73.4, 74.9, 81.9, 81.2, 63.7, 81.5, 70.0, 65.0, 77.1, 64.3, 88.2, 82.0, 7...
$ Pct_Bachelors         <dbl> 22.3, 22.8, 25.4, 25.3, 27.8, 16.6, 22.1, 13.7, 18.8, 26.3, 15.2, 32.8, 32.6, 1...
$ Pct_below_poverty     <dbl> 11.6, 11.1, 12.5, 8.1, 5.2, 19.5, 8.8, 16.9, 14.2, 10.4, 16.1, 5.0, 5.0, 10.3, ...
$ Pct_unemployed        <dbl> 8.0, 7.2, 5.7, 6.1, 4.8, 9.5, 4.9, 10.0, 8.7, 6.1, 8.0, 4.6, 5.5, 8.0, 5.5, 7.3...
$ Per_cap_income        <int> 20786, 21729, 19517, 19588, 24400, 16803, 18042, 17461, 17823, 21001, 16721, 23...
$ Total_personal_income <int> 184230, 110928, 55003, 48931, 58818, 38658, 38287, 36872, 34525, 38911, 26512, ...

以下是数据的链接: example data

1 个答案:

答案 0 :(得分:5)

您可以使用dplyr&#39; select_if功能:

df %>% select_if(is.numeric)

或正如Mislav在评论中所建议的那样,使用summarise_if直接查看摘要。

df %>% 
  group_by(Pop_Size_Group) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE)