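I have a data frame with a large number of columns. For each row, I want a count of how many columns are NA (or, as in the sample below, how many are not NA). The catch is that I only care about some of the columns and would like an efficient way to name them.
Using mutate() the way I do in the fake sample below gives me the right answer.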
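library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # supplies the `fruit` and `words` sample vectors
df <- tibble(
    id = 1:10
  , name = fruit[1:10]
  , word1 = c(words[1:5], NA, words[7:10])
  , word2 = words[11:20]
  , word3 = c(NA, NA, NA, words[25], NA, NA, words[32], NA, NA, words[65])
) %>%
  mutate(
    # one term per column: 1 if the value is present, 0 if it is NA
    n_words =
      as.numeric(!is.na(word1)) +
      as.numeric(!is.na(word2)) +
      as.numeric(!is.na(word3))
  )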
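However, even with a toy example like this, it is painful to type and read, and once I have more than 3 columns to count it is basically useless. Is there a more R / dplyr-y way to write this, perhaps using select()-style syntax (e.g. n_words = !count_blank(word1:word3))?
I considered using summarize() without grouping, but I need the data in the columns I am counting, and if I add them to group_by() I am in the same boat of naming nearly all the columns.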
Answer 0: (score: 5)
You can use is.na() on the selected columns and then take rowSums() of the result:
library(dplyr)
library(stringr)
df <- tibble(
    id = 1:10
  , name = fruit[1:10]
  , word1 = c(words[1:5], NA, words[7:10])
  , word2 = words[11:20]
  , word3 = c(NA, NA, NA, words[25], NA, NA, words[32], NA, NA, words[65]))
# columns 3:5 are word1..word3; count the non-NA entries in each row
df$n_words <- rowSums(!is.na(df[, 3:5]))
df
      id name         word1    word2     word3   n_words
   <int> <chr>        <chr>    <chr>     <chr>     <dbl>
 1     1 apple        a        actual    <NA>          2
 2     2 apricot      able     add       <NA>          2
 3     3 avocado      about    address   <NA>          2
 4     4 banana       absolute admit     agree         3
 5     5 bell pepper  accept   advertise <NA>          2
 6     6 bilberry     <NA>     affect    <NA>          1
 7     7 blackberry   achieve  afford    alright       3
 8     8 blackcurrant across   after     <NA>          2
 9     9 blood orange act      afternoon <NA>          2
10    10 blueberry    active   again     awful         3
With dplyr you can do it like this:
# returns the per-row counts as a plain numeric vector
df %>%
  select(3:5) %>%
  is.na %>%
  `!` %>%
  rowSums
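The same rowSums() idea can also stay inside mutate() with select()-style column ranges, which is close to what the question asks for. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where across() was introduced); df_counted is a name chosen here for illustration, not one from the thread. Newer dplyr releases (1.1.0 and up) also offer pick() for selecting columns this way.

```r
library(dplyr)
library(stringr)  # supplies the `fruit` and `words` sample vectors

# Same sample data as in the question
df <- tibble(
  id    = 1:10,
  name  = fruit[1:10],
  word1 = c(words[1:5], NA, words[7:10]),
  word2 = words[11:20],
  word3 = c(NA, NA, NA, words[25], NA, NA, words[32], NA, NA, words[65])
)

# across() selects word1..word3 with select()-style syntax and returns them
# as a tibble; is.na() turns that into a logical matrix, `!` flips it, and
# rowSums() counts the non-NA entries in each row.
df_counted <- df %>%
  mutate(n_words = rowSums(!is.na(across(word1:word3))))

df_counted
```

Any tidyselect helper should work in place of word1:word3, for example starts_with("word"), so the call does not grow with the number of columns.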
Answer 1: (score: 1)
Another dplyr solution:
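library(dplyr)
library(stringr)
## define count function: number of non-NA values in a vector
count_non_na <- function(x) sum(!is.na(x))
## apply it to each row (MARGIN = 1) of the word columns
df$n_words <- df %>%
  select(starts_with("word")) %>%
  apply(., 1, count_non_na)
df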
# A tibble: 10 × 6
      id name         word1    word2     word3   n_words
   <int> <chr>        <chr>    <chr>     <chr>     <int>
 1     1 apple        a        actual    <NA>          2
 2     2 apricot      able     add       <NA>          2
 3     3 avocado      about    address   <NA>          2
 4     4 banana       absolute admit     agree         3
 5     5 bell pepper  accept   advertise <NA>          2
 6     6 bilberry     <NA>     affect    <NA>          1
 7     7 blackberry   achieve  afford    alright       3
 8     8 blackcurrant across   after     <NA>          2
 9     9 blood orange act      afternoon <NA>          2
10    10 blueberry    active   again     awful         3
Answer 2: (score: 1)
library(dplyr)
library(stringr)
library(purrrlyr)  # by_row() now lives in purrrlyr rather than purrr
df <- tibble(
    id = 1:10
  , name = fruit[1:10]
  , word1 = c(words[1:5], NA, words[7:10])
  , word2 = words[11:20]
  , word3 = c(NA, NA, NA, words[25], NA, NA, words[32], NA, NA, words[65])
)
# Row-wise count of NAs across all columns
df %>% by_row(~ sum(is.na(.)), .collate = 'cols')
# Row-wise count of non-NAs in the word columns only
df %>%
  select(starts_with('word')) %>%
  by_row(~ sum(!is.na(.)), .collate = 'cols')
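A row-wise count like this can also be sketched with dplyr alone, assuming dplyr 1.0.0 or later for rowwise() together with c_across(). This is an alternative technique to the one in the answer, and df_rowwise is an illustrative name only.

```r
library(dplyr)
library(stringr)  # supplies the `fruit` and `words` sample vectors

# Same sample data as in the question
df <- tibble(
  id    = 1:10,
  name  = fruit[1:10],
  word1 = c(words[1:5], NA, words[7:10]),
  word2 = words[11:20],
  word3 = c(NA, NA, NA, words[25], NA, NA, words[32], NA, NA, words[65])
)

# rowwise() makes mutate() evaluate one row at a time; c_across() gathers the
# selected columns of the current row into a vector, so sum(!is.na(...))
# counts that row's non-NA word columns.
df_rowwise <- df %>%
  rowwise() %>%
  mutate(n_words = sum(!is.na(c_across(starts_with("word"))))) %>%
  ungroup()  # drop the row-wise grouping again

df_rowwise
```

Unlike the by_row() pipeline, this keeps the id and name columns alongside the count. On wide or long data, row-by-row evaluation is likely to be slower than a vectorised rowSums() call.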