I have the following data frame:
dat <- read_csv(
"s1,s2,v1,v2
a,b,10,20
a,b,22,NA
a,b,13,33
c,d,3,NA
c,d,4.5,NA
c,d,10,20"
)
dat
#> # A tibble: 6 x 4
#> s1 s2 v1 v2
#> <chr> <chr> <dbl> <int>
#> 1 a b 10.0 20
#> 2 a b 22.0 NA
#> 3 a b 13.0 33
#> 4 c d 3.0 NA
#> 5 c d 4.5 NA
#> 6 c d 10.0 20
What I want to do is count, within each group of s1 and s2, the total number of rows where v1 passes a filter, and among those, the number of rows where v2 is not NA. For example, with v1_filter >= 0 we get this:
s1 s2 total_line non_na_line
a b 3 2
c d 3 1
With v1_filter >= 10, we get this:
s1 s2 total_line non_na_line
a b 2 1
c d 1 1
How can I achieve this with data.table or dplyr?
In practice, dat has about 31M rows, so we need a fast approach. Here is where I'm stuck:
library(data.table)
dat <- data.table(dat)
v1_filter = 0
dat[, v1 >= v1_filter,
    by = list(s1, s2)]
Answer 0 (score: 3)
Using sum should help here. When applied to a logical vector, it treats each TRUE as 1 and each FALSE as 0, so you can do this easily:
dat %>%
  group_by(s1, s2) %>%
  summarise(total_lines = n(),
            non_na_line = sum(!is.na(v2)))
# A tibble: 2 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a b 3 2
2 c d 3 1
You can easily add a filter between group_by and summarise to get exactly what you are after. Note that summarise only keeps the columns you grouped by, plus the summary columns it creates.
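As a sketch of that, with the threshold held in a variable (v1_filter, the name from the question) and using a strict > as in the benchmarks below:

```r
library(dplyr)

v1_filter <- 10  # placeholder threshold; change as needed

dat %>%
  group_by(s1, s2) %>%
  filter(v1 > v1_filter) %>%            # filter within each group
  summarise(total_lines = n(),          # rows passing the v1 filter
            non_na_line = sum(!is.na(v2)))  # of those, rows where v2 is present
```

Because the filter runs after group_by, it simply narrows the rows each group's summaries are computed over.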
For what it's worth, I ran a quick benchmark with some test data similar to yours.
s1charMix <- rep(letters[seq(from = 1, to = 10)], length.out = 30000000)
s2charMix <- rep(letters[seq(from = 11, to = 20)], length.out = 30000000)
s1chars <- sample(s1charMix, 30000000)
s2chars <- sample(s2charMix, 30000000)
v1Nums <- runif(30000000, min = 0, max = 20)
nomissing <- sample(1:200000,1)
int.mix <- rbinom(30000000 - nomissing, 30, 0.3)
nalist <- rep(NA, nomissing)
v2NumsNA <- sample(x = c(int.mix, nalist), 30000000)
df <- data_frame(s1 = s1chars, s2 = s2chars, v1 = v1Nums, v2 = v2NumsNA)
This should roughly replicate the size and type of the data you describe:
df
# A tibble: 30,000,000 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d s 9.2123603 7
2 b q 16.6638639 11
3 g o 18.3682028 11
4 g s 0.8779067 9
5 a s 0.0719127 10
6 b q 16.8809193 12
7 h q 15.4382455 6
8 e k 2.3565489 11
9 h p 16.4508811 9
10 d n 2.7283823 11
# ... with 29,999,990 more rows
df %>%
filter(is.na(v2))
# A tibble: 116,924 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 d r 13.1448988 NA
2 b o 0.2703848 NA
3 b t 18.8319385 NA
4 a s 11.6448437 NA
5 j m 0.5388760 NA
6 i k 8.7098427 NA
7 d s 6.1149735 NA
8 h p 2.5552694 NA
9 g r 0.9057442 NA
10 b s 19.8886830 NA
# ... with 116,914 more rows
Now, let's benchmark the dplyr operation against data.table:
### dplyr
df %>%
filter(v1 > 10) %>%
group_by(s1, s2) %>%
summarise(total_lines = n(),
non_na_line = sum(!is.na(v2)))
# A tibble: 100 x 4
# Groups: s1 [?]
s1 s2 total_lines non_na_line
<chr> <chr> <int> <int>
1 a k 150327 149734
2 a l 149655 149062
3 a m 149794 149200
4 a n 149771 149197
5 a o 149495 148942
...
> system.time(df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
1.848 0.420 2.290
> system.time(for (i in 1:100) df %>% filter(v1 > 10) %>% group_by(s1, s2) %>% summarise(total_lines = n(), non_na_line = sum(!is.na(v2))))
user system elapsed
187.657 55.878 245.528
### Data.table
library(data.table)
dat <- data.table(df)
> dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0]
s1 s2 N i.N
1: b q 149968 149348
2: g o 150411 149831
3: h q 150132 149563
4: h p 150786 150224
5: e o 149951 149353
...
> system.time(dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
2.027 0.228 2.271
> system.time(for (i in 1:100) dat[v1 > 10, .N, by = .(s1, s2)][dat[v1 > 10 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0])
user system elapsed
213.281 43.949 261.664
TL;DR: dplyr and data.table are about equally fast; if anything, dplyr is slightly faster.
Answer 1 (score: 2)
> library(readr)
> dat <- read_csv(
+ "s1,s2,v1,v2
+ a,b,10,20
+ a,b,22,NA
+ a,b,13,33
+ c,d,3,NA
+ c,d,4.5,NA
+ c,d,10,20"
+ )
>
> dat
# A tibble: 6 x 4
s1 s2 v1 v2
<chr> <chr> <dbl> <int>
1 a b 10.0 20
2 a b 22.0 NA
3 a b 13.0 33
4 c d 3.0 NA
5 c d 4.5 NA
6 c d 10.0 20
Use data.table, since you have big data:
> library(data.table)
data.table 1.10.4
> dat=data.table(dat)
Without removing NAs, keeping the v1 filter at 0.1:
> dat1=dat[v1>0.1,.N,.(s1,s2)]
> dat1
s1 s2 N
1: a b 3
2: c d 3
Removing the v2 NAs, keeping the v1 filter at 0.1:
> dat2=dat[v1>0.1 & !is.na(v2), .N, .(s1,s2)]
> dat2
s1 s2 N
1: a b 2
2: c d 1
Combining the two, keeping the v1 filter at 0:
> dat[v1 > 0, .N, by = .(s1, s2)][ dat[v1 > 0 & !is.na(v2), .N, by = .(s1, s2)] , on = c("s1", "s2") , nomatch = 0 ]
s1 s2 N i.N
1: a b 3 2
2: c d 3 1
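For what it's worth, both counts can also be computed in a single grouped pass, avoiding the self-join entirely (a sketch; v1_filter stands in for whatever threshold you want):

```r
library(data.table)

v1_filter <- 0
dat[v1 > v1_filter,
    .(total_line = .N,                   # rows passing the v1 filter
      non_na_line = sum(!is.na(v2))),    # of those, rows where v2 is present
    by = .(s1, s2)]
```

This scans the table once per group instead of grouping twice and joining, which may matter at 31M rows.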