I have two datasets: inds (individuals) and hhs (households).
I'm trying to sum inds$income grouped by UID (unique ID) and create a new column on hhs with the summed income of all individuals in a household. Some individuals don't have any income, in which case they have NA for that variable. The code I'm using is:
hhs <- left_join(hhs, inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm=TRUE)))
However, it's extremely slow. inds has over 2 million rows, hhs has about 550k rows. I've used dplyr to average or count similar variables in the same datasets, and it usually takes ten seconds or less to get this done. Is there anything I'm doing wrong? Is there any way to make it faster?
Edit: as I said, I've used dplyr to get mean household values without a problem with
hhs <- left_join(hhs, inds %>% filter(AGE > 2) %>% group_by(UID) %>% summarize(L_Bilingual = mean(Bilingual, na.rm=TRUE)))
It takes 5 seconds to calculate the means with the code above. Is there anything particular to sum() that makes it slower?
Answer 0 (score: 3)
With the fake data below, the summarize-and-join takes about 2 seconds on my machine, which is a new MacBook Pro. Even with a slower machine, it shouldn't take longer than maybe 10 or 15 seconds. Is it possible that the join is creating many more rows than you think? If you provide more details about the structure of your data, we can be more specific about what might be going wrong.
library(tidyverse)
library(microbenchmark)
# Generate two data frames with only the UID column in common
set.seed(2)
hhs = data.frame(UID=1:550000, replicate(30, runif(550000))) %>%
set_names(c("UID", paste0("V",1:30)))
inds = data.frame(UID=sample(1:550000, 2e6, replace=TRUE),
income=rnorm(2e6, 5e4, 1e4),
replicate(20, rnorm(2e6, 5e4, 1e4)))
microbenchmark(join=left_join(hhs, inds %>% group_by(UID) %>%
summarize(hhincome = sum(income, na.rm=TRUE))),
times=5)
Unit: seconds
 expr      min       lq     mean median       uq      max neval
 join 1.924749 1.988773 2.722018 2.0063 2.068044 5.622223     5
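One way to check the fan-out possibility raised above is to look for duplicate UIDs in hhs, since summarize() after group_by(UID) already yields one row per UID. A hypothetical diagnostic sketch (assumes hhs and inds are loaded as in the question):

```r
library(dplyr)

# If the join multiplies rows, hhs must contain duplicate UIDs;
# the summarized side is guaranteed one row per UID.
sum(duplicated(hhs$UID))   # should be 0 for a one-row-per-household table

# Confirm the join does not add rows
summed <- inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm = TRUE))
nrow(left_join(hhs, summed, by = "UID")) == nrow(hhs)
```

Passing an explicit by = "UID" also avoids the join silently matching on any other columns the two tables happen to share.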
Answer 1 (score: 0)
This is what was making it slow: I hadn't realized that inds$income had labels:
> head(inds$income)
<Labelled double>: Earned income
[1] 0 0 0 0 258000 0
Labels:
value label
99999998 Unknown/missing.
99999999 NIU (not in universe).
When I removed the labels (by recoding the column with as.numeric()), the problem was fixed.
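A sketch of that fix using haven (an assumption: the column is a haven labelled vector, as the `<Labelled double>` printout suggests; zap_labels() strips the labels, and the sentinel codes 99999998/99999999 shown above should probably become NA before summing, or they would inflate the totals):

```r
library(haven)
library(dplyr)

# Reproduce a small labelled vector like inds$income (values from the printout above)
income <- labelled(c(0, 0, 0, 0, 258000, 99999998),
                   labels = c("Unknown/missing." = 99999998,
                              "NIU (not in universe)." = 99999999),
                   label = "Earned income")

# Strip the labels so plain numeric methods are dispatched
plain <- as.numeric(zap_labels(income))

# Recode the missing-data sentinel values to NA before aggregating
plain <- na_if(na_if(plain, 99999998), 99999999)
sum(plain, na.rm = TRUE)
```

The same zap_labels() call can be applied to the whole column (inds$income <- as.numeric(zap_labels(inds$income))) before the group_by/summarize step.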