Question

I have a data frame, and I want to group it by user and find the sum of quantity.

library(data.table)
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',')

dt = data.table(x)

colnames(dt)
"dates_d" "user" "proj" "quantity"

The quantity column looks like this:

quantity
1
34
12
13
3
12
-
11
1

I have heard that the data.table library is very fast, so I would like to use it.

I have already done this in Python, but I don't know how to do it in R.

Answer 1

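For historical memory reasons, R (before version 4.0.0) reads strings in as factors by default. When a column contains a character-like entry such as "-", the whole column is read in as character data and therefore ends up as a factor. Now that RAM is more readily available, you can read the data in as strings first, so that the column stays a character vector rather than a factor.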

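Then use as.numeric to convert the strings to real values before summing. Strings that cannot be converted to numbers become NA, and na.rm=TRUE makes sum ignore those NAs.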
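The conversion step can be tried on its own with the values shown in the question. This is only a minimal sketch in base R; the variable names are made up for illustration.

```r
# Values copied from the quantity column in the question; "-" marks a missing entry.
quantity <- c("1", "34", "12", "13", "3", "12", "-", "11", "1")

# as.numeric() turns "-" into NA (R warns "NAs introduced by coercion").
qty_num <- suppressWarnings(as.numeric(quantity))

# na.rm = TRUE drops the NA before summing.
total <- sum(qty_num, na.rm = TRUE)
print(total)  # [1] 87

# If the column was read in as a factor, go through as.character() first:
# as.numeric() on a factor returns the internal level codes, not the values.
qty_from_factor <- suppressWarnings(as.numeric(as.character(factor(quantity))))
```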

Putting all of the above together:

library(data.table)
#you might want to check out the data.table::fread function to read the data directly as a data.table
x = read.table('C:/Users/user/Desktop/20180911_Dataset_b.csv',encoding = 'UTF-8',sep =',', stringsAsFactors=FALSE)

setDT(x)[, sum(as.numeric(quantity), na.rm=TRUE), by=.(user)]
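The code comment above points to data.table::fread. A rough sketch of that route follows, assuming the file has a header row with the four column names from the question. The rows here are invented stand-ins for the CSV, and the result column name total is an arbitrary choice.

```r
library(data.table)

# Hypothetical rows standing in for the CSV; only the column names come from the question.
csv_lines <- c(
  "dates_d,user,proj,quantity",
  "2018-09-01,alice,p1,1",
  "2018-09-02,alice,p2,34",
  "2018-09-03,bob,p1,-",
  "2018-09-04,bob,p2,11"
)

# fread() returns a data.table directly and does not create factors by default.
dt <- fread(text = csv_lines)

# Convert inside the grouped sum; "-" becomes NA and is dropped by na.rm = TRUE.
totals <- dt[, .(total = sum(suppressWarnings(as.numeric(quantity)), na.rm = TRUE)),
             by = user]
print(totals)
```

For the real data, fread can be given the file path in place of the text argument.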

Reference: a helpful comment by phiver on "Is there any good reason for columns to be characters instead of factors?", which links to Roger Peng's blog: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

Answer 2

library(dplyr)

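# mark the "-" placeholders as missing
dt[dt == "-" ] = NA

# as.character() first, so this works whether quantity is a factor or a character column
df <- dt %>% group_by(user) %>%
        summarise(qty = sum(as.numeric(as.character(quantity)), na.rm = TRUE))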

Use sum with na.rm = TRUE to total the quantities.
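When summarising, it is easy to mix up counting the non-missing values with adding them up. A small base R illustration, using made-up values for a single user:

```r
# Hypothetical quantities for one user, with one missing entry.
quantity <- c(1, 34, NA, 12)

# sum() over a logical vector counts the TRUEs, i.e. the non-missing values.
n_present <- sum(!is.na(quantity))

# sum() with na.rm = TRUE adds up the values and skips the NA.
total <- sum(quantity, na.rm = TRUE)

print(n_present)  # [1] 3
print(total)      # [1] 47
```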
