My table looks like this:
df <- read.table(text =
" Day location gender hashtags
'Feb 19 2016' 'UK' 'M' '#a'
'Feb 19 2016' 'UK' 'M' '#b'
'Feb 19 2016' 'SP' 'F' '#a'
'Feb 19 2016' 'SP' 'F' '#b'
'Feb 19 2016' 'SP' 'M' '#a'
'Feb 19 2016' 'SP' 'M' '#b'
'Feb 20 2016' 'UK' 'F' '#a'",
header = TRUE, stringsAsFactors = FALSE)
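I would like to compute frequencies by day/hashtag, broken down by location and gender, so that the resulting table looks like this: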
Day          hashtags  Daily_Freq  men  women  Freq_UK  Freq_SP
Feb 19 2016  #a        3           2    1      1        2
Feb 19 2016  #b        3           2    1      1        2
Feb 20 2016  #a        1           0    1      1        0
where Daily_Freq = men + women = Freq_UK + Freq_SP. How can I do this?
Answer 0 (score: 6)
Using dplyr:
library(dplyr)
df %>%
group_by(Day, hashtags) %>%
summarise(Daily_Freq = n(),
men = sum(gender == 'M'),
women = sum(gender == 'F'),
Freq_UK = sum(location == 'UK'),
Freq_SP = sum(location == 'SP'))
which gives:
# A tibble: 3 x 7
# Groups:   Day [?]
  Day         hashtags Daily_Freq   men women Freq_UK Freq_SP
  <chr>       <chr>         <int> <int> <int>   <int>   <int>
1 Feb 19 2016 #a                3     2     1       1       2
2 Feb 19 2016 #b                3     2     1       1       2
3 Feb 20 2016 #a                1     0     1       1       0
The same logic implemented in data.table:
library(data.table)
setDT(df)[, .(Daily_Freq = .N,
men = sum(gender == 'M'),
women = sum(gender == 'F'),
Freq_UK = sum(location == 'UK'),
Freq_SP = sum(location == 'SP'))
, by = .(Day, hashtags)]
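The identity stated in the question (Daily_Freq = men + women = Freq_UK + Freq_SP) can be checked without either package. What follows is only a minimal base R sketch, not part of the original answer: the helper data frame `ind` and the result name `res` are names assumed here for illustration, and the example data is rebuilt with `data.frame()` so the snippet runs on its own.

```r
# Example data from the question, rebuilt by hand
df <- data.frame(
  Day      = c(rep("Feb 19 2016", 6), "Feb 20 2016"),
  location = c("UK", "UK", "SP", "SP", "SP", "SP", "UK"),
  gender   = c("M", "M", "F", "F", "M", "M", "F"),
  hashtags = c("#a", "#b", "#a", "#b", "#a", "#b", "#a"),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per quantity to be counted
# (`ind` is a hypothetical helper, not from the original answer)
ind <- data.frame(
  Day        = df$Day,
  hashtags   = df$hashtags,
  Daily_Freq = 1L,
  men        = as.integer(df$gender == "M"),
  women      = as.integer(df$gender == "F"),
  Freq_UK    = as.integer(df$location == "UK"),
  Freq_SP    = as.integer(df$location == "SP"),
  stringsAsFactors = FALSE
)

# Summing the indicators within each Day/hashtags group gives the counts
res <- aggregate(cbind(Daily_Freq, men, women, Freq_UK, Freq_SP) ~ Day + hashtags,
                 data = ind, FUN = sum)

# Both breakdowns must add up to the daily total
stopifnot(all(res$Daily_Freq == res$men + res$women),
          all(res$Daily_Freq == res$Freq_UK + res$Freq_SP))
res
```

The same `stopifnot()` checks could presumably be applied to the dplyr or data.table result, since the column names match.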
Answer 1 (score: 4)
One way:
library(data.table)
setDT(df)
df[, gender := as.factor(gender)]
df[, location := as.factor(location)]
df[, c(
N = .N,
dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]
# Day hashtags N F M SP UK
# 1: Feb 19 2016 #a 3 1 2 2 1
# 2: Feb 19 2016 #b 3 1 2 2 1
# 3: Feb 20 2016 #a 1 1 0 0 1
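Coding it this way is probably easier to maintain: no column names need to be assigned by hand; locations and genders show up in or drop out of the result depending on whether they appear in the raw data; and column names do not need to be typed in multiple places (after the conversion to factor).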
If a country code happens to coincide with a gender code, this approach produces duplicate column names. To get around that:
df[, c(
N = .N,
gender = dcast(.SD, . ~ gender, fun.agg = length, drop=FALSE)[, !"."],
loc = dcast(.SD, . ~ location, fun.agg = length, drop=FALSE)[, !"."]
), by=.(Day, hashtags)]
# Day hashtags N gender.F gender.M loc.SP loc.UK
# 1: Feb 19 2016 #a 3 1 2 2 1
# 2: Feb 19 2016 #b 3 1 2 2 1
# 3: Feb 20 2016 #a 1 1 0 0 1
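The conversion to factor is what lets a group with no men (or no SP rows) still report a zero instead of losing the column. The base R sketch below illustrates only that general behaviour of factor levels, under the assumption that it is the reason for the `as.factor()` calls together with `drop=FALSE`; it does not reproduce what `dcast` does internally, and the names `g`, `gf` and `counts` are made up for the illustration.

```r
# Genders seen within a single group, e.g. Feb 20 2016 / #a: only one woman
g <- c("F")

# Plain character vector: the absent value "M" does not appear at all
table(g)

# Factor with both levels declared: "M" is kept, with a count of zero
gf <- factor(g, levels = c("F", "M"))
counts <- table(gf)
counts
```

With a character column, each group only knows the values it contains; with a factor, every group carries the full set of levels, so the per-group results can have the same columns.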
Answer 2 (score: 3)
Using the reshape2 package.
library(reshape2)
molten <- melt(df, id.vars = c("Day", "hashtags"))
result <- dcast(molten, Day + hashtags ~ variable + value, length)
result$Daily_Freq <- rowSums(result[, c("location_SP", "location_UK")])
result
# Day hashtags location_SP location_UK gender_F gender_M Daily_Freq
#1 Feb 19 2016 #a 2 1 1 2 3
#2 Feb 19 2016 #b 2 1 1 2 3
#3 Feb 20 2016 #a 0 1 1 0 1
Note that the columns are not in the order of the example output. Reordering them is straightforward.
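As a sketch of that reordering step (not part of the original answer), the columns can be selected by name in the desired order and, optionally, renamed to match the question. The `result` data frame is rebuilt by hand from the printed output so that the snippet runs without reshape2; `wanted` is a name assumed here.

```r
# Result as printed above, rebuilt by hand for a self-contained example
result <- data.frame(
  Day         = c("Feb 19 2016", "Feb 19 2016", "Feb 20 2016"),
  hashtags    = c("#a", "#b", "#a"),
  location_SP = c(2, 2, 0),
  location_UK = c(1, 1, 1),
  gender_F    = c(1, 1, 1),
  gender_M    = c(2, 2, 0),
  Daily_Freq  = c(3, 3, 1),
  stringsAsFactors = FALSE
)

# Select the columns by name in the order used in the question
wanted <- c("Day", "hashtags", "Daily_Freq",
            "gender_M", "gender_F", "location_UK", "location_SP")
result <- result[, wanted]

# Optional: rename to the headers used in the question
names(result) <- c("Day", "hashtags", "Daily_Freq",
                   "men", "women", "Freq_UK", "Freq_SP")
result
```

Selecting by name rather than by position keeps the step valid even if `dcast` returns the columns in a different order.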