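R - Aggregate by group name and recalculate a column

Time: 2015-04-14 14:07:49

Tags: r

I am using R to pull some data from the Google Analytics API. In this particular scenario I am retrieving information about users' affinity interests, broken down by gender and age group. The data structure I get back looks similar to this: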

gender ageGroup interest        sessions
male   18-24    Autos           4
male   18-24    Autos/Luxury    1
male   18-24    Autos/Vans      1
male   25-34    Autos           8
male   25-34    Autos/Luxury    2
male   25-34    Autos/Vans      2
male   25-34    Autos/Compacts  1
...
female 65+      Fashion         20

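However, the problem with this structure is that Autos, as a main interest, also includes the sessions of its subcategories, so if I use this data in a pivot table I will get wrong totals.

So I am adding a "Generalists" subcategory to every main category, so that each main category becomes its own subcategory, and then splitting this column into two: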

library(splitstackshape)  # provides cSplit()

for (i2 in seq_len(nrow(ga.genderAgeAffinityTable))) {

# main categories contain no "/", so grep() returns integer(0)
chrFound <- grep("[/]", ga.genderAgeAffinityTable$interest[i2] )

if (length(chrFound) < 1) {
ga.genderAgeAffinityTable$interest[i2] <-
sprintf("%s/Generalists", ga.genderAgeAffinityTable$interest[i2])
}

}

# split once, after the loop, into interest_1 and interest_2
ga.genderAgeAffinityTable <- as.data.frame(
cSplit(ga.genderAgeAffinityTable, "interest", sep = "/"))

View(ga.genderAgeAffinityTable)

            gender ageGroup interest        subcategory        sessions
            male   18-24    Autos           Generalists        4
            male   18-24    Autos           Luxury             1
            male   18-24    Autos           Vans               1
            male   25-34    Autos           Generalists        8
            male   25-34    Autos           Luxury             2
            male   25-34    Autos           Vans               2
            male   25-34    Autos           Compacts           1
            ...
            female 65+      Fashion         Generalists        20
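
For comparison, the same two steps can probably be done without a loop in base R. This is only a sketch on made-up sample rows: df and is_main are hypothetical names, and the resulting columns are called interest and subcategory rather than the interest_1 and interest_2 that cSplit produces. It also assumes only the first "/" separates main category from subcategory.

```r
# Hypothetical sample mirroring a few rows shown in the question
df <- data.frame(
  gender   = "male",
  ageGroup = c("18-24", "18-24", "18-24", "25-34"),
  interest = c("Autos", "Autos/Luxury", "Autos/Vans", "Autos"),
  sessions = c(4L, 1L, 1L, 8L),
  stringsAsFactors = FALSE
)

# Rows without a "/" are main categories: append "/Generalists" in one step
is_main <- !grepl("/", df$interest, fixed = TRUE)
df$interest[is_main] <- paste0(df$interest[is_main], "/Generalists")

# Split on the first "/" into main category and subcategory
df$subcategory <- sub("^[^/]*/", "", df$interest)
df$interest    <- sub("/.*$", "", df$interest)

print(df)
```

If the real data has deeper paths (more than one "/"), everything after the first "/" would end up in subcategory with this approach.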

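I still have to get rid of the wrong session counts: for the first group (male, 18-24, Autos), Generalists should have only 2 sessions (sessions - sum(other subcategories)). I am building an auxId (genderAgeInterestSubcategory), aggregating all sessions by that auxId, merging the aggregated sessions into my data frame as a new column, and then recalculating the sessions of the "Generalists" subcategory: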

ga.genderAgeAffinityTable$auxId <- sprintf("%s%s%s",
ga.genderAgeAffinityTable$gender, ga.genderAgeAffinityTable$ageGroup,
ga.genderAgeAffinityTable$interest_1 )

ga.interestAggregated <- aggregate(ga.genderAgeAffinityTable[,c("sessions")],
by=list(ga.genderAgeAffinityTable$auxId), FUN = sum)

colnames(ga.interestAggregated) <- c("auxId", "aggregated")

ga.genderAgeAffinityTable <- (merge(ga.genderAgeAffinityTable,
ga.interestAggregated, by = 'auxId'))

for (i3 in seq_len(nrow(ga.genderAgeAffinityTable))) {

if (ga.genderAgeAffinityTable$interest_2[i3] == "Generalists" ) {

# aggregated includes the Generalists row itself, so the other
# subcategories sum to aggregated - sessions
otherSessions <- ga.genderAgeAffinityTable$aggregated[i3] -
ga.genderAgeAffinityTable$sessions[i3]

# Do not recalculate sessions for interests with only Generalists as subcategory
if (otherSessions != 0 ) {
ga.genderAgeAffinityTable$sessions[i3] <-
ga.genderAgeAffinityTable$sessions[i3] - otherSessions
}

}

}

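Do you know of a more direct way to do this without using the auxId?

1 Answer:

Answer 0: (score: 3)

Have you looked at the data.table package? It has amazing summarizing features that could help you here.

e.g.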

library(data.table)
results <- DT[ , sum(sessions), by = subcategory]
# would give you total sessions per sub interest
#  which could help you subset when you then focus on Generalists.
#  to do multiple groups you would use by = .(gender, subcategory)

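You can create columns with := to hold values computed on subsets. data.table is very powerful in the right hands and removes the need for all the loops you are doing. Keying the data (setkey) can help, although grouping with by does not strictly require it.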
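
As a rough illustration of := combined with by, here is a minimal sketch. The table toy and the column names groupTotal and total are made up for this example and are not part of the asker's data.

```r
library(data.table)

# Hypothetical toy table, not the asker's real data
toy <- data.table(
  gender      = c("male", "male", "female"),
  subcategory = c("Luxury", "Vans", "Luxury"),
  sessions    = c(1L, 2L, 5L)
)

# := adds the column in place; by = repeats the group total on every row
toy[, groupTotal := sum(sessions), by = gender]

# Without :=, the same grouping returns one summary row per group;
# several grouping columns go into by = .(col1, col2)
perGroup <- toy[, .(total = sum(sessions)), by = .(gender, subcategory)]

print(toy)
print(perGroup)
```

The first call keeps all rows and adds a column, while the second collapses the table to one row per group, which is the main difference to keep in mind.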

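I am still a beginner, so others may come along with more efficient code below.

Take a look at the data.table wiki and cheat sheet. The DT experts/legends Matt and @Arun are very active on SO and will most likely chip in and help you if you choose this route.

We may need more detail on how you want the data transformed, i.e. "Generalists should have only 2 sessions": please confirm what output you expect. Do you just need, for each gender / age group / interest, the net sessions of Generalists?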

DATA

To help others pitch in, here is the data for the first two categories, using dput
library(data.table)
DT <- data.table(gender = c("male", "male", "male", "male", "male","male", "male"), 
ageGroup = c("18-24", "18-24", "18-24", "25-34","25-34", "25-34", "25-34"),
interest = c("Autos", "Autos", "Autos","Autos", "Autos", "Autos", "Autos"),
subcategory = c("Generalists","Luxury", "Vans", "Generalists", "Luxury", "Vans", "Compacts"), 
sessions = c(4L, 1L, 1L, 8L, 2L, 2L, 1L) )

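I am building this up in stages to help explain it and to show you the power of data.table. This first step gets the sum of everything except Generalists.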

notgensum <- DT[subcategory  != "Generalists", mysum := sum(sessions),
                by = .(gender, ageGroup, interest)]

    gender ageGroup interest subcategory sessions mysum
1:   male    18-24    Autos Generalists        4    NA
2:   male    18-24    Autos      Luxury        1     2
3:   male    18-24    Autos        Vans        1     2
4:   male    25-34    Autos Generalists        8    NA
5:   male    25-34    Autos      Luxury        2     5
6:   male    25-34    Autos        Vans        2     5
7:   male    25-34    Autos    Compacts        1     5

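Going further, we subtract the non-Generalists total (I take the mean, ignoring NA, to get at this number) from the Generalists' session count. This gives myadjsessions: 2 for the first group (4 - 2) and 3 for the 25-34 male Autos group (8 - 5), as you wanted.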

genadjsum2 <- notgensum[, myadjsessions := (sessions - mean(mysum, na.rm = TRUE)),
                        by = .(gender, ageGroup, interest)]

#   gender ageGroup interest subcategory sessions mysum myadjsessions   
#1:   male    18-24    Autos Generalists        4    NA             2
#2:   male    18-24    Autos      Luxury        1     2            -1
#3:   male    18-24    Autos        Vans        1     2            -1
#4:   male    25-34    Autos Generalists        8    NA             3
#5:   male    25-34    Autos      Luxury        2     5            -3
#6:   male    25-34    Autos        Vans        2     5            -3
#7:   male    25-34    Autos    Compacts        1     5            -4
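
The mean() step above relies on mysum being the same value on every non-Generalists row of a group, so dropping the NA and averaging just returns that value. A small base R check of that reasoning follows; the vectors are made up to mirror one group, and others and lonely are hypothetical names.

```r
# mysum within one group: NA on the Generalists row, the same total elsewhere
mysum <- c(NA, 2, 2)

# Dropping the NA leaves identical values, so their mean is that value
others <- mean(mysum, na.rm = TRUE)

# A group with only a Generalists row has nothing left after dropping NA,
# so the mean of the empty remainder is NaN
lonely <- mean(NA_real_, na.rm = TRUE)

print(others)
print(lonely)
```

So for an interest that has only a Generalists row (such as the Fashion row in the question), this approach would presumably yield NaN instead of leaving the sessions untouched, which is worth checking on the full data.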

data.table calls can be chained, i.e. DT[do this][and this], so if you only want the Generalists results:

genadjsum3 <- notgensum[, 
             myadjsessions := (sessions - mean(mysum, na.rm = TRUE)),
             by = .(gender, ageGroup, interest)][subcategory  == "Generalists"]

#  gender ageGroup interest subcategory sessions mysum myadjsessions
#1:   male    18-24    Autos Generalists        4    NA             2
#2:   male    25-34    Autos Generalists        8    NA             3

Finally, if you want to get rid of the temporary mysum column, the syntax is

genadjsum3[, mysum := NULL]
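
If you would rather keep every row and skip the temporary column altogether, something along these lines might work. It is a sketch on a copy of the sample rows from the DATA section; DT2 and netSessions are hypothetical names introduced only for this example.

```r
library(data.table)

# Copy of the sample rows from the DATA section, under a new name
DT2 <- data.table(
  gender      = "male",
  ageGroup    = c("18-24", "18-24", "18-24", "25-34", "25-34", "25-34", "25-34"),
  interest    = "Autos",
  subcategory = c("Generalists", "Luxury", "Vans",
                  "Generalists", "Luxury", "Vans", "Compacts"),
  sessions    = c(4L, 1L, 1L, 8L, 2L, 2L, 1L)
)

# Per group: Generalists get sessions minus the other subcategories' total;
# every other row keeps its own sessions unchanged
DT2[, netSessions := ifelse(subcategory == "Generalists",
                            sessions - sum(sessions[subcategory != "Generalists"]),
                            sessions),
    by = .(gender, ageGroup, interest)]

print(DT2)
```

Because the sum over an empty set of other subcategories is 0, a group that has only a Generalists row keeps its original sessions with this variant.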

You will love not having loops!