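I'm using R to pull some data from the Google Analytics API. In this particular scenario I'm retrieving the affinity interests of users, broken down by gender and age group. The data I get back looks like this: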
gender ageGroup interest sessions
male 18-24 Autos 4
male 18-24 Autos/Luxury 1
male 18-24 Autos/Vans 1
male 25-34 Autos 8
male 25-34 Autos/Luxury 2
male 25-34 Autos/Vans 2
male 25-34 Autos/Compacts 1
...
female 65+ Fashion 20
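The problem with this structure is a row like Autos: the main interest also includes the sessions of its subcategories, so if I use this data in a pivot table I get wrong totals.
So I'm adding a subcategory "Generalists" to every main category as its own subcategory, and splitting the column into two: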
library(splitstackshape)  # provides cSplit()

for (i2 in 1:nrow(ga.genderAgeAffinityTable)) {
  # main categories contain no "/", so grep() returns integer(0)
  chrFound <- grep("[/]", ga.genderAgeAffinityTable$interest[i2])
  if (length(chrFound) < 1) {
    ga.genderAgeAffinityTable$interest[i2] <-
      sprintf("%s/Generalists", ga.genderAgeAffinityTable$interest[i2])
  }
}
# split once, after the loop; this creates interest_1 and interest_2
ga.genderAgeAffinityTable <- as.data.frame(
  cSplit(ga.genderAgeAffinityTable, "interest", sep = "/"))
View(ga.genderAgeAffinityTable)
gender ageGroup interest subcategory sessions
male 18-24 Autos Generalists 4
male 18-24 Autos Luxury 1
male 18-24 Autos Vans 1
male 25-34 Autos Generalists 8
male 25-34 Autos Luxury 2
male 25-34 Autos Vans 2
male 25-34 Autos Compacts 1
...
female 65+ Fashion Generalists 20
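The same tagging and splitting can probably be done without a row-by-row loop. The following is only a sketch in base R on a small made-up sample; the data frame name affinity and its rows are assumptions for illustration, not the real API result, and it assumes interest is a character column.

```r
# Hypothetical sample standing in for the Google Analytics result
affinity <- data.frame(
  gender   = c("male", "male", "male", "female"),
  ageGroup = c("18-24", "18-24", "18-24", "65+"),
  interest = c("Autos", "Autos/Luxury", "Autos/Vans", "Fashion"),
  sessions = c(4L, 1L, 1L, 20L),
  stringsAsFactors = FALSE
)

# Append "/Generalists" to every interest that contains no "/"
isMain <- !grepl("/", affinity$interest, fixed = TRUE)
affinity$interest[isMain] <- paste0(affinity$interest[isMain], "/Generalists")

# Split at the first "/": everything after it becomes the subcategory,
# everything before it stays as the main interest
affinity$subcategory <- sub("^[^/]*/", "", affinity$interest)
affinity$interest    <- sub("/.*$", "", affinity$interest)
```

Note that with deeper paths such as "A/B/C" this sketch keeps "B/C" as the subcategory, which may or may not be what you want.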
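I still have to get rid of the wrong session counts: for the first group (male, 18-24, Autos), Generalists should have only 2 sessions (sessions - sum(other subcategories)). I'm building an auxId (gender + ageGroup + interest), summing all sessions by that auxId, merging the aggregated sessions into my data frame as a new column, and recalculating the sessions of the "Generalists" subcategory: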
ga.genderAgeAffinityTable$auxId <- sprintf("%s%s%s",
    ga.genderAgeAffinityTable$gender, ga.genderAgeAffinityTable$ageGroup,
    ga.genderAgeAffinityTable$interest_1 )
ga.interestAggregated <- aggregate(ga.genderAgeAffinityTable[,c("sessions")],
by=list(ga.genderAgeAffinityTable$auxId), "sum")
colnames(ga.interestAggregated) <- c("auxId", "aggregated")
ga.genderAgeAffinityTable <- (merge(ga.genderAgeAffinityTable,
ga.interestAggregated, by = 'auxId'))
for (i3 in 1:nrow(ga.genderAgeAffinityTable)) {
  if (ga.genderAgeAffinityTable$interest_2[i3] == "Generalists") {
    # aggregated includes the Generalists row itself, so the other
    # subcategories sum to aggregated - sessions
    otherSessions <- ga.genderAgeAffinityTable$aggregated[i3] -
      ga.genderAgeAffinityTable$sessions[i3]
    # Do not recalculate sessions for interests with only Generalists as subcategory
    if (otherSessions != 0) {
      ga.genderAgeAffinityTable$sessions[i3] <-
        ga.genderAgeAffinityTable$sessions[i3] - otherSessions
    }
  }
}
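Do you know a more direct way to do this without the auxId?
Answer 0 (score: 3)
Have you looked at the data.table package? It has excellent summarising features that could help you.
For example: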
library(data.table)
results <- DT[ , sum(sessions), by = subcategory]
# would give you total sessions per sub interest
# which could help you subset when you then focus on Generalists.
# to do multiple groups you would use by = .(gender, subcategory)
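You can create columns on a subset of rows using :=. data.table is very powerful in the right hands and can save you all the looping you are doing now. You may also want to key your data.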
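As a rough illustration of := and keying, here is a small sketch on made-up rows in the shape shown in the question. The column names groupTotal and total are assumptions chosen for this example; setkey is optional here, since by works without a key.

```r
library(data.table)

# Hypothetical sample in the shape shown in the question
DT <- data.table(
  gender      = c("male", "male", "male", "female"),
  ageGroup    = c("18-24", "18-24", "18-24", "65+"),
  interest    = c("Autos", "Autos", "Autos", "Fashion"),
  subcategory = c("Generalists", "Luxury", "Vans", "Generalists"),
  sessions    = c(4L, 1L, 1L, 20L)
)

# Optional: a key sorts the table by these columns; by= also works without it
setkey(DT, gender, ageGroup, interest)

# := adds a column by reference; every row gets its group's total
DT[, groupTotal := sum(sessions), by = .(gender, ageGroup, interest)]

# A plain summary (one row per group) uses .() in j instead of :=
totals <- DT[, .(total = sum(sessions)), by = .(gender, ageGroup, interest)]
```

The difference to keep in mind: := keeps all rows and adds a column, while .() in j collapses each group to one row.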
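I'm still a beginner, so others may post more efficient code below.
Do take a look at the data.table wiki and cheat sheet. The DT experts/legends Matt and @Arun are very active on SO, and if you go down this route they may well chime in and help you.
We may need more detail on how you want the data transformed, i.e. "Generalists should have only 2 sessions": please confirm the output you expect. Do you only need, for each gender/ageGroup/interest, the net sessions for Generalists?
To help others pitch in, here is the data (via dput):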
library(data.table)
DT <- data.table(gender = c("male", "male", "male", "male", "male","male", "male"),
ageGroup = c("18-24", "18-24", "18-24", "25-34","25-34", "25-34", "25-34"),
interest = c("Autos", "Autos", "Autos","Autos", "Autos", "Autos", "Autos"),
subcategory = c("Generalists","Luxury", "Vans", "Generalists", "Luxury", "Vans", "Compacts"),
sessions = c(4L, 1L, 1L, 8L, 2L, 2L, 1L) )
I'll build this up in stages to help explain it and to show you the power of data.table. This first step sums everything other than Generalists:
notgensum <- DT[subcategory != "Generalists", mysum := sum(sessions),
by = .(gender, ageGroup, interest)]
gender ageGroup interest subcategory sessions mysum
1: male 18-24 Autos Generalists 4 NA
2: male 18-24 Autos Luxury 1 2
3: male 18-24 Autos Vans 1 2
4: male 25-34 Autos Generalists 8 NA
5: male 25-34 Autos Luxury 2 5
6: male 25-34 Autos Vans 2 5
7: male 25-34 Autos Compacts 1 5
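Going further, we subtract the non-Generalists sum from the Generalists' session count (I use a mean that ignores NAs to pull that sum onto the Generalists row). This gives myadjsessions: 2 for the first group (4 - 2) and 3 for the 25-34 male Autos group (8 - 5), as you wanted. The values on the non-Generalists rows are meaningless and can be ignored.
genadjsum2 <- notgensum[, myadjsessions := (sessions - mean(mysum, na.rm = TRUE)),
                 by = .(gender, ageGroup, interest)]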
# gender ageGroup interest subcategory sessions mysum myadjsessions
#1: male 18-24 Autos Generalists 4 NA 2
#2: male 18-24 Autos Luxury 1 2 -1
#3: male 18-24 Autos Vans 1 2 -1
#4: male 25-34 Autos Generalists 8 NA 3
#5: male 25-34 Autos Luxury 2 5 -3
#6: male 25-34 Autos Vans 2 5 -3
#7: male 25-34 Autos Compacts 1 5 -4
data.table calls can be chained, i.e. DT[do this][and this], so if you only want the Generalists results:
genadjsum3 <- notgensum[,
    myadjsessions := (sessions - mean(mysum, na.rm = TRUE)),
    by = .(gender, ageGroup, interest)][subcategory == "Generalists"]
# gender ageGroup interest subcategory sessions mysum myadjsessions
#1: male 18-24 Autos Generalists 4 NA 2
#2: male 25-34 Autos Generalists 8 NA 3
Finally, if you want to get rid of the temporary mysum column, the syntax is:
genadjsum3[, mysum := NULL]
You'll love not having loops!
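The stages above could probably be folded into a single grouped assignment with no temporary column. This is only a sketch, not a tested production solution: the column name netSessions is an assumption, and the last row (female, 65+, Fashion, 20 sessions) is taken from the question to show a group that has only a Generalists row.

```r
library(data.table)

# Sample rows from the question, plus the Generalists-only Fashion group
DT <- data.table(
  gender      = c(rep("male", 7), "female"),
  ageGroup    = c(rep("18-24", 3), rep("25-34", 4), "65+"),
  interest    = c(rep("Autos", 7), "Fashion"),
  subcategory = c("Generalists", "Luxury", "Vans",
                  "Generalists", "Luxury", "Vans", "Compacts", "Generalists"),
  sessions    = c(4L, 1L, 1L, 8L, 2L, 2L, 1L, 20L)
)

# Within each gender/ageGroup/interest group, the Generalists row keeps
# only what is left after subtracting the other subcategories;
# all other rows keep their own sessions unchanged
DT[, netSessions := ifelse(
       subcategory == "Generalists",
       sessions - sum(sessions[subcategory != "Generalists"]),
       sessions),
   by = .(gender, ageGroup, interest)]
```

Because the sum over an empty set of rows is 0, a group with only a Generalists row keeps its original count rather than ending up as NaN, which is what the mean(mysum, na.rm = TRUE) step would produce for such a group.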