I have a data frame with thousands of rows; a sample is shown below:
userid event
1 123 view
2 123 view
3 123 order
4 345 view
5 345 view
6 345 view
7 345 order
8 111 view
9 111 order
10 111 view
11 111 view
12 111 view
13 333 view
14 333 view
15 333 view
dput(data)
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), eventaction = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"
), row.names = c(NA, -15L), class = "data.frame")
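What I am trying to do is extract all rows for every userid that has an "order" entry under event. The result should contain all rows for those userids and exclude userid = 333, because its event column has no order entry.
The second task is to count the occurrences of "view" before the order entry. I would greatly appreciate any help and pointers.
Thanks.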
Answer 0: (score: 3)
We can try data.table. Convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'userid', and if any 'event' within a 'userid' is 'order', return the Subset of Data.table (.SD) for that group:
library(data.table)
setDT(data)[,if(any(event=="order")) .SD , by = userid]
Or using dplyr, we group by 'userid' and then filter, keeping the groups in which any 'event' is 'order'.
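The code for the dplyr variant did not survive in the text above. A minimal sketch of what it describes might look like the following; the sample data is rebuilt here with event as a plain character column, and the name result is only illustrative:

```r
library(dplyr)

# Sample data from the question (event stored as character rather than factor)
data <- data.frame(
  userid = c(123, 123, 123, 345, 345, 345, 345, 111, 111, 111, 111, 111, 333, 333, 333),
  event = c("view", "view", "order", "view", "view", "view", "order",
            "view", "order", "view", "view", "view", "view", "view", "view")
)

# Keep every row of each userid that has at least one "order" event
result <- data %>%
  group_by(userid) %>%
  filter(any(event == "order")) %>%
  ungroup()

print(result)
```

With the sample data this should drop the three rows of userid 333 and keep the other twelve.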
Answer 1: (score: 1)
For the second task, allowing for possibly multiple orders per userid:
library(dplyr)
df %>% group_by(userid) %>%
mutate(row_num = row_number()) %>%
filter(event=="order") %>%
mutate(num_views_before=c(first(row_num),diff(row_num))-1)
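Notes:
Group by userid and number the rows within each group with row_number().
Keep only the rows where event is "order".
diff on the previously created row numbers then gives the number of views before each order. The first order in a group uses its own row number instead, and 1 is subtracted in both cases so that the order row itself is not counted.
To test this, I modified your data by changing the event in row 12 to "order", so that userid=111 has two orders.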
Modified data:
structure(list(userid = c(123, 123, 123, 345, 345, 345, 345,
111, 111, 111, 111, 111, 333, 333, 333), event = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("order",
"view"), class = "factor")), .Names = c("userid", "event"), row.names = c(NA,
-15L), class = "data.frame")
## userid event
##1 123 view
##2 123 view
##3 123 order
##4 345 view
##5 345 view
##6 345 view
##7 345 order
##8 111 view
##9 111 order
##10 111 view
##11 111 view
##12 111 order
##13 333 view
##14 333 view
##15 333 view
With this data, we get:
##Source: local data frame [4 x 4]
##Groups: userid [3]
##
## userid event row_num num_views_before
## <dbl> <fctr> <int> <dbl>
##1 123 order 3 2
##2 345 order 4 3
##3 111 order 2 1
##4 111 order 5 2
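The two values for userid 111 can be checked by hand. The sketch below repeats only the arithmetic from the last mutate step, using the within-group row numbers 2 and 5 from the output above; order_rows and views_before are illustrative names, and the calculation assumes every non-order row is a view:

```r
# Within-group row numbers of the two orders for userid 111 (from the output above)
order_rows <- c(2L, 5L)

# First order: (row number - 1) rows precede it.
# Later orders: (gap to the previous order - 1) rows lie in between.
views_before <- c(order_rows[1], diff(order_rows)) - 1

print(views_before)
# [1] 1 2
```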
Answer 2: (score: 0)
Using base R, if your data.frame is called mydat:
myusers <- mydat[mydat$event == "order", "userid"]
mydat[mydat$userid %in% myusers,]
Answer 3: (score: 0)
You can do it like this:
df[df$userid %in% df[df$event=="order",]$userid,]
Or with subset:
subset(df, df$userid %in% subset(df, event=="order")$userid)
Or with the match function:
subset(df, match(df$userid, subset(df, event=="order")$userid, nomatch = 0)>0)
Or using the sqldf library:
library(sqldf)
sqldf("select * from df where df.userid in (select df.userid from df where df.event=='order')")
# userid event
# 1 123 view
# 2 123 view
# 3 123 order
# 4 345 view
# 5 345 view
# 6 345 view
# 7 345 order
# 8 111 view
# 9 111 order
# 10 111 view
# 11 111 view
# 12 111 view