Extracting training and test sets from requests in web access logs

Time: 2017-12-26 14:00:29

Tags: r dplyr

I have this data frame:

df = structure(list(session_id = c(1105L, 1105L, 1105L, 1107L, 1107L, 
1107L, 1108L, 1108L, 1108L, 1109L, 1109L, 1109L, 1110L, 1110L, 
1110L, 1111L, 1111L, 1111L, 1111L, 1112L, 1112L, 1112L, 1112L, 
1114L, 1114L, 1114L, 1114L), datetime = structure(c(1457483622, 
1457483623, 1457483625, 1457484264, 1457484266, 1457484269, 1457484842, 
1457484844, 1457484846, 1457485297, 1457485299, 1457485300, 1457485369, 
1457485369, 1457485371, 1457486315, 1457486316, 1457486316, 1457486318, 
1457486477, 1457486480, 1457486480, 1457486481, 1457486997, 1457486997, 
1457486998, 1457487001), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    request = c(8, 3, 3, 14, 14, 7, 9, 10, 10, 17, 6, 6, 10, 
    8, 5, 9, 11, 14, 16, 21, 11, 1, 19, 7, 4, 13, 20)), .Names = c("session_id", 
"datetime", "request"), row.names = c(NA, -27L), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

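I am trying to group this data by session_id so that 50% of each session's requests go into a training set (Train) and the remaining 50% go into a test set (Test).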

Desired output:

For example, session_id = 1105 contains 3 entries. Splitting it in half (50%) gives 1.5, which we round up to 2 (the next positive integer). So the Train column contains 8, 3 and the Test column contains 3. The same is done for the remaining session_ids.
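The rounding rule described above can be checked with a small sketch. The vector name `n_rows` is hypothetical and only serves the illustration; `ceiling()` gives the "next positive integer" behaviour, whereas base R's `round()` rounds exact halves to the nearest even number and so differs for some group sizes.

```r
# Hypothetical group sizes, used only to illustrate the rounding rule
n_rows <- c(1, 3, 4, 5)

# Number of rows that go to Train when rounding up
train_ceiling <- ceiling(n_rows * 0.5)  # 1 2 2 3

# round() rounds halves to even, so sizes 1 and 5 give a different count
train_round <- round(n_rows * 0.5)      # 0 2 2 2
```

For the data in this question every session has 3 or 4 rows, so both functions give 2 rows for Train.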

1 Answer:

Answer 0: (score: 1)

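We can use the slice function from the dplyr package. Applied to the data grouped by session_id, slice keeps the first 50% of the rows in each group. After creating df_train, we can use anti_join to create df_test.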

library(dplyr)

# Create ID by row and group data by session_id
df <- df %>% 
  mutate(ID = 1:n()) %>%
  group_by(session_id)

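# Take the first 50% of each group, rounding up (3 rows -> 2 rows).
# ceiling() is used because round() rounds halves to even, e.g. round(2.5) is 2
df_train <- df %>%
  slice(seq_len(ceiling(n() * 0.5))) %>%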
  ungroup()

# Keep only the rows that are not in df_train (matched by ID)
df_test <- df %>%
  anti_join(df_train, by = "ID") %>%
  ungroup()

head(df_train)
# # A tibble: 6 x 4
#   session_id            datetime request    ID
#        <int>              <dttm>   <dbl> <int>
# 1       1105 2016-03-09 00:33:42       8     1
# 2       1105 2016-03-09 00:33:43       3     2
# 3       1107 2016-03-09 00:44:24      14     4
# 4       1107 2016-03-09 00:44:26      14     5
# 5       1108 2016-03-09 00:54:02       9     7
# 6       1108 2016-03-09 00:54:04      10     8

head(df_test)
# # A tibble: 6 x 4
#   session_id            datetime request    ID
#        <int>              <dttm>   <dbl> <int>
# 1       1105 2016-03-09 00:33:45       3     3
# 2       1107 2016-03-09 00:44:29       7     6
# 3       1108 2016-03-09 00:54:06      10     9
# 4       1109 2016-03-09 01:01:40       6    12
# 5       1110 2016-03-09 01:02:51       5    15
# 6       1111 2016-03-09 01:18:36      14    18
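If the Train and Test values are wanted side by side, one row per session, as the question describes, something like the following may work. This is only a sketch: the `toy` data frame is a hypothetical two-session subset of the question's data, `split_summary` is an illustrative name, and collapsing the requests into comma-separated strings is an assumption about the desired layout.

```r
library(dplyr)

# Hypothetical toy data: the first two sessions from the question
toy <- tibble(
  session_id = c(1105L, 1105L, 1105L, 1107L, 1107L, 1107L),
  request    = c(8, 3, 3, 14, 14, 7)
)

# One row per session: first half (rounded up) in Train, the rest in Test
split_summary <- toy %>%
  group_by(session_id) %>%
  summarise(
    Train = paste(request[seq_len(ceiling(n() / 2))], collapse = ","),
    Test  = paste(request[-seq_len(ceiling(n() / 2))], collapse = ",")
  )

print(split_summary)
```

For session 1105 this puts "8,3" in Train and "3" in Test. A session with a single row would end up with an empty Test string.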