Question

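I am trying to run a regression on a data series. I want to analyze what happens to clients before the actual sale.

For clients who purchased, I want to stop collecting data two weeks before each client's sale date.

For clients who did not purchase, I want to use all of the available data.

I am using the code below (the date format is the one produced by the system I extract the data from):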

library(dplyr)

# create the sample data frame
df <- data.frame(
  client_id = c(1,1,2,3,3,3,4,5),
  int_type = c("chat", "chat", "chat", "chat", "chat", "sale", "sale", "chat"),
  int_date = c("03OCT2017:17:07:59.000", "06OCT2017:16:50:55.000", "07MAR2017:10:29:02.000",
               "13FEB2017:06:02:07.000", "16APR2017:17:20:36.000", "22APR2017:13:04:12.000",
               "25JUN2017:12:45:33.000", "27JUN2017:15:02:04.000")
  )

# create a column converting strings to dates
df$int_date_posix <- as.POSIXct(df$int_date, format = "%d%b%Y:%H:%M:%S")

# group and summarize to get sale dates
df <- group_by(df, client_id)
df2 <- summarize(df, dt_sale = max(int_date_posix[int_type=="sale"]))

# merge with original data frame
df <- merge(df, df2, by="client_id", all.x=T, all.y=F)
rm(df2)

df <- mutate(df, int_from_sale = difftime(dt_sale, int_date_posix, units="days"))

# filter out everything that happened after two weeks prior to sales
df2 <- df[which(
  df$int_from_sale>14 |
  is.na(df$dt_sale)
  ),]

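This correctly filters the data for clients who purchased, but it drops every client who did not purchase, even though I included the is.na term in the filter.

I can see that is.na does not recognize the NA value in the first row of dt_sale as actually being NA.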

> df[1,"dt_sale"]
[1] NA
> is.na(df[1,"dt_sale"])
[1] FALSE

I cannot find a function that returns TRUE for this value, so that I can filter the data frame the way I need.

Answer 1

The problem arises when df2 is defined as above: for clients with no sale, max() receives an empty vector.

> df2 <- summarize(df, dt_sale = max(int_date_posix[int_type=="sale"]))
Warning messages:
1: In max.default(numeric(0), na.rm = FALSE) :
  no non-missing arguments to max; returning -Inf
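
The -Inf in that warning is the value that ends up in dt_sale, stored as a date-time rather than as a missing value. A minimal sketch that reproduces this on its own, assuming a UTC time zone (the question does not state one); the names `empty_dates` and `no_sale` are made up for illustration:

```r
# An empty POSIXct vector, like int_date_posix[int_type == "sale"]
# for a client who never bought anything
empty_dates <- as.POSIXct(character(0), format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# max() of an empty vector warns and returns -Inf, kept as a POSIXct value
no_sale <- suppressWarnings(max(empty_dates))

is.na(no_sale)        # FALSE: the underlying number is -Inf, not NA
is.infinite(no_sale)  # TRUE
```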

Because of this, you can use is.infinite() instead:

> is.infinite(df$dt_sale)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
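
Applied to the filter from the question, the idea could look like the sketch below. It is written in base R so that it runs without dplyr, and it rebuilds dt_sale with sapply() instead of summarize() and merge(); `sale_secs` is a name introduced here for illustration, and tz = "UTC" is an assumption.

```r
# Sample data from the question
df <- data.frame(
  client_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  int_type  = c("chat", "chat", "chat", "chat", "chat", "sale", "sale", "chat"),
  int_date  = c("03OCT2017:17:07:59.000", "06OCT2017:16:50:55.000",
                "07MAR2017:10:29:02.000", "13FEB2017:06:02:07.000",
                "16APR2017:17:20:36.000", "22APR2017:13:04:12.000",
                "25JUN2017:12:45:33.000", "27JUN2017:15:02:04.000")
)
df$int_date_posix <- as.POSIXct(df$int_date, format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# Latest sale per client in seconds since the epoch;
# clients without a sale get -Inf, exactly as in the question
sale_secs <- sapply(split(df, df$client_id), function(d) {
  suppressWarnings(max(as.numeric(d$int_date_posix[d$int_type == "sale"])))
})
df$dt_sale <- .POSIXct(unname(sale_secs[as.character(df$client_id)]), tz = "UTC")

df$int_from_sale <- difftime(df$dt_sale, df$int_date_posix, units = "days")

# Keep rows more than 14 days before the sale,
# plus every row of clients who never bought (dt_sale is -Inf)
df2 <- df[is.infinite(df$dt_sale) | df$int_from_sale > 14, ]
```

With the sample data, df2 keeps both rows of client 1, the single rows of clients 2 and 5, and only the 13 February chat of client 3.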

