Question

I have a multi-line plot showing the revenue of 20 customers, one line per customer.

I used the following code:

library(dplyr)
library(ggplot2)
trainingSummary <- top20CustomersRevenue %>% group_by(custno, TrainingDate) %>%
  summarize(Revenue = first(Revenue),
            TrainingType = first(TrainingType))
trainingSummary$TrainingType <- as.factor(trainingSummary$TrainingType)

p <- ggplot() + geom_line(data=top20CustomersRevenue,aes(x=DeltaMonth,y=Revenue,group=custno),alpha=0.3) +
  theme_bw() +
  ylab('Revenue (Dollars)') + xlab('') + theme(legend.title=element_blank()) +
  theme(legend.title=element_blank(),axis.text.y=element_text(hjust=0, angle=0), 
        axis.text.x = element_text(hjust=1, angle=45),plot.title=element_text(size=20))
p <- p + geom_point(data = trainingSummary,
               aes(x = TrainingDate, y = Revenue, color= TrainingType))
p

and got the following plot:

[Image: multi-line revenue plot with training points]

My data is in the following format:

custno  TrainingType  Revenue  TrainingDate  DeltaMonth
250     Webinar       4146.80  2013-02-26    2013-01-01
250     Webinar       6211.93  2013-02-26    2013-02-01
250     Webinar       2199.72  2013-02-26    2013-03-01
250     Webinar       4452.65  2013-02-26    2013-04-01
250     Webinar       4787.83  2013-02-26    2013-05-01
250     Webinar       4004.80  2013-02-26    2013-06-01
250     Webinar       4806.69  2013-02-26    2013-07-01

Example: in the dataset above, I want to add a tick mark on the line for custno 250 at its TrainingDate, 2013-02-26.

Here is the result of dput(head(top20CustomersRevenue)):

structure(list(custno = c(250L, 250L, 250L, 250L, 250L, 250L), 
    TrainingType = structure(c(5L, 5L, 5L, 5L, 5L, 5L), .Label = c("In-person", 
    "In person", "In Person", "webinar", "Webinar", "Webinar "
    ), class = "factor"), Revenue = c(4146.8, 6211.93, 2199.72, 
    4452.65, 4787.83, 4004.8), TrainingDate = structure(c(1361865600, 
    1361865600, 1361865600, 1361865600, 1361865600, 1361865600
    ), class = c("POSIXct", "POSIXt"), tzone = ""), DeltaMonth = structure(c(1357027200, 
    1359705600, 1362124800, 1364799600, 1367391600, 1370070000
    ), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("custno", 
"TrainingType", "Revenue", "TrainingDate", "DeltaMonth"), row.names = c(NA, 
6L), class = "data.frame")

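I have data for 20 different customers, each with a different custno and TrainingDate.

How can I make sure the points sit on the lines in the correct position, rather than hanging in the air?

Any help with this would be much appreciated.

Update

@Gregor - thank you for your very helpful answer. I am still running into a problem with ceiling_date:

Here is a portion of my original data: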

[889] "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01"
[895] "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01"
[901] "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01"
[907] "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01" "2013-02-01"

After running ceiling_date(top20CustomersRevenue$TrainingDate + months(1), unit = "month"), this is the same portion:

[889] NA           NA           NA           NA           NA           NA          
[895] NA           NA           NA           NA           NA           NA          
[901] NA           NA           NA           NA           NA           NA          
[907] NA           NA           NA           NA           NA           NA

To find out what was generating the NAs, I ran the following statement, and it did not produce NA.

> ceiling_date(as.Date("2013-02-01")+months(1),unit="month")
[1] "2013-03-01"

Why is there this difference in behavior? Do you have any ideas?

Answer 1

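This is untested for now, because I'm answering while my R session is busy running a model, but I think it will work:

As @BondedDust suggested, first we group your data so that there is one row per customer per training: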

library(dplyr)
trainingSummary <- top20CustomersRevenue %>% group_by(custno, TrainingDate) %>%
    summarize(Revenue = first(Revenue),
              TrainingType = first(TrainingType))

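Edit: To interpolate the revenue for a specific training date, we look at the previous and next months and position the point according to how far into the month the training took place. I converted your POSIX dates to Date objects; you can convert them back at the end if you need to.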

library(ggplot2)
library(dplyr)
library(lubridate)

# Work with Date objects instead of POSIXct
top20CustomersRevenue <- top20CustomersRevenue %>%
    mutate(DeltaMonth = as.Date(DeltaMonth),
           TrainingDate = as.Date(TrainingDate))

trainingSummary <- top20CustomersRevenue %>%
    group_by(custno, TrainingDate) %>%
    # Revenue at the start of the training month and of the following month
    mutate(prev.month.rev = Revenue[DeltaMonth == floor_date(TrainingDate, unit = "month")],
           next.month.rev = Revenue[DeltaMonth == ceiling_date(TrainingDate, unit = "month")],
           # Linear interpolation by the fraction of the month elapsed;
           # passing the date itself lets days_in_month() handle leap years
           interp.rev = prev.month.rev + (next.month.rev - prev.month.rev) *
               ((mday(TrainingDate) - 1) / days_in_month(TrainingDate))) %>%
    summarize(Revenue = first(interp.rev),
              TrainingType = first(TrainingType))
# Re-create the factor to drop unused levels
trainingSummary$TrainingType <- factor(trainingSummary$TrainingType)


p <- ggplot() +
    geom_line(data = top20CustomersRevenue,
              aes(x = DeltaMonth, y = Revenue, group = custno), alpha = 0.3) +
    theme_bw() +
    labs(y = 'Revenue (Dollars)', x = '') +
    theme(legend.title = element_blank(),
          axis.text.y = element_text(hjust = 0, angle = 0),
          axis.text.x = element_text(hjust = 1, angle = 45),
          plot.title = element_text(size = 20)) +
    geom_point(data = trainingSummary,
               aes(x = TrainingDate, y = Revenue, color = TrainingType))
p

This works on the sample you provided. If any trainings took place after the last DeltaMonth or before the first DeltaMonth, it will not work.
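One way to spot such trainings before interpolating is sketched below in base R. The data frame toy and the column can_interpolate are made up for illustration; only the column names custno, DeltaMonth and TrainingDate come from the question. The sketch flags trainings whose bracketing months are not both present for that customer.

```r
# Hypothetical toy data using the question's column names
toy <- data.frame(
  custno       = c(250L, 250L, 250L, 300L, 300L, 300L),
  DeltaMonth   = as.Date(c("2013-01-01", "2013-02-01", "2013-03-01",
                           "2013-01-01", "2013-02-01", "2013-03-01")),
  TrainingDate = as.Date(c(rep("2013-02-26", 3), rep("2013-03-15", 3)))
)

# First day of the training month and of the month after it
month_start <- as.Date(format(toy$TrainingDate, "%Y-%m-01"))
next_month  <- as.Date(format(month_start + 31, "%Y-%m-01"))

# A training can be interpolated only if both bracketing months
# appear among that customer's DeltaMonth values
key  <- function(id, d) paste(id, d)
have <- key(toy$custno, toy$DeltaMonth)
toy$can_interpolate <- key(toy$custno, month_start) %in% have &
                       key(toy$custno, next_month)  %in% have

# Trainings that would break the interpolation step
unique(toy[!toy$can_interpolate, c("custno", "TrainingDate")])
```

Here customer 300 is flagged, because the training falls in March and there is no April DeltaMonth row to interpolate towards. Such rows could be dropped, or plotted at the nearest available month instead.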

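A little about the interpolation:

What we do to get the y value (i.e., the revenue) to plot at a specific training date is quite simple. Suppose we have a training on date d. We get the y (revenue) value of the previous DeltaMonth, y_prev, and the revenue of the next month, y_next. Since all the DeltaMonth values fall on the first day of a month, floor_date() and ceiling_date() give us the previous and next DeltaMonth date values.

The slope of the line connecting the previous and next months' revenues is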

slope = change in y / change in x = (y_next - y_prev) / (number of days in month)

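So the y value at the training date is the previous revenue (y_prev) plus the slope times the number of days since the start of the month. The number of days since the start of the month is mday(TrainingDate) - 1, and everything else in interp.rev is the slope. It's just a bit of high-school algebra with points and slopes.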
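As a worked check of that arithmetic, here is a small base R sketch using the sample rows for custno 250 from the question (training on 2013-02-26, revenue 6211.93 on 2013-02-01 and 2199.72 on 2013-03-01). The variable names are made up for illustration, and the comparison against approx() is only a sanity check, not part of the original answer.

```r
# Values taken from the question's sample rows for custno 250
training_date <- as.Date("2013-02-26")
month_start   <- as.Date("2013-02-01")   # previous DeltaMonth
next_month    <- as.Date("2013-03-01")   # next DeltaMonth
y_prev <- 6211.93
y_next <- 2199.72

days_in_this_month <- as.numeric(next_month - month_start)     # 28
days_elapsed       <- as.numeric(training_date - month_start)  # 25, i.e. mday - 1

# Point-slope form: start at y_prev and move along the line
slope      <- (y_next - y_prev) / days_in_this_month
interp_rev <- y_prev + slope * days_elapsed
interp_rev   # about 2629.60

# Cross-check with base R's linear interpolation
check <- approx(x = as.numeric(c(month_start, next_month)),
                y = c(y_prev, y_next),
                xout = as.numeric(training_date))$y
all.equal(interp_rev, check)
```

The point for this training is therefore drawn at roughly 2629.60 on the y axis, which lies on the line segment between the February and March revenues.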
