I have a data frame, call it df1, that looks like this:
month product_key price
201408 00020e32-a64715 75
201408 00020e32-a64715 75
201408 000340b8-bacac8 20
201408 000458f1-fdb6ae 45
201408 00083ebb-e9c17f 250
201408 00207e67-15a59f 480
201408 002777d7-50bec1 12
201408 002777d7-50bec1 12
201409 00020e32-a64715 75
201409 000340b8-bacac8 20
201409 00083ebb-e9c17f 250
201409 00207e67-15a59f 480
201409 00207e67-15a59f 480
201409 00207e67-15a59f 480
201410 00083ebb-e9c17f 250
201410 00207e67-15a59f 480
201410 00207e67-15a59f 480
201410 0020baff-9730f0 39.99
201411 00083ebb-e9c17f 250
201411 00207e67-15a59f 480
201412 00083ebb-e9c17f 250
201501 00083ebb-e9c17f 200
201501 0020baff-9730f0 29.99
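The data set contains other variables, but they are not needed here. The data is monthly and runs from mid-2014 to the end of 2015. Each month has hundreds of products, and the same product can appear several times within a month.
What I want to do is identify the products that appear in both August and September and drop the products that do not appear in both months. Then I want to compute the average price of the remaining products for each month, and divide the September average price by the August average price. In my data frame, the resulting number would be the September index (August defaults to 1, because that is where the data set starts).
I then want to do the same for the following months: identify the products that appear in both September and October, drop the products that do not appear in both months, and compute the average price of the remaining products for each month. Then I want to divide the October average price by the September average price (this differs from the September average computed before, because the products present in both September and October are likely to differ from those present in both August and September). The resulting number would be the October index. I want to repeat this for all subsequent month pairs (October and November, November and December, December and January, January and February, and so on).
My result data frame would ideally look like this (with arbitrary numbers for the index):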
month index
201408 1
201409 1.0005
201410 1.0152
201411 0.9997
201412 0.9551
201501 0.8985
201502 0.9754
201503 1.0045
201504 1.1520
201505 1.0148
201506 1.0452
201507 0.9945
201508 0.9751
201509 1.0004
201510 1.0415
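When I try this, I end up matching products across the whole data set rather than across two consecutive months. I could do it by splitting the data set into one data set per month, but that seems long-winded and tedious. Surely there is a faster way?
You can create a test data set with the following code: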
month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
df1 <- data.frame(month, product_key, price)
As an example of how I want this to work, here is what I did to create the index for August and September:
library(dplyr)

# price is stored as text in df1, so convert it before averaging
DF1Aug <- df1 %>%
  filter(month %in% "201408") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))
DF1Sept <- df1 %>%
  filter(month %in% "201409") %>%
  group_by(product_key) %>%
  summarize(aveprice = mean(as.numeric(as.character(price))))
# the inner join keeps only products present in both months
SeptPriceIndex <- merge(DF1Aug, DF1Sept, by = "product_key", suffixes = c("_Aug", "_Sept")) %>%
  mutate(AugAvgPrice = mean(aveprice_Aug)) %>%
  mutate(SeptAvgPrice = mean(aveprice_Sept)) %>%
  mutate(priceIndex = SeptAvgPrice / AugAvgPrice)
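However, this is obviously a tedious process, and I have 20 or so months in the data frame (and I need to do this on several data frames), so I would like to find an automated approach.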
Answer 0 (score: 0)
The following works using dplyr and tidyr (updated):
library(dplyr)
library(tidyr)

# df1 is the data frame from the question
df1 %>%
# ensure data is sorted so that months are sequential by product key:
arrange(product_key, month) %>%
# ensure every product month combo exists:
complete(product_key, month) %>%
# create a sequential id within each product:
group_by(product_key) %>%
mutate(grp_seq = row_number()) %>%
# remove product / month pairs without a price:
filter(!is.na(price)) %>%
# remove product keys that appear in only one month:
filter(n_distinct(month) > 1) %>%
# remove non-consecutive product / month pairs:
filter(lead(grp_seq) - 1 == grp_seq | lag(grp_seq) + 1 == grp_seq) %>%
# summarize the average price by month:
group_by(month) %>%
summarize(avg_price = mean(as.numeric(price))) %>%
# calculate the price index:
mutate(index_price = avg_price / lag(avg_price))
# A tibble: 6 x 3
month avg_price index_price
<chr> <dbl> <dbl>
1 201408 180. NA
2 201409 298. 1.65
3 201410 403. 1.36
4 201411 365. 0.905
5 201412 250. 0.685
6 201501 200. 0.800
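Answer 1 (score: 0)
The OP wants to obtain a price index for two consecutive months by averaging all recorded prices of all recurring products in each month and then dividing the two monthly averages.
This may be what the OP wants, but I do not think it is the right approach.
Here is a made-up example to explain what I mean. Suppose we have two products. Product A is expensive and has two recorded prices in April, but its price does not change in May. Product B is cheap, but its price doubles in May. I would therefore expect the price index to reflect an increase.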
library(data.table)
example <- fread(
"month product_key price
201704 A 90
201704 A 110
201704 B 1
201705 A 100
201705 B 2")
# OP's approach
example[, .(avg_price = mean(price)), by = month][
, price_index := avg_price / shift(avg_price)][]
    month avg_price price_index
1: 201704        67          NA
2: 201705        51    0.761194
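So, according to the OP's approach, the price index has fallen.
I believe the right approach is the following.
(I apologise for using the data.table syntax, which I am more familiar with. I tried to use dplyr syntax, but it took me too much time.)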
# compute average monthly price for each product
tmp1 <- example[, .(avg_price = mean(price)), keyby = .(product_key, month)]
tmp1
   product_key  month avg_price
1:           A 201704       100
2:           A 201705       100
3:           B 201704         1
4:           B 201705         2
# compute price index for each product
tmp2 <- tmp1[, price_index := avg_price / shift(avg_price), by = product_key][]
tmp2
   product_key  month avg_price price_index
1:           A 201704       100          NA
2:           A 201705       100           1
3:           B 201704         1          NA
4:           B 201705         2           2
# compute average price index
tmp2[, .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
    month avg_price_index
1: 201704             NaN
2: 201705             1.5
Now the price index matches my expectations (though perhaps not the OP's).
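The OP asks for price indices over several months, computed only for products that appear in consecutive months. This can be solved with a self join (shifted by one month).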
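Note that a simple lag() or shift() is dangerous here because it relies on row order and fails if a month is missing. Therefore, date arithmetic is used to find the correct subsequent month.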
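To illustrate the point about row order, here is a minimal base R sketch. The data frame toy, its prices, and the helper column names are made up for illustration; the month number is derived from the yyyymm string rather than from a Date, which is a simplification and not the method used in the code of this answer.

```r
# Hypothetical toy data for one product: 201706 is missing
toy <- data.frame(
  month = c("201704", "201705", "201707"),
  avg_price = c(100, 110, 121),
  stringsAsFactors = FALSE
)

# Row-based lag: takes the previous row, whatever month it belongs to
toy$prev_by_row <- c(NA, head(toy$avg_price, -1))

# Month-based lookup: running month number (year * 12 + month),
# then find the row that lies exactly one month earlier
month_idx <- as.integer(substr(toy$month, 1, 4)) * 12L +
  as.integer(substr(toy$month, 5, 6))
toy$prev_by_month <- toy$avg_price[match(month_idx - 1L, month_idx)]

toy$index_by_row <- toy$avg_price / toy$prev_by_row
toy$index_by_month <- toy$avg_price / toy$prev_by_month
toy
```

For 201707 the row-based version silently divides by the 201705 price, whereas the month-based lookup yields NA because there is no 201706 record.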
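The self join approach has the additional advantage that only recurring products are considered. If a product_key has no match in the following month, price will be NA. These entries are dropped when the average price index is computed.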
library(data.table)
library(magrittr)
DF2 <- setDT(DF1)[
# convert price from factor to numeric
, price := price %>% as.character() %>% as.numeric()][
# convert character month to Date
, month := month %>% lubridate::ymd(truncated = 1L)][
# compute average monthly price for each product
, .(avg_price = mean(price)), keyby = .(product_key, month)]
# self join with subsequent month
DF2[DF2[, .(product_key, month = month + months(1), avg_price)],
on = .(product_key, month)][
# compute price index for each product
, price_index := avg_price / i.avg_price][
# compute average price index
, .(avg_price_index = mean(price_index, na.rm = TRUE)), by = month]
        month avg_price_index
1: 2014-09-01       0.8949772
2: 2014-10-01       1.0000000
3: 2014-11-01       1.0000000
4: 2014-12-01       1.0000000
5: 2015-01-01       0.8000000
6: 2015-02-01             NaN
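For comparison, the index as the question defines it (average price of the matched products in the current month divided by their average price in the previous month) can be built on the same shifted self join idea. The following is only a sketch in base R on made-up data: toy, prev, pairs and the month_idx column are illustrative names, and averaging per product first follows the question's own August/September code.

```r
# Hypothetical toy data; there is no record for 201411
toy <- data.frame(
  month = c("201408", "201408", "201408", "201408", "201409",
            "201409", "201409", "201410", "201410", "201412"),
  product_key = c("A", "A", "B", "C", "A", "B", "D", "B", "D", "B"),
  price = c(10, 20, 100, 5, 30, 100, 7, 110, 7, 120),
  stringsAsFactors = FALSE
)

# Average price per product and month
avg <- aggregate(price ~ product_key + month, data = toy, FUN = mean)

# Running month number, so "previous month" also works across year ends
m <- as.character(avg$month)
avg$month_idx <- as.integer(substr(m, 1, 4)) * 12L + as.integer(substr(m, 5, 6))

# Copy shifted forward by one month: each row now carries the previous month's price
prev <- data.frame(
  product_key = avg$product_key,
  month_idx = avg$month_idx + 1L,
  prev_price = avg$price,
  stringsAsFactors = FALSE
)

# The inner join keeps only products present in both months of a pair
pairs <- merge(avg, prev, by = c("product_key", "month_idx"))

# Ratio of the two average prices over the matched products
cur_mean <- tapply(pairs$price, as.character(pairs$month), mean)
prev_mean <- tapply(pairs$prev_price, as.character(pairs$month), mean)
index <- data.frame(
  month = names(cur_mean),
  index = as.numeric(cur_mean / prev_mean),
  stringsAsFactors = FALSE
)
index
```

The first month and any month without records in the preceding month (201412 here) do not appear in the result; they would have to be added separately, for example with an index of 1 for the starting month.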
Data, as provided by the OP:
month <- c("201408", "201408", "201408", "201408", "201408", "201408", "201408", "201408", "201409", "201409", "201409", "201409", "201409", "201409", "201410", "201410", "201410", "201410", "201411", "201411", "201412", "201501", "201501")
product_key <- c("00020e32-a64715", "00020e32-a64715", "000340b8-bacac8", "000458f1-fdb6ae", "00083ebb-e9c17f", "00083ebb-e9c17f", "002777d7-50bec1", "002777d7-50bec1", "00020e32-a64715", "000340b8-bacac8", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "00207e67-15a59f", "00083ebb-e9c17f", "00207e67-15a59f", "00207e67-15a59f", "0020baff-9730f0", "00083ebb-e9c17f", "00207e67-15a59f", "00083ebb-e9c17f", "00083ebb-e9c17f", "0020baff-9730f0")
price <- c("75", "75", "20", "45", "250", "480", "12", "12", "75", "20", "250", "480", "480", "480", "250", "480", "480", "39.99", "250", "480", "250", "200", "29.99")
DF1 <- data.frame(month, product_key, price)
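Note that all columns are factors. (This holds for R versions before 4.0.0, where data.frame() converted strings to factors by default; from 4.0.0 on the columns are character.)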
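This matters for the price column: converting a factor directly with as.numeric() returns the internal level codes, not the prices, which is why the conversion goes through as.character() first. A small sketch with three made-up values:

```r
# Prices stored as a factor, as in the data above under older R versions
f <- factor(c("75", "20", "39.99"))

codes <- as.numeric(f)                  # integer level codes, not the prices
prices <- as.numeric(as.character(f))   # the actual prices

codes
prices
```

The as.character() step is harmless when the column is already character, so the same conversion works in either case.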