Using dplyr and/or cut to break a continuous variable into categories

Time: 2017-09-06 21:38:09

Tags: r dplyr cut categorical-data

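I have a data set that records price changes, along with other variables. I want to turn the price column into a categorical variable. As far as I know, the two relevant tools in R seem to be dplyr and/or cut.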

> head(btc_data)
                 time  btc_price
1 2017-08-27 22:50:00 4,389.6113
2 2017-08-27 22:51:00 4,389.0850
3 2017-08-27 22:52:00 4,388.8625
4 2017-08-27 22:53:00 4,389.7888
5 2017-08-27 22:56:00 4,389.9138
6 2017-08-27 22:57:00 4,390.1663


>dput(btc_data)
        ("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
        "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
        "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
        "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
        "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
        "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
        "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
        "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
        "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
        "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
        "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
        "5,013.9075"), class = "factor")), .Names = c("time", "btc_price"
    ), class = "data.frame", row.names = c(NA, -10023L))

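The difficulty lies in the categories I want to create. The categories -1, 0, 1 should be based on the percentage change relative to a previous time lag.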

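So, for example, a price increase of 20% over the past 60 minutes would be labeled 1, otherwise 0. A price decrease of 20% over the past 60 minutes should be labeled -1, otherwise 0.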

Is this possible in R? What is the most efficient way to implement it?

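There are similar questions here and here, but they do not answer my question, for two reasons:

a) I am trying to compute the % change, not just the difference between 2 rows.

b) The calculation should be based on the max/min of a rolling past time window (i.e. a 20% decrease over the past hour = -1, a 20% increase over the past hour = 1).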

3 Answers:

Answer 0: (score: 0)

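Percentages are always tricky to work with. You need to be aware that everything is flexible: when you choose a reference for a difference (a running mean, a maximum, or something else), there are at least two variables on the reference side that you have to choose carefully. The same holds for the value you want to compare against the reference. Together this gives you an almost unlimited number of ways to calculate a percentage. That is the crux of your question.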

# create the data

dat <- c("4,972.0700", "4,972.1763", "4,972.6563", "4,972.9188", "4,972.9763", 
         "4,973.1575", "4,974.9038", "4,975.0913", "4,975.1738", "4,975.9325", 
         "4,976.0725", "4,976.1275", "4,976.1825", "4,976.1888", "4,979.0025", 
         "4,979.4800", "4,982.7375", "4,983.1813", "4,985.3438", "4,989.2075", 
         "4,989.7888", "4,990.1850", "4,991.4500", "4,991.6600", "4,992.5738", 
         "4,992.6900", "4,992.8025", "4,993.8388", "4,994.7013", "4,995.0788", 
         "4,995.8800", "4,996.3338", "4,996.4188", "4,996.6725", "4,996.7038", 
         "4,997.1538", "4,997.7375", "4,997.7750", "5,003.5150", "5,003.6288", 
         "5,003.9188", "5,004.2113", "5,005.1413", "5,005.2588", "5,007.2788", 
         "5,007.3125", "5,007.6788", "5,008.8600", "5,009.3975", "5,009.7175", 
         "5,010.8500", "5,011.4138", "5,011.9838", "5,013.1250", "5,013.4350", 
         "5,013.9075")
dat <- as.numeric(gsub(",","",dat))

# calculate the difference to the last minute
dd <- diff(dat)

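# calculate the running ratio to difference of the last minutes
interval <- 20
# start at interval + 1 so that every window (z - interval):z is complete
# (starting at z = interval would request index 0, which R silently drops)
steps <- (interval + 1):length(dd)
out <- sapply(steps, function(z) dd[z] / mean(dd[(z - interval):z]))

# calculate the running ratio to price of the last minutes
out2 <- sapply(steps, function(z) dat[z] / mean(dat[(z - interval):z]))

# build categories for difference-ratio
catego <- as.numeric(as.character(cut(out, breaks = c(-Inf, 0.8, 1.2, Inf), labels = c(-1, 0, 1))))
# out[1] belongs to dd[interval + 1], i.e. to dat[interval + 2],
# so pad with interval + 1 NAs to align catego with dat
catego <- c(rep(NA, interval + 1), catego)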


# plot
plot(dat, type="b", main="price original")
plot(dd, main="absolute difference to last minute", type="b")
plot(out, main=paste('difference to last minute, relative to "mean" of the last', interval, 'min'), type="b")
abline(h=c(0.8, 1.2), col="magenta")
plot(catego, main=paste("categories for", interval))
plot(out2, main=paste('price last minute, relative to "mean" of the last', interval, 'min'), type="b")

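I think what you are looking for is the way the last plot (price last minute, relative to "mean" of t...) is calculated. In this example its values vary between 1.0010 and 1.0025, far from the 0.8 and 1.2 you expect. You can make the differences larger by choosing a longer interval than 20 minutes, perhaps a week (11340), but even with such a long interval it is hard to reach values above 1.2. The problem is that a change of 10 is small relative to a high price of 5000.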
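To get a feel for the scale, here is a rough check that uses only the lowest and highest of the prices listed above. It is a sketch for illustration; the variable names are made up and are not part of the answer's code.

```r
# Lowest and highest price among the values listed above
lowest_price  <- 4972.0700
highest_price <- 5013.9075

# Relative change from the lowest to the highest listed price
total_change <- (highest_price - lowest_price) / lowest_price

# Price that a 20% rise from the lowest value would require
needed_for_20pct <- lowest_price * 1.2

print(round(100 * total_change, 2))  # change in percent
print(needed_for_20pct)
```

Even the full spread of the listed prices stays below 1%, so a 20% threshold cannot be triggered by this sample.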

You also have to take into account that your price is rising steadily, so you cannot get values below 1.

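In this calculation I used mean() over the running window of the last minutes. I am not sure, but I suspect that on the stock market you would use min() and max() over different time intervals as the reference: when the price rises you take min() as the reference, and when the price falls you take max(). All of this can be done in R.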
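The min()/max() idea could be sketched as follows. This is one possible reading, not a definitive implementation: the helper `categorise_by_range`, its parameters, and the toy series are all made up for illustration, and the window is counted in rows rather than minutes.

```r
# Hypothetical helper: compare each price with the min and max of the
# preceding `window` observations (the current price is not in the window).
categorise_by_range <- function(price, window = 60, threshold = 0.20) {
  n <- length(price)
  category <- rep(NA_integer_, n)
  if (n <= window) return(category)
  for (i in (window + 1):n) {
    past <- price[(i - window):(i - 1)]
    rise <- (price[i] - min(past)) / min(past)  # gain from the window low
    fall <- (price[i] - max(past)) / max(past)  # loss from the window high
    # a rise is checked first; otherwise a fall; otherwise no signal
    category[i] <- if (rise >= threshold) 1L else if (fall <= -threshold) -1L else 0L
  }
  category
}

# Toy series (made up): flat, a small move, a jump of 25%, then a drop of 30%
toy <- c(100, 100, 100, 101, 125, 125, 125, 87.5)
result <- categorise_by_range(toy, window = 3, threshold = 0.20)
print(result)
```

The first `window` entries stay NA because they have no complete history to compare against.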

Answer 1: (score: 0)

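Here is a simple approach that does not rely on the data.table package. If you only need 60-minute intervals, you first need to filter btc_data by the relevant 60-minute intervals.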

# make sure time is a date-time that can be sorted properly
btc_data$time <- as.POSIXct(btc_data$time)

# btc_price is a factor with thousands separators; convert it to numeric
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))

# sort data frame by time
btc_data <- btc_data[order(btc_data$time), ]

# calculate percentage change for a lag of one row (about 1 minute)
n <- nrow(btc_data)
btc_data$perc_change <- NA_real_
btc_data$perc_change[2:n] <- diff(btc_data$btc_price) / btc_data$btc_price[1:(n - 1)]

# create category column
# NOTE: first category entry will be NA
btc_data$category <- ifelse(btc_data$perc_change > 0.20, 1,
                            ifelse(btc_data$perc_change < -0.20, -1, 0))

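Using the data.table package and converting btc_data to a data.table would be a more efficient and faster approach. The package has a learning curve, but it comes with good vignettes and tutorials.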

Answer 2: (score: -2)

I cannot fully reproduce your example, but if I had to guess, you want to do something like this:

# btc_price is a factor with thousands separators; convert it to numeric
btc_data$btc_price <- as.numeric(gsub(",", "", as.character(btc_data$btc_price)))

# percentage change relative to the price 60 rows earlier
# (60 rows equal 60 minutes only if no minutes are missing)
n <- nrow(btc_data)
pct_change <- (btc_data$btc_price[61:n] - btc_data$btc_price[1:(n - 60)]) /
  btc_data$btc_price[1:(n - 60)]

# use -Inf/Inf as outer breaks so that no value falls outside the bins
new_category <- cut(pct_change, breaks = c(-Inf, -0.2, 0.2, Inf),
                    labels = c(-1, 0, 1))

# drop the first 60 rows, which have no reference price 60 rows back
btc_data.new <- data.frame(btc_data[61:n, ], new_category)
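
The sample in the question skips some minutes (22:53 is followed by 22:56), so a lag of 60 rows is not always a lag of 60 minutes. A possible time-based variant, in base R, looks up the latest observation that is at least one hour old. The helper `pct_change_by_time`, its parameters, and the toy data are assumptions made for illustration.

```r
# Hypothetical helper: for each row, find the latest observation that is at
# least `lag_secs` seconds old and compute the relative change against it.
pct_change_by_time <- function(time, price, lag_secs = 3600) {
  secs <- as.numeric(time)
  stopifnot(!is.unsorted(secs))
  # index of the last observation with time <= (time - lag_secs); 0 if none
  ref <- findInterval(secs - lag_secs, secs)
  ref[ref == 0] <- NA
  (price - price[ref]) / price[ref]
}

# Toy data (made up), with uneven gaps between timestamps
time  <- as.POSIXct("2017-08-27 22:00:00", tz = "UTC") + c(0, 30, 60, 65, 120) * 60
price <- c(100, 110, 125, 70, 120)

change   <- pct_change_by_time(time, price)
category <- cut(change, breaks = c(-Inf, -0.2, 0.2, Inf), labels = c(-1, 0, 1))
print(data.frame(time, price, change, category))
```

Rows with less than one hour of history get NA. Because cut() uses right-closed intervals by default, a change of exactly -0.2 falls into category -1 while a change of exactly 0.2 falls into category 0.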