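I have some data in which every other column corresponds to a specific time. Each time period has separate 'buy' and 'sell' positions, and each position has two factors (shown below). However, the columns are of unequal length, so the 'sell' section starts at a different row in each column (the marker is embedded among the values).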
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
Ultimately, I would like my table to be structured as follows.
time, position, factor, value
time1, buy, factor1, 1
time1, buy, factor2, 4
time1, buy, factor1, 7
time1, buy, factor2, 10
time1, buy, factor1, 13
time1, buy, factor2, 15
time1, sell, factor1, 20
time1, sell, factor2, 22
time2, buy, factor1, 2
time2, buy, factor2, 5
time2, buy, factor1, 8
time2, buy, factor2, 11
time2, sell, factor1, 16
time2, sell, factor2, 18
time3, buy, factor1, 3
time3, buy, factor2, 6
time3, buy, factor1, 9
time3, buy, factor2, 12
time3, buy, factor1, 14
time3, buy, factor2, 17
time3, buy, factor1, 19
time3, buy, factor2, 21
time3, sell, factor1, 23
time3, sell, factor2, 24
time3, sell, factor1, 25
time3, sell, factor2, 26
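I was able to extract the indices and then create separate 'buy' and 'sell' lists in R, but I am not sure this is the simplest approach (I have many files like this and would prefer a fast, automated method). I am also open to doing the transformation in Python instead of R.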
# For each column find the index of buy, sell (and the corresponding empty cell)
idx = apply(data, 2, function(x) which(x %in% c("buy","sell",""))[1:3] )
# NA indicates that the empty cell is the last
idx[is.na(idx)] = nrow(data)
i = 0
buy = list( apply(idx, 2, function(x) {
i <<- i+1
data[seq(x[1]+1,x[2]),i]
}) )
i = 0
sell = list( apply(idx, 2, function(x) {
i <<- i+1
data[seq(x[2]+1,x[3]),i]
}) )
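Answer 0 (score: 2)
I decided to first combine the three sets of two columns into one long-format dataset. I then fill the position column by carrying the last known value forward (tidyr::fill) and drop the junk rows by filtering on the value column.
Here is a working example: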
library(dplyr)
library(tidyr)
str <- "
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
"
strfile <- textConnection(str)
raw <- read.table(strfile, header = FALSE, sep = ",", stringsAsFactors = FALSE)
# Stack the three (label, value) column pairs into one long data frame
dt <- do.call(rbind, lapply(1:3, function(x) {
  p <- raw[, c(x * 2 - 1, x * 2)]
  names(p) <- c('factor', 'value')
  p$time <- x  # index of the column pair (1 to 3)
  p
}))
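dt %>%
  mutate(factor = trimws(factor),
         # Rows labelled buy/sell mark the start of a position block
         position = if_else(factor %in% c('buy', 'sell'), factor, NA_character_),
         # Non-numeric cells (header labels, blanks) become NA, with a coercion warning
         value = as.numeric(value)) %>%
  fill(position) %>%      # carry the last seen position forward
  filter(!is.na(value))   # drop marker, header and padding rows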
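Result (the time column holds the index of the column pair, 1 to 3, rather than the labels time1 to time3):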
factor value time position
1 factor1 1 1 buy
2 factor2 4 1 buy
3 factor1 7 1 buy
4 factor2 10 1 buy
5 factor1 13 1 buy
6 factor2 15 1 buy
7 factor1 20 1 sell
8 factor2 22 1 sell
9 factor1 2 2 buy
10 factor2 5 2 buy
11 factor1 8 2 buy
12 factor2 11 2 buy
13 factor1 16 2 sell
14 factor2 18 2 sell
15 factor1 3 3 buy
16 factor2 6 3 buy
17 factor1 9 3 buy
18 factor2 12 3 buy
19 factor1 14 3 buy
20 factor2 17 3 buy
21 factor1 19 3 buy
22 factor2 21 3 buy
23 factor1 23 3 sell
24 factor2 24 3 sell
25 factor1 25 3 sell
26 factor2 26 3 sell
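Since the question mentions being open to Python, here is a minimal sketch of the same carry-forward idea using only the standard library. It is not part of the R answer above. The function name `reshape_long` and the inline `RAW` string are illustrative only, and the sketch assumes the layout shown in the question: columns come in (label, value) pairs, the first row names the time in each value column, and `buy` / `sell` markers appear in the label column.

```python
import csv
import io

# Sample data copied from the question (illustrative inline string).
RAW = """\
time, time1, time, time2, time, time3
buy, , buy, , buy,
factor1, 1, factor1, 2, factor1, 3
factor2, 4, factor2, 5, factor2, 6
factor1, 7, factor1, 8, factor1, 9
factor2, 10, factor2, 11, factor2, 12
factor1, 13, sell, , factor1, 14
factor2, 15, factor1, 16, factor2, 17
sell, , factor2, 18, factor1, 19
factor1, 20, , , factor2, 21
factor2, 22, , , sell,
, , , , factor1, 23
, , , , factor2, 24
, , , , factor1, 25
, , , , factor2, 26
"""


def reshape_long(text):
    """Return (time, position, factor, value) tuples from the wide layout."""
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    records = []
    # Walk the (label, value) column pairs; the value column's header is the time.
    for col in range(0, len(header) - 1, 2):
        time = header[col + 1]
        position = None
        for row in body:
            label = row[col] if col < len(row) else ""
            value = row[col + 1] if col + 1 < len(row) else ""
            if label in ("buy", "sell"):
                position = label  # remember the marker for the rows that follow
            elif label and value:
                records.append((time, position, label, int(value)))
    return records


records = reshape_long(RAW)
print("time, position, factor, value")
for rec in records:
    print(", ".join(str(field) for field in rec))
```

For many files, the same function could be called on each file's contents in a loop. Unlike the R example, this keeps the `time1` to `time3` labels from the header row instead of the pair index.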