长度不等的列和标题埋在这些列中

时间:2017-07-06 13:49:28

标签: python r data-cleaning

我有一些数据,其中每隔一列对应一个特定的时间,每个时间段都分别“购买”#39;和'销售'位置,这些位置中的每一个都有两个因素(如下所示)。然而,这些列的长度不等,所以'出售'选项从不同的行开始(隐藏在值中)。

time, time1, time,  time2,  time, time3
buy,       , buy,        ,  buy,    
factor1,  1, factor1,    2, factor1,  3
factor2,  4, factor2,    5, factor2,  6
factor1,  7, factor1,    8, factor1,  9
factor2, 10, factor2,   11, factor2, 12
factor1, 13, sell,        , factor1, 14
factor2, 15, factor1,   16, factor2, 17
sell,      , factor2,   18, factor1, 19
factor1, 20, ,            , factor2, 21,
factor2, 22, ,            , sell,     
,          , ,            , factor1, 23
,          , ,            , factor2, 24
,          , ,            , factor1, 25
,          , ,            , factor2, 26

最终,我希望我的表格结构如下。

time,   position,   factor,     value
time1,  buy,        factor1,    1
time1,  buy,        factor2,    4
time1,  buy,        factor1,    7
time1,  buy,        factor2,    10
time1,  buy,        factor1,    13
time1,  buy,        factor2,    15
time1,  sell,       factor1,    20
time1,  sell,       factor2,    22
time2,  buy,        factor1,    2
time2,  buy,        factor2,    5
time2,  buy,        factor1,    8
time2,  buy,        factor2,    11
time2,  sell,       factor1,    16
time2,  sell,       factor2,    18
time3,  buy,        factor1,    3
time3,  buy,        factor2,    6
time3,  buy,        factor1,    9
time3,  buy,        factor2,    12
time3,  buy,        factor1,    14
time3,  buy,        factor2,    17
time3,  buy,        factor1,    19
time3,  buy,        factor2,    21
time3,  sell,       factor1,    23
time3,  sell,       factor2,    24
time3,  sell,       factor1,    25
time3,  sell,       factor2,    26

我能够提取索引,然后分别创建' buy'和'销售' R中列出但我不确定这是否是最简单的方法(我有很多这样的文件,并且更喜欢快速自动方法)。我也愿意在Python中进行转换,而不是R。

# For each column find the index of buy, sell (and the corresponding empty cell)
idx = apply(data, 2, function(x) which(x %in% c("buy","sell",""))[1:3] )
# NA indicates that the empty cell is the last
idx[is.na(idx)] = nrow(data)

i = 0
buy = list( apply(idx, 2, function(x) {
  i <<- i+1
  data[seq(x[1]+1,x[2]),i]
}) )
i = 0
sell = list( apply(idx, 2, function(x) {
  i <<- i+1
  data[seq(x[2]+1,x[3]),i]
}) )

1 个答案:

答案 0 :(得分:2)

我决定在一个长格式数据集中首先组合3组2列。然后按结转的最后一个已知值(tidyr::fill)填写位置列,并通过过滤列值过滤掉垃圾。

以下是工作示例:

library(dplyr)
library(tidyr)

str <- "
time, time1, time,  time2,  time, time3
buy,       , buy,        ,  buy,    
factor1,  1, factor1,    2, factor1,  3
factor2,  4, factor2,    5, factor2,  6
factor1,  7, factor1,    8, factor1,  9
factor2, 10, factor2,   11, factor2, 12
factor1, 13, sell,        , factor1, 14
factor2, 15, factor1,   16, factor2, 17
sell,      , factor2,   18, factor1, 19
factor1, 20, ,            , factor2, 21,
factor2, 22, ,            , sell,     
,          , ,            , factor1, 23
,          , ,            , factor2, 24
,          , ,            , factor1, 25
,          , ,            , factor2, 26
"

strfile <- textConnection(str)

raw <- read.table(strfile, header = F, sep = ",", stringsAsFactors = F)

library(dplyr)
library(tidyr)
dt <- do.call(rbind, lapply(1:3, function(x) {
  p <- raw[,c(x*2-1,x*2)]
  names(p) <- c('factor', 'value')
  p$time <- x
  p
  })
)

dt %>% 
  mutate(position = if_else(trimws(factor) %in% c('buy','sell'),as.character(factor),as.character(NA)),
         value = as.numeric(value)) %>%
  fill(position) %>% filter(!is.na(value))

结果:

factor value time position
1   factor1     1    1      buy
2   factor2     4    1      buy
3   factor1     7    1      buy
4   factor2    10    1      buy
5   factor1    13    1      buy
6   factor2    15    1      buy
7   factor1    20    1     sell
8   factor2    22    1     sell
9   factor1     2    2      buy
10  factor2     5    2      buy
11  factor1     8    2      buy
12  factor2    11    2      buy
13  factor1    16    2     sell
14  factor2    18    2     sell
15  factor1     3    3      buy
16  factor2     6    3      buy
17  factor1     9    3      buy
18  factor2    12    3      buy
19  factor1    14    3      buy
20  factor2    17    3      buy
21  factor1    19    3      buy
22  factor2    21    3      buy
23  factor1    23    3     sell
24  factor2    24    3     sell
25  factor1    25    3     sell
26  factor2    26    3     sell