Tidyr solution for semi-melted data

Date: 2019-07-05 18:15:04

Tags: r tidyr

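I'm trying to clean up some pumping-station data from an Excel-based log workbook, where plant operators entered the DATE and the STOP / START volume readings by hand. The tricky part is that these three values were entered as repeating rows spread across multiple columns. It's hard to put into words (and even harder to search for, in case anyone has had a similar problem), hence "semi-melted". Here is what some of it looks like: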

structure(list(X1 = c("DATE", "STOP", "START", "DATE", "STOP", 
"START", "DATE", "STOP", "START", "DATE"), X2 = c(43466, 654896, 
654276, 43470, 657669, 656819, 43474, 660160, 659368, 43478), 
X3 = c("DATE", "STOP", "START", "DATE", "STOP", "START", 
"DATE", "STOP", "START", "DATE"), X4 = c(43467, 655298, 654896, 
43471, 658268, 657669, 43475, 660977, 660160, 43479), X5 = c("DATE", 
"STOP", "START", "DATE", "STOP", "START", "DATE", "STOP", 
"START", "DATE"), X6 = c("43468", "655959", "655298", "43472", 
"658620", "658268", "43476", "661774", "660977", "43480"), 
X7 = c("DATE", "STOP", "START", "DATE", "STOP", "START", 
"DATE", "STOP", "START", "DATE"), X8 = c("43469", "656819", 
"655959", "43473", "659368", "658620", "43477", "662673", 
"661774", "43481")), row.names = c(NA, 10L), class = "data.frame")

I'd like to tidy it into a time series with three columns: DATE, START, and STOP. It should look something like this:

     Date  Start   Stop
1  43466 654276 654896
2  43470 656819 657669
3  43474 659368 660160
4  43478 662673 663168
5  43482 665148 665951
6  43486 667944 668537
7  43490 670950 671692
8  43494 673621 674418
9  43497 676090 676884
10 43501 678559 679399

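I've never had a good grasp of the gather and spread functions (I still very much prefer melt and dcast), but I was happy to see the newer functions pivot_longer and pivot_wider. There is probably a nice solution using any of these functions, but I keep getting stuck because they expect the current column names ("X1" through "X8") to be meaningful, when in fact they are arbitrary.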

Any suggestions?

2 Answers:

Answer 0: (score: 1)

Here's one way:

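library(dplyr)
library(tidyr)

# Coerce to a character matrix so the column pairs can be stacked
df2 <- as.matrix(df)
rbind(df2[,1:2], df2[,3:4], df2[,5:6], df2[,7:8]) %>% 
  as_tibble() %>%
  mutate(id = cumsum(X1 == "DATE")) %>%  # each DATE row starts a new record
  spread(X1, X2, convert = TRUE) %>% 
  arrange(DATE, START, STOP)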

# A tibble: 16 x 4
      id  DATE  START   STOP
   <int> <int>  <int>  <int>
 1     1 43466 654276 654896
 2     5 43467 654896 655298
 3     9 43468 655298 655959
 4    13 43469 655959 656819
 5     2 43470 656819 657669
 6     6 43471 657669 658268
 7    10 43472 658268 658620
 8    14 43473 658620 659368
 9     3 43474 659368 660160
10     7 43475 660160 660977
11    11 43476 660977 661774
12    15 43477 661774 662673
13     4 43478     NA     NA
14     8 43479     NA     NA
15    12 43480     NA     NA
16    16 43481     NA     NA

Original data:

df
      X1     X2    X3     X4    X5     X6    X7     X8
1   DATE  43466  DATE  43467  DATE  43468  DATE  43469
2   STOP 654896  STOP 655298  STOP 655959  STOP 656819
3  START 654276 START 654896 START 655298 START 655959
4   DATE  43470  DATE  43471  DATE  43472  DATE  43473
5   STOP 657669  STOP 658268  STOP 658620  STOP 659368
6  START 656819 START 657669 START 658268 START 658620
7   DATE  43474  DATE  43475  DATE  43476  DATE  43477
8   STOP 660160  STOP 660977  STOP 661774  STOP 662673
9  START 659368 START 660160 START 660977 START 661774
10  DATE  43478  DATE  43479  DATE  43480  DATE  43481
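
The same idea can be written without hard-coding the four column pairs, and with pivot_wider in place of spread. This is only a minimal sketch, not part of the answer above: it assumes that columns alternate strictly between a label column and a reading column and that every label column starts with DATE. The small `df` here is a made-up stand-in for the logbook, and `key_cols`, `val_cols`, `long` and `tidy` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the logbook: two label/reading column pairs, with one
# reading column stored as text, as in the question's data.
df <- data.frame(
  X1 = c("DATE", "STOP", "START", "DATE"),
  X2 = c(43466, 654896, 654276, 43470),
  X3 = c("DATE", "STOP", "START", "DATE"),
  X4 = c("43467", "655298", "654896", "43471"),
  stringsAsFactors = FALSE
)

key_cols <- seq(1, ncol(df), by = 2)  # X1, X3, ...: labels
val_cols <- key_cols + 1              # X2, X4, ...: readings

# Stack all label columns and all reading columns, pair by pair.
# Each reading column is converted to numeric before stacking.
long <- tibble(
  key   = unlist(df[key_cols], use.names = FALSE),
  value = unlist(lapply(df[val_cols], as.numeric), use.names = FALSE)
)

tidy <- long %>%
  mutate(id = cumsum(key == "DATE")) %>%  # each DATE label opens a new record
  pivot_wider(names_from = key, values_from = value) %>%
  arrange(DATE) %>%
  select(DATE, START, STOP)

print(tidy)
```

A trailing DATE without readings ends up with NA in START and STOP, as in the output above. If the DATE values are Excel serial day numbers, which the workbook origin suggests but the question does not confirm, `as.Date(tidy$DATE, origin = "1899-12-30")` would turn them into dates.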

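Answer 1: (score: 0)

If you're open to it, I have a nice data.table solution, but it assumes that every date has both a START and a STOP, which is not the case in your example. So I keep only the first 9 rows: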

library(data.table)
df <- df[1:9, ]  # first 9 rows; df[1:9] on a data.frame would select columns
df <- as.data.table(df)

Here it is in three lines:

melt_tot <- melt(df, measure.vars = c(paste0("X",which(1:8 %% 2 == 1)),paste0("X",which(1:8 %% 2 == 0))))
df2 <- data.table(type = melt_tot[1:(.N/2),value],
              value = melt_tot[-(1:(.N/2)),value],
              I = rep(1:(melt_tot[,.N]/(2*3)),each = 3) )
dcast(df2,I~type)

> dcast(df2,I~type)
     I  DATE  START   STOP
 1:  1 43466 654276 654896
 2:  2 43470 656819 657669
 3:  3 43474 659368 660160
 4:  4 43467 654896 655298
 5:  5 43471 657669 658268
 6:  6 43475 660160 660977
 7:  7 43468 655298 655959
 8:  8 43472 658268 658620
 9:  9 43476 660977 661774
10: 10 43469 655959 656819
11: 11 43473 658620 659368
12: 12 43477 661774 662673

The trick is to melt the data completely, listing the odd X columns first and the even X columns second:

melt_tot <- melt(df, measure.vars = c(paste0("X",which(1:8 %% 2 == 1)),paste0("X",which(1:8 %% 2 == 0))))

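Then I split the value column in two: the first half holds the types (i.e. START, STOP, or DATE) and the second half holds the values. I also create an index that repeats over each group of three types.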

df2 <- data.table(type = melt_tot[1:(.N/2),value],
                  value = melt_tot[-(1:(.N/2)),value],
                  I = rep(1:(melt_tot[,.N]/(2*3)),each = 3) )

> df2
     type  value  I
 1:  DATE  43466  1
 2:  STOP 654896  1
 3: START 654276  1
 4:  DATE  43470  2
 5:  STOP 657669  2
 6: START 656819  2

Then I just need to dcast:

dcast(df2,I~type)
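
As a possible variation on the steps above, melt in data.table also accepts a list of measure.vars, which melts the odd and even columns into two separate columns in one call, so the value column does not need to be split by hand. This is a rough sketch on a made-up two-pair table, not the code from the answer: `dt`, `odd`, `even`, `long` and `wide` are illustrative names, and it assumes the reading columns can all be converted to numeric.

```r
library(data.table)

# Made-up stand-in with two label/reading pairs; X4 is stored as text
dt <- data.table(
  X1 = c("DATE", "STOP", "START"),
  X2 = c(43466, 654896, 654276),
  X3 = c("DATE", "STOP", "START"),
  X4 = c("43467", "655298", "654896")
)

odd  <- paste0("X", seq(1, ncol(dt), by = 2))  # label columns
even <- paste0("X", seq(2, ncol(dt), by = 2))  # reading columns

# Give every reading column the same type before melting
dt[, (even) := lapply(.SD, as.numeric), .SDcols = even]

# Carry the original row number along as an id column
dt[, row := .I]

# Melt labels and readings into two parallel columns
long <- melt(dt, id.vars = "row", measure.vars = list(odd, even),
             value.name = c("type", "value"))

# Each DATE label opens a new record
long[, id := cumsum(type == "DATE")]

wide <- dcast(long, id ~ type, value.var = "value")
print(wide)
```

Because the record index comes from `cumsum(type == "DATE")` rather than from a fixed group size of three, this version does not strictly need the incomplete last row to be dropped first. Passing `value.var` explicitly also avoids dcast having to guess the value column.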