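I am trying to clean up some pump station data that plant operators typed by hand into an Excel-based log workbook: a DATE plus STOP and START volume readings. The tricky part is that these three values were entered as repeating rows spread across multiple columns. It is hard to describe in words (and even harder to search for, in case anyone has hit a similar problem), so I am calling it "semi-melted". Here is what some of it looks like: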
structure(list(X1 = c("DATE", "STOP", "START", "DATE", "STOP",
"START", "DATE", "STOP", "START", "DATE"), X2 = c(43466, 654896,
654276, 43470, 657669, 656819, 43474, 660160, 659368, 43478),
X3 = c("DATE", "STOP", "START", "DATE", "STOP", "START",
"DATE", "STOP", "START", "DATE"), X4 = c(43467, 655298, 654896,
43471, 658268, 657669, 43475, 660977, 660160, 43479), X5 = c("DATE",
"STOP", "START", "DATE", "STOP", "START", "DATE", "STOP",
"START", "DATE"), X6 = c("43468", "655959", "655298", "43472",
"658620", "658268", "43476", "661774", "660977", "43480"),
X7 = c("DATE", "STOP", "START", "DATE", "STOP", "START",
"DATE", "STOP", "START", "DATE"), X8 = c("43469", "656819",
"655959", "43473", "659368", "658620", "43477", "662673",
"661774", "43481")), row.names = c(NA, 10L), class = "data.frame")
I want to tidy this into a time series with three columns: DATE, START and STOP. It should look like this:
Date Start Stop
1 43466 654276 654896
2 43470 656819 657669
3 43474 659368 660160
4 43478 662673 663168
5 43482 665148 665951
6 43486 667944 668537
7 43490 670950 671692
8 43494 673621 674418
9 43497 676090 676884
10 43501 678559 679399
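I never really got the hang of gather and spread (and I still really like melt and dcast), but I was glad to see the newer functions pivot_longer and pivot_wider. I suspect there is a neat solution using one of these functions, but I keep getting stuck because they expect the current column names ("X1" through "X8") to be meaningful, when in fact they are arbitrary.
Any suggestions?
Answer 0: (score: 1)
Here is one approach: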
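library(dplyr)
library(tidyr)
# as.matrix() coerces every column to character, so the column pairs can be stacked
df2 <- as.matrix(df)
rbind(df2[,1:2], df2[,3:4], df2[,5:6], df2[,7:8]) %>%
  as_tibble() %>%
  mutate(id = cumsum(X1 == "DATE")) %>%   # a new id starts at every DATE row
  spread(X1, X2, convert = TRUE) %>%
  arrange(DATE, START, STOP)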
# A tibble: 16 x 4
id DATE START STOP
<int> <int> <int> <int>
1 1 43466 654276 654896
2 5 43467 654896 655298
3 9 43468 655298 655959
4 13 43469 655959 656819
5 2 43470 656819 657669
6 6 43471 657669 658268
7 10 43472 658268 658620
8 14 43473 658620 659368
9 3 43474 659368 660160
10 7 43475 660160 660977
11 11 43476 660977 661774
12 15 43477 661774 662673
13 4 43478 NA NA
14 8 43479 NA NA
15 12 43480 NA NA
16 16 43481 NA NA
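To see where the id column comes from, and why rows 13 to 16 end up with NA, it may help to look at the cumsum step on its own. This is only a minimal base R sketch on a made-up label vector (the name labels is mine, not part of the answer): the comparison with "DATE" yields TRUE/FALSE, and cumsum counts the TRUEs seen so far, so each record gets its own id, including a trailing DATE that has no STOP or START after it.

```r
# Toy label column in the same DATE / STOP / START order as the log
labels <- c("DATE", "STOP", "START", "DATE", "STOP", "START", "DATE")

# TRUE counts as 1, so the running total goes up by one at every DATE row
id <- cumsum(labels == "DATE")
id

# The last DATE has no STOP/START rows, so its id holds a single value
table(id)
```

A record whose id holds only a DATE row has nothing to put in the START and STOP columns after spreading, which is what the NA rows in the output above reflect.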
Original data:
df
X1 X2 X3 X4 X5 X6 X7 X8
1 DATE 43466 DATE 43467 DATE 43468 DATE 43469
2 STOP 654896 STOP 655298 STOP 655959 STOP 656819
3 START 654276 START 654896 START 655298 START 655959
4 DATE 43470 DATE 43471 DATE 43472 DATE 43473
5 STOP 657669 STOP 658268 STOP 658620 STOP 659368
6 START 656819 START 657669 START 658268 START 658620
7 DATE 43474 DATE 43475 DATE 43476 DATE 43477
8 STOP 660160 STOP 660977 STOP 661774 STOP 662673
9 START 659368 START 660160 START 660977 START 661774
10 DATE 43478 DATE 43479 DATE 43480 DATE 43481
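If the real workbook has more than four column pairs, listing them by hand gets tedious. Here is a rough base R sketch of the same stacking idea for any even number of columns. The helper stack_pairs and the toy data frame are hypothetical and not part of the answer above, and the sketch assumes the labels always sit in the odd columns with their values immediately to the right.

```r
# Hypothetical helper: stack every (label, value) column pair into long form
stack_pairs <- function(d) {
  label_cols <- seq(1, ncol(d), by = 2)          # positions of X1, X3, X5, ...
  pieces <- lapply(label_cols, function(j) {
    data.frame(key   = as.character(d[[j]]),
               value = as.numeric(d[[j + 1]]),   # value columns may be character
               stringsAsFactors = FALSE)
  })
  long <- do.call(rbind, pieces)
  long$id <- cumsum(long$key == "DATE")          # new record at every DATE row
  long
}

# Toy input with two column pairs, using values from the sample data
toy <- data.frame(
  X1 = c("DATE", "STOP", "START"), X2 = c(43466, 654896, 654276),
  X3 = c("DATE", "STOP", "START"), X4 = c("43467", "655298", "654896"),
  stringsAsFactors = FALSE
)

long <- stack_pairs(toy)

# One row per id, one column per key; missing combinations become NA
wide <- tapply(long$value, list(id = long$id, key = long$key),
               function(v) v[1])
out <- as.data.frame(wide)
out
```

Once the data is in this long form, with a key, a value and an id column, tidyr::pivot_wider(long, names_from = key, values_from = value) should be able to do the last step as well, because the arbitrary X1 to X8 names are no longer involved.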
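Answer 1: (score: 0)
If you are open to it, I have a nice data.table solution, but it assumes that every date has both a start and a stop value, which is not the case in your example. So I keep only the first 9 rows: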
library(data.table)
# Keep rows 1 to 9; row 10 is a DATE with no STOP/START below it
df <- df[1:9, ]
df <- as.data.table(df)
Here is my solution in three steps:
melt_tot <- melt(df, measure.vars = c(paste0("X",which(1:8 %% 2 == 1)),paste0("X",which(1:8 %% 2 == 0))))
df2 <- data.table(type = melt_tot[1:(.N/2),value],
value = melt_tot[-(1:(.N/2)),value],
I = rep(1:(melt_tot[,.N]/(2*3)),each = 3) )
dcast(df2,I~type)
> dcast(df2,I~type)
I DATE START STOP
1: 1 43466 654276 654896
2: 2 43470 656819 657669
3: 3 43474 659368 660160
4: 4 43467 654896 655298
5: 5 43471 657669 658268
6: 6 43475 660160 660977
7: 7 43468 655298 655959
8: 8 43472 658268 658620
9: 9 43476 660977 661774
10: 10 43469 655959 656819
11: 11 43473 658620 659368
12: 12 43477 661774 662673
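The trick is to melt the data completely, listing the odd-numbered X columns first and the even-numbered X columns second: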
melt_tot <- melt(df, measure.vars = c(paste0("X",which(1:8 %% 2 == 1)),paste0("X",which(1:8 %% 2 == 0))))
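Then I split the value column in two: the first half holds the types (i.e. START, STOP or DATE) and the second half holds the values. I also create an index that is shared by each group of three types.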
df2 <- data.table(type = melt_tot[1:(.N/2),value],
value = melt_tot[-(1:(.N/2)),value],
I = rep(1:(melt_tot[,.N]/(2*3)),each = 3) )
> df2
type value I
1: DATE 43466 1
2: STOP 654896 1
3: START 654276 1
4: DATE 43470 2
5: STOP 657669 2
6: START 656819 2
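The half-and-half split can be checked on a plain vector. This is only a small base R sketch with a hand-written stand-in for the melted value column (the vector melted is made up from the first two records of the sample), assuming, as above, that all label columns were melted before all value columns.

```r
# Stand-in for melt_tot$value: all labels first, then all values
melted <- c("DATE", "STOP", "START", "DATE", "STOP", "START",
            "43466", "654896", "654276", "43470", "657669", "656819")

half  <- length(melted) / 2
type  <- melted[1:half]                    # first half: the labels
value <- as.numeric(melted[-(1:half)])     # second half: the readings
I     <- rep(1:(half / 3), each = 3)       # one index per DATE/STOP/START triple

data.frame(type, value, I)
```

The pairing only holds because the labels and the values are melted in the same column order, so position k in the first half belongs with position k in the second half.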
After that, all that is left is the dcast:
dcast(df2,I~type)
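The rows above come back grouped by column pair rather than in date order, and the DATE values look like Excel serial day numbers. A possible follow-up, sketched in base R on three rows from the sample: sort by DATE and convert it to a real date. The conversion assumes the workbook uses Excel's 1900 date system, where serial numbers count days from 1899-12-30; the data frame res is made up for the illustration.

```r
# A few rows from the sample, in the unsorted order a reshape might return
res <- data.frame(DATE  = c(43470, 43466, 43467),
                  START = c(656819, 654276, 654896),
                  STOP  = c(657669, 654896, 655298))

# Put the records in chronological order
res <- res[order(res$DATE), ]

# Assumption: Excel 1900 date system (day 0 is 1899-12-30)
res$DATE <- as.Date(res$DATE, origin = "1899-12-30")
res
```

If the reshaped columns come out as character, which can happen after melting mixed column types, wrapping them in as.numeric() first should make the same conversion work.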