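How to tidy data from different columns into one

Time: 2016-07-25 06:41:59

Tags: r tidyr

I am afraid my title does not fully reflect my question, so let me explain it as best I can. Here is part of my data.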

pdpass begyr1 gvkey1 endyr1 begyr2 gvkey2 endyr2 begyr3 gvkey3 endyr3 begyr4 gvkey4 endyr4 begyr5 gvkey5 endyr5
-16348   1991  26005   1993   1994   8852   1996   1997   3708   2005     NA     NA     NA     NA     NA     NA
-16179   2006 174876   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-16161   1993  30964   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-15631   1996 111491   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-14668   1988  15854   2003     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-13980   1986  13218   1995   1996   9921   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA

Since the data is wide, I am also including the dput output.

#dput

structure(list(pdpass = c(-16348L, -16179L, -16161L, -15631L, -14668L, -13980L),
begyr1 = c(1991L, 2006L, 1993L, 1996L, 1988L,  1986L), gvkey1 = c(26005L, 174876L, 30964L, 111491L, 15854L, 13218L), endyr1 = c(1993L, 2006L, 2005L, 2006L, 2003L, 1995L), 
begyr2 = c(1994L, NA, NA, NA, NA, 1996L), gvkey2 = c(8852L, NA, NA, NA, NA, 9921L), endyr2 = c(1996L, NA, NA, NA, NA, 2005L), 
begyr3 = c(1997L, NA, NA, NA, NA, NA), gvkey3 = c(3708L, NA, NA, NA, NA, NA), endyr3 = c(2005L, NA, NA, NA, NA, NA),
begyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
begyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, -6L), class = "data.frame", .Names = c("pdpass", "begyr1", "gvkey1", "endyr1", "begyr2", "gvkey2", "endyr2", "begyr3", "gvkey3", "endyr3", "begyr4", "gvkey4", "endyr4", "begyr5", "gvkey5", "endyr5"))

I want to tidy it into long format.

The pdpass column is the ID variable. I want to gather all gvkey, endyear, and begyear values for each pdpass value. However, as you can see from the raw data, some pdpass have only one set of gvkey, endyear, begyear values, while others have more than one. (The additional sets are denoted gvkey2, endyear2, begyear2, gvkey3 ... and so on.)

Basically, I only want to gather the gvkey-endyear-begyear values for each pdpass when they end with the same number.

To put it simply, this is my desired output:

   pdpass gvkeyN  gvkey begyearN begyear endyearN endyear
1  -16348 gvkey1  26005 begyear1    1991 endyear1    1993
2  -16348 gvkey2   8852 begyear2    1994 endyear2    1996
3  -16348 gvkey3   3708 begyear3    1997 endyear3    2005
4  -16179 gvkey1 174876 begyear1    2006 endyear1    2006
5  -16161 gvkey1  30964 begyear1    1993 endyear1    2005
6  -15631 gvkey1 111491 begyear1    1996 endyear1    2006
7  -14668 gvkey1  15854 begyear1    1988 endyear1    2003
8  -13980 gvkey1  13218 begyear1    1986 endyear1    1995
9  -13980 gvkey2  13218 begyear1    1996 endyear2    2005
10 -13956 gvkey1   8674 begyear1    1968 endyear2    1987

# code
desired <- data.frame(
  pdpass   = c(rep(-16348, 3), -16179, -16161, -15631, -14668, rep(-13980, 2), -13956),
  gvkeyN   = c("gvkey1", "gvkey2", "gvkey3", rep("gvkey1", 5), "gvkey2", "gvkey1"),
  gvkey    = c(26005, 8852, 3708, 174876, 30964, 111491, 15854, 13218, 13218, 8674),
  begyearN = c("begyear1", "begyear2", "begyear3", rep("begyear1", 5), "begyear1", "begyear1"),
  begyear  = c(1991, 1994, 1997, 2006, 1993, 1996, 1988, 1986, 1996, 1968),
  endyearN = c("endyear1", "endyear2", "endyear3", rep("endyear1", 5), "endyear2", "endyear2"),
  endyear  = c(1993, 1996, 2005, 2006, 2005, 2006, 2003, 1995, 2005, 1987))

This is what I have tried so far:

require(tidyr)
require(dplyr)  # arrange() comes from dplyr
# NBER is the wide data frame (see the dput above); each gather() handles one column group
d1 <- gather(NBER, gvkeyN, gvkey, -pdpass, -endyr1, -begyr1, -endyr2, -begyr2, -endyr3, -begyr3, -endyr4, -begyr4, -endyr5, -begyr5, na.rm = TRUE)
d2 <- gather(d1, begyrN, begyear, -gvkeyN, -gvkey, -pdpass, -endyr1, -endyr2, -endyr3, -endyr4, -endyr5, na.rm = TRUE)
d3 <- gather(d2, endyrN, endyear, -gvkeyN, -gvkey, -pdpass, -begyrN, -begyear, na.rm = TRUE)
d4 <- arrange(d3, pdpass)
head(d4, 10)

pdpass gvkeyN gvkey begyrN begyear endyrN endyear
1  -16348 gvkey1 26005 begyr1    1991 endyr1    1993
2  -16348 gvkey2  8852 begyr1    1991 endyr1    1993
3  -16348 gvkey3  3708 begyr1    1991 endyr1    1993
4  -16348 gvkey1 26005 begyr2    1994 endyr1    1993
5  -16348 gvkey2  8852 begyr2    1994 endyr1    1993
6  -16348 gvkey3  3708 begyr2    1994 endyr1    1993
7  -16348 gvkey1 26005 begyr3    1997 endyr1    1993
8  -16348 gvkey2  8852 begyr3    1997 endyr1    1993
9  -16348 gvkey3  3708 begyr3    1997 endyr1    1993
10 -16348 gvkey1 26005 begyr1    1991 endyr2    1996

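As you can see from the output of my attempt above, I get unnecessary rows. I only need the rows where the numbers at the end of gvkey, begyr, and endyr match exactly (for example, gvkey1 begyr1 endyr1; gvkey2 begyr2 endyr2; gvkey3 begyr3 endyr3; and so on). However, I get every combination of 1, 2, and 3. Therefore rows 2:9 above are all unnecessary.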

Perhaps I could somehow delete all the unnecessary rows manually after gathering. However, I think there may be a better way to use the gather function. I have looked at similar questions but could not find a solution to this problem. Can anyone help me with this?
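The manual clean-up mentioned above could look roughly like the following. This is only a sketch in base R: the data frame d3 below is a hypothetical miniature of the gathered result (one pdpass with every suffix combination), and suffix() is a helper introduced here for illustration, not part of tidyr.

```r
# Hypothetical miniature of the gathered result: all 27 suffix combinations
# for a single pdpass, as produced by gathering the three groups separately.
d3 <- expand.grid(
  gvkeyN = c("gvkey1", "gvkey2", "gvkey3"),
  begyrN = c("begyr1", "begyr2", "begyr3"),
  endyrN = c("endyr1", "endyr2", "endyr3"),
  stringsAsFactors = FALSE
)
d3$pdpass <- -16348L

# Illustrative helper: drop every non-digit to recover the trailing number.
suffix <- function(x) gsub("[^0-9]", "", x)

# Keep only the rows where all three suffixes agree.
keep <- suffix(d3$gvkeyN) == suffix(d3$begyrN) &
  suffix(d3$begyrN) == suffix(d3$endyrN)
matched <- d3[keep, ]
```

This works, but it first builds every combination and then throws most of them away, which is why a single reshaping step is preferable on larger data.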

1 answer:

Answer 0: (score: 6)

We can use melt from data.table:

library(data.table)
melt(setDT(df1), measure = patterns("^gvkey", "^begyr", "^endyr"), 
    na.rm = TRUE, value.name = c("gvkey", "begyear", "endyear"))
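
For readers who would rather stay within tidyr, a comparable reshape can be sketched with pivot_longer(). This assumes tidyr 1.0.0 or later, which is newer than this question; the small df1 below is a trimmed copy of two rows from the question's data, and the column name "spell" is a hypothetical label for the numeric suffix.

```r
library(tidyr)

# Two rows from the question's data, trimmed to three column groups.
df1 <- data.frame(
  pdpass = c(-16348L, -13980L),
  begyr1 = c(1991L, 1986L), gvkey1 = c(26005L, 13218L), endyr1 = c(1993L, 1995L),
  begyr2 = c(1994L, 1996L), gvkey2 = c(8852L, 9921L),   endyr2 = c(1996L, 2005L),
  begyr3 = c(1997L, NA),    gvkey3 = c(3708L, NA),      endyr3 = c(2005L, NA)
)

# ".value" keeps the name stem (begyr, gvkey, endyr) as output columns,
# while the trailing digits go into the hypothetical "spell" column.
long <- pivot_longer(
  df1,
  cols = -pdpass,
  names_to = c(".value", "spell"),
  names_pattern = "^([a-z]+)([0-9]+)$",
  values_drop_na = TRUE
)
```

Because the suffix is split off while reshaping, each output row only ever pairs begyr, gvkey, and endyr values that share the same number, so no cross combinations are created.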