Reshaping from wide to long in R

Time: 2014-09-10 14:39:15

Tags: r reshape

I am trying to learn R and have a question about reshaping the following dataset.

bankname,date,year,month,quarter,totalliabilities,corr1,amt1,corr2,amt2
Bank of Pittsgurgh,2/7/1950,1950,2,1,237991,#N/A,#N/A,#N/A,#N/A
Bank of Pittsgurgh,5/2/1950,1950,5,2,258865,#N/A,#N/A,#N/A,#N/A
Bank of Pittsgurgh,8/7/1950,1950,8,3,218524,#N/A,#N/A,#N/A,#N/A,#N/A
Bank of Pittsgurgh,11/6/1950,1950,11,4,237520,First Bank,17472,Third Bank,30711
The Arsenal Bank,2/2/1950,1950,2,1,218508,#N/A,#N/A,#N/A,#N/A
The Arsenal Bank,5/3/1950,1950,5,2,224110,#N/A,#N/A,#N/A,#N/A
The Arsenal Bank,8/2/1950,1950,8,3,216071,#N/A,#N/A,#N/A,#N/A
The Arsenal Bank,11/1/1950,1950,11,4,226166,National Bank,20966,Trust Company,873

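When I run the code below to reshape the data, I get the error shown after it. How can I fix this? I would also like to parse the amt variables as numeric and remove the #N/A values from this dataset. How can I parse these variables?

- First I tried to create an "id"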

bank_test2$id<-as.numeric(as.factor(bank_test2$bankname))

- Then I tried to create a unique time variable from year and quarter

bank_test2$yq<-as.factor(paste(as.character(bank_test2$year),as.character(bank_test2$quarter)))   
bank_test2<-bank_test2[with(bank_test2, order(yq,id)),]   

- Reshape the data

v <- outer(c("corr", "amt"), c(1:2), FUN=paste0)   
bank_test2<-reshape(bank_test2, direction='long', varying=c(v), sep='')      


Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L],  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1.1’, ‘2.1’ 

id, bankname,   date,   year,   month,  quarter,    totalliabilities,   node,   corr,   amt      
1,  Bank of Pittsgurgh, 2/7/1950,   1950,   2,  1,  237991, 1,  #N/A,   #N/A      
1,  Bank of Pittsgurgh, 5/2/1950,   1950,   5,  2,  258865, 1,  #N/A,   #N/A   
1,  Bank of Pittsgurgh, 8/7/1950,   1950,   8,  3,  218524, 1,  #N/A,   #N/A   
1,  Bank of Pittsgurgh, 11/6/1950,  1950,   11, 4,  237520, 1,  First Bank, 21906   
1,  Bank of Pittsgurgh, 2/7/1950,   1950,   2,  1,  237991, 2,  #N/A,   #N/A   
1,  Bank of Pittsgurgh, 5/2/1950,   1950,   5,  2,  258865, 2,  #N/A,   #N/A   
1,  Bank of Pittsgurgh, 8/7/1950,   1950,   8,  3,  218524, 2,  #N/A,   #N/A   
1,  Bank of Pittsgurgh, 11/6/1950,  1950,   11, 4,  237520, 2,  Third Bank, 4442   
2,  The Arsenal Bank,   2/2/1950,   1950,   2,  1,  218508, 1,  #N/A,   #N/A   
2,  The Arsenal Bank,   5/3/1950,   1950,   5,  2,  224110, 1,  #N/A,   #N/A   
2,  The Arsenal Bank,   8/2/1950,   1950,   8,  3,  216071, 1,  #N/A,   #N/A   
2,  The Arsenal Bank,   11/1/1950,  1950,   11, 4,  226166, 1,  National Bank, 43224      
2,  The Arsenal Bank,   2/2/1950,   1950,   2,  1,  218508, 2,  #N/A,   #N/A   
2,  The Arsenal Bank,   5/3/1950,   1950,   5,  2,  224110, 2,  #N/A,   #N/A   
2,  The Arsenal Bank,   8/2/1950,   1950,   8,  3,  216071, 2,  #N/A,   #N/A   
2,  The Arsenal Bank,   11/1/1950,  1950,   11, 4,  226166, 2,  Trust Company,  3682   

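I would like to organize the data as shown above, using a bank id newly created from "bankname", with unique row names built from the id and time values. I would then like to remove all the #N/A values from the dataset. How should I go about this?

Thanks in advance.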

2 Answers:

Answer 0: (score: 0)

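This particular error is complaining that the row names are not unique. To avoid it, you need to pass reshape an "idvar" that uniquely identifies each row. The best way is to create a new column holding such a unique ID in the original data frame, but you can also use any other field whose values are unique. For example, totalliabilities happens to be unique in your data frame, so you could use: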

bank_test2<-reshape(bank_test2, direction='long', varying=c(v), sep='',idvar="totalliabilities")

This is obviously not the best choice of ID, but I hope it points you in the right direction.
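
To make the idea of a dedicated ID column concrete, here is a minimal sketch on a four-row stand-in for the question's data. The column name rowid is my own choice, and the #N/A cells are typed in as NA for brevity. As far as I can tell, reshape() defaults to idvar = "id", and because the question's data frame already has an id column holding the bank number (which repeats across quarters), the generated row names such as 1.1 collide. A per-row key avoids that.

```r
# Four-row stand-in for the question's data (values copied from the sample rows;
# the #N/A cells are entered as NA here)
bank <- data.frame(
  bankname = rep(c("Bank of Pittsgurgh", "The Arsenal Bank"), each = 2),
  year     = 1950L,
  quarter  = rep(3:4, times = 2),
  corr1    = c(NA, "First Bank", NA, "National Bank"),
  amt1     = c(NA, 17472, NA, 20966),
  corr2    = c(NA, "Third Bank", NA, "Trust Company"),
  amt2     = c(NA, 30711, NA, 873),
  stringsAsFactors = FALSE
)

# Hypothetical surrogate key: exactly one value per wide row
bank$rowid <- seq_len(nrow(bank))

long <- reshape(
  bank,
  direction = "long",
  varying   = list(c("corr1", "corr2"), c("amt1", "amt2")),
  v.names   = c("corr", "amt"),
  timevar   = "node",   # 1 for corr1/amt1, 2 for corr2/amt2
  times     = 1:2,
  idvar     = "rowid"
)

# Drop the rows that had no correspondent (the former #N/A cells)
long <- long[!is.na(long$corr), ]
long
```

The row names of the result are built from rowid and node, so they are unique by construction. If you prefer row names based on bank and quarter, a key such as paste(id, yq) should work the same way, as long as it is unique per wide row.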

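Answer 1: (score: 0)

I have tried to provide the data in a form that is easy to use and reproduce. I then took a subset of your data, b, and tried to put it into long format. I am not sure whether this is the desired output.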

library(reshape2)
library(stringr)

a <- structure(list(
  bankname = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bank of Pittsgurgh", "The Arsenal Bank_Pittsburgh"), class = "factor"),
  date = structure(c(2L, 3L, 6L, 8L, 9L, 12L, 13L, 15L, 1L, 4L, 5L, 7L, 10L, 11L, 14L, 16L), .Label = c("1950/02/02", "1950/02/07", "1950/05/02", "1950/05/03", "1950/08/02", "1950/08/07", "1950/11/01", "1950/11/06", "1951/02/05", "1951/02/06", "1951/05/01", "1951/05/07", "1951/08/06", "1951/08/07", "1951/11/03", "1951/11/06"), class = "factor"),
  year = c(1950L, 1950L, 1950L, 1950L, 1951L, 1951L, 1951L, 1951L, 1950L, 1950L, 1950L, 1950L, 1951L, 1951L, 1951L, 1951L),
  month = c(2L, 5L, 8L, 11L, 2L, 5L, 8L, 11L, 2L, 5L, 8L, 11L, 2L, 5L, 8L, 11L),
  quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  totalliabilities = c(237991.5469, 258865.6563, 218524, 237520.5469, 276052.1875, 255812.7031, 62426.625, 272447.375, 218508.4844, 224110.5156, 216071.9063, 226166.7969, 244241.625, 228508.0625, 254008.8594, 268540.1563),
  corr1 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L), .Label = c("#N/A", "First National Bank", "National Bank of Commerce"), class = "factor"),
  amt1 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 5L), .Label = c("#N/A", "17472.98047", "20966.50977", "21906.07031", "43224.62891"), class = "factor"),
  corr2 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 4L), .Label = c("#N/A", "Third National Bank", "Third National Bank", "Union Trust Company", "Unit Trust Company Of New York"), class = "factor"),
  amt2 = structure(c(1L, 1L, 1L, 2L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 3L), .Label = c("#N/A", "30711.35938", "3682.449951", "4442.399902", "873.1699829"), class = "factor"),
  X = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "#N/A"), class = "factor"),
  id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)),
  .Names = c("bankname", "date", "year", "month", "quarter", "totalliabilities", "corr1", "amt1", "corr2", "amt2", "X", "id"),
  row.names = c(NA, 16L), class = "data.frame")


# keep three rows that have correspondents, and only the relevant columns
b <- a[c(8, 12, 16), c(1, 2, 7, 8, 9, 10)]
b
# combine corr1 and amt1 into one column type1; likewise corr2 and amt2 into type2
b$type1 <- paste0(b$corr1, "|", b$amt1)
b$type2 <- paste0(b$corr2, "|", b$amt2)

# melt the two type columns (columns 7 and 8) together
c <- melt(b, measure.vars = c(7, 8))

c
# split them back into correspondent name and amount
long <- data.frame(str_split_fixed(c$value, "\\|", 2))
d <- cbind(c, long)

d[, c(1, 9, 10)]


#                     bankname                             X1          X2
#1          Bank of Pittsgurgh            First National Bank 21906.07031
#2 The Arsenal Bank_Pittsburgh      National Bank of Commerce 20966.50977
#3 The Arsenal Bank_Pittsburgh      National Bank of Commerce 43224.62891
#4          Bank of Pittsgurgh            Third National Bank 4442.399902
#5 The Arsenal Bank_Pittsburgh Unit Trust Company Of New York 873.1699829
#6 The Arsenal Bank_Pittsburgh            Union Trust Company 3682.449951
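
The X2 column in the output above is still text, and the question also asked how to turn amt into a number and drop the #N/A entries. One possible approach, sketched below on a three-row excerpt typed in by hand, is to tell read.csv() that "#N/A" means missing via na.strings. For a column that has already been read in as a factor, converting through as.character() first avoids ending up with the integer level codes. The names csv_text, bank, bank_complete, amt_raw and amt_num are illustrative only.

```r
# Hand-typed excerpt of the question's file (a subset of its columns)
csv_text <- "bankname,date,totalliabilities,corr1,amt1,corr2,amt2
Bank of Pittsgurgh,8/7/1950,218524,#N/A,#N/A,#N/A,#N/A
Bank of Pittsgurgh,11/6/1950,237520,First Bank,17472,Third Bank,30711
The Arsenal Bank,11/1/1950,226166,National Bank,20966,Trust Company,873"

# na.strings turns every "#N/A" cell into NA while reading,
# so amt1 and amt2 are parsed as numbers rather than text
bank <- read.csv(text = csv_text, na.strings = "#N/A", stringsAsFactors = FALSE)
str(bank$amt1)

# Keep only the rows that actually have a first correspondent
bank_complete <- bank[!is.na(bank$corr1), ]

# For a column already stored as a factor: go through character,
# blank out the "#N/A" marker, then convert to numeric
amt_raw <- factor(c("#N/A", "17472.98047", "#N/A"))
amt_num <- as.numeric(replace(as.character(amt_raw), amt_raw == "#N/A", NA))
amt_num
```

Calling as.numeric() directly on a factor would return the level codes (1, 2, ...) instead of the amounts, which is why the intermediate as.character() step is there.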