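Reshape long-format data to wide format using row names as column headers

Time: 2016-01-26 22:40:05

Tags: r reshape spread

I have already looked at the following questions on Stack Overflow, along with other linked R help resources: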

R: Reshape Data Long to Wide - understanding reshape parameters

Reshape long to wide with multiple groupings

How to reshape data from long to wide format?

http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

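I want to take data in long format (two columns, 6200 entries), with non-unique names in the "id" column and a different sequence in each row of the "sequence" column, and reshape it to wide format, where the column headers are the "id" values and every sequence belonging to an id is listed beneath it.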

          id    sequence
1   CK1alpha TPSIAsDISLP
2   CK1alpha IASDIsLPIAT
3       CDK1 SVPSSsPGTSV
4       CDK1 EGCQGsPQRRG
5   CK1alpha DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE

What I would like to get:

id          CK1alpha    CDK1        PKCepsilon
sequence    TPSIAsDISLP SVPSSsPGTSV
            IASDIsLPIAT EGCQGsPQRRG
            DICEDsDIDGD 

I tried using reshape:

kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")

However, I get warning messages that multiple rows match:

Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar,  ... :
  multiple rows match for sequence=RSQSRsNSPLP: first taken
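The "first taken" wording suggests that some id/sequence pairs occur more than once in the data, and that reshape keeps only the first row for each such pair. A minimal sketch for checking this, using a small made-up data frame in place of kinase_substrate (the repeated last row is an assumption added purely for illustration):

```r
# Hypothetical stand-in for kinase_substrate; the last row repeats the first on purpose.
kinase_substrate <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CK1alpha"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV", "TPSIAsDISLP"),
  stringsAsFactors = FALSE
)

# Rows whose (id, sequence) pair has already appeared earlier in the data
dup_rows <- kinase_substrate[duplicated(kinase_substrate[c("id", "sequence")]), ]
print(dup_rows)

# Number of rows per id: any count above 1 means id alone cannot identify a row
rows_per_id <- table(kinase_substrate$id)
print(rows_per_id)
```

If dup_rows is non-empty on the real data, dropping exact duplicates with unique() before reshaping may be worth considering, depending on whether repeated pairs are meaningful.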

I also tried using spread:

kinase_substrate_wide <- spread(kinase_substrate, id, sequence)

But this fails with an error about duplicate identifiers:

> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253,     285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490

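How can I use either of the functions above to convert the data to wide format, with each id as a column that holds its corresponding sequences?

Thanks in advance.

Edit #1

Following David's suggestion in the comments to include an index got me there: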

reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")

This results in:

   indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1     1       TPSIAsDISLP   SVPSSsPGTSV         IHGSDsVKSAE   DIDESsPGTEW  VDRLQsEPESI   AQAPSsPRVTE      GAQAPsSPRVT   AQERPsQAAPA       NIDNLsPKASH
2     2       IASDIsLPIAT   EGCQGsPQRRG         KLSGLsFKRNR   EKKEEsEESDD  DNRVPsPPPTG   PAEVKsPEKAK      DESTGsIAKRL   RSRTPsASNDD       FNYNPsPRKSS
5     3       DICEDsDIDGD   TLNSGsPEKTC         TALAPsTMKIK   EESEEsDDDMG  PDTKDsPVCPH   QKPAAsPRPRR      IVENLsSRCSW   KQKVDsLLENL       SSGAKsPSKSG
7     4       TFEDLsDVEGG   HVAVSsPTPET         VAKRLsLTMGG   MNSSIsSGSGS  LKVEGsPTEEA   DFTCGsPTAAG      YPVSPsDKVLI   RALRAsESGI_       FPDDLsLDHSD
16    5       PRSGRsPTGNT   TEVPRsPKHAH         EKLVLsKLYEE   RPTSIsWDGLD  ESERGsGSQSS   SDTVTsPQRAG      EKKVVsLNGEL   PGSPLsSQPVL       YSDSIsPFNKS
29    6       MSDTGsPGMQR   KYSPTsPTYSP         EILNRsPRNRK   KNRPTsISWDG         <NA>   GRGAEsPFEEK      LVNSAsAQKRS   SSKTAsLPGYG       PSRTAsFSESR
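The same approach can be reproduced on the six sample rows shown at the top of the question. This is only a sketch: it builds the index with seq_along over a numeric vector rather than seq over a character vector, which is a substitution made here so that indx stays numeric, and the data frame is typed in by hand.

```r
# The six sample rows from the question, entered by hand
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Running counter within each id: 1, 2, 3, ... in order of appearance
df$indx <- ave(seq_along(df$id), df$id, FUN = seq_along)

# indx identifies the output row, id supplies the output columns
wide <- reshape(df, idvar = "indx", timevar = "id", direction = "wide")
print(wide)
```

Because each (indx, id) pair is now unique, reshape has exactly one value per cell and fills the shorter columns with NA.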

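Edit #2

Is there a way to stop the reshape function from putting "sequence." in front of every name? Or do I have to resort to regular expressions to rename all the columns?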
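One way to sidestep the prefix might be to skip reshape and build the wide table directly: split the sequences by id, pad each group with NA to the same length, and bind the groups as columns. A minimal base R sketch under that assumption, using the six sample rows from the question (by_id, n_max and wide are names introduced here for illustration):

```r
# The six sample rows from the question, entered by hand
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# One character vector of sequences per id, in order of first appearance
by_id <- split(df$sequence, factor(df$id, levels = unique(df$id)))

# Pad every vector with NA up to the longest one, then bind them as columns
n_max <- max(lengths(by_id))
wide <- as.data.frame(
  lapply(by_id, function(s) c(s, rep(NA_character_, n_max - length(s)))),
  stringsAsFactors = FALSE
)
print(wide)
```

The column names come straight from the id values, so no renaming step is needed, although as.data.frame may alter names that are not syntactically valid in R.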

Edit #3

Use gsub to remove "sequence." from the column names and assign the result to a variable:

new_col_names <- gsub("^sequence\\.", "", names(DF))

Then apply new_col_names to the data frame:

colnames(DF) <- new_col_names
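In a regular expression an unescaped "." matches any character, so how the pattern is written matters when stripping the prefix. A small sketch comparing an anchored, escaped pattern with an unescaped one; the column name "sequenceX.demo" is hypothetical and included only to show the difference:

```r
# Example column names; "sequenceX.demo" is a made-up name for illustration
col_names <- c("indx", "sequence.CK1alpha", "sequence.CDK1", "sequenceX.demo")

# Anchored and escaped: removes only a literal leading "sequence."
strict <- sub("^sequence\\.", "", col_names)
print(strict)

# Unescaped dot matches any character, so "sequenceX" is stripped as well
loose <- gsub("sequence.", "", col_names)
print(loose)
```

For the column names shown in Edit #1 both forms give the same result; the stricter pattern only matters if some other name happens to contain "sequence" followed by an arbitrary character.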

Thank you all for your help!

0 answers:

No answers