I have already looked at the following questions on Stack Overflow, along with other linked R help resources:
R: Reshape Data Long to Wide - understanding reshape parameters
Reshape long to wide with multiple groupings
How to reshape data from long to wide format?
http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
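I have long-format data (two columns, 6200 entries) with non-unique names in the column "id" and a different sequence in each row of the column "sequence". I want to reshape it into wide format, where the column headers are the "id" values and all of the sequences belonging to each id are listed underneath it.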
id sequence
1 CK1alpha TPSIAsDISLP
2 CK1alpha IASDIsLPIAT
3 CDK1 SVPSSsPGTSV
4 CDK1 EGCQGsPQRRG
5 CK1alpha DICEDsDIDGD
6 PKCepsilon IHGSDsVKSAE
What I would like to get:
id        CK1alpha     CDK1         PKCepsilon
sequence  TPSIAsDISLP  SVPSSsPGTSV
          IASDIsLPIAT  EGCQGsPQRRG
          DICEDsDIDGD
I tried using reshape:
kinase_sub_wide <- reshape(kinase_substrate, idvar = "id", timevar = "sequence", direction = "wide")
However, I get warning messages about multiple rows matching:
Warning messages:
1: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=TPSIAsDISLP: first taken
2: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=IASDIsLPIAT: first taken
3: In reshapeWide(data, idvar = idvar, timevar = timevar, ... :
multiple rows match for sequence=RSQSRsNSPLP: first taken
I also tried using spread:
kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
But that fails with an error about duplicate identifiers:
> kinase_substrate_wide <- spread(kinase_substrate, id, sequence)
Error: Duplicate identifiers for rows (1812, 1813, 4469), (906, 3349), (253, 285, 2114, 2174, 3022, 4385, 4501), (155, 203, 218, 261, 316, 542, 682, 1021, 1123, 1238, 1492, 1919, 1938, 1997, 2064, 2139, 2323, 2387, 2597, 2826, 3058, 3377, 3899, 4024, 4135, 4241, 4314, 4617, 4733, 5055, 5289, 5467, 5726, 5952, 6165), (72, 272, 749, 1100, 2792, 3573, 3858, 4254, 4257), (209, 548, 637, 653, 1034, 1038, 1213, 1387, 1445, 1475, 1476, 1692, 1735, 2635, 3180, 4005, 4661, 4988, 5672, 5870, 6042), (21, 1802), (23, 24, 30, 49, 60, 86, 122, 127, 137, 177, 182, 227, 250, 260, 268, 270, 299, 347, 356, 361, 400, 424, 425, 448, 483, 488, 494, 509, 510, 512, 522, 523, 524, 540, 559, 572, 612, 614, 616, 622, 720, 750, 774, 794, 816, 820, 829, 866, 868, 912, 916, 918, 940, 946, 955, 962, 984, 992, 1004, 1013, 1054, 1055, 1070, 1073, 1083, 1086, 1105, 1140, 1154, 1164, 1179, 1222, 1228, 1230, 1284, 1295, 1316, 1318, 1333, 1334, 1348, 1356, 1375, 1383, 1389, 1390, 1406, 1421, 1444, 1458, 1473, 1474, 1490
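As far as I can tell, both failures have the same cause: each id occurs many times, so the id by itself cannot identify a single row of the wide table. A minimal sketch for checking this, using only the six sample rows above (the name `df` is just my stand-in for kinase_substrate):

```r
# Six sample rows from the question; `df` is a placeholder name
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Number of rows per id; any count above 1 means the id repeats
id_counts <- table(df$id)
print(id_counts)

# Position of the first repeated id (0 would mean every id is unique)
first_dup <- anyDuplicated(df$id)
print(first_dup)
```

Any id with a count above 1 needs some extra key to tell its rows apart before the data can be spread into columns.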
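How can I use either of the functions above to convert the data to wide format, with each sequence placed in the column for its corresponding id?
Thanks in advance.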
Edit #1
Following David's suggestion in the comments to include an index got me there:
reshape(transform(df, indx = ave(as.character(id), id, FUN = seq)), idvar = "indx", timevar = "id", direction = "wide")
This results in:
indx sequence.CK1alpha sequence.CDK1 sequence.PKCepsilon sequence.GRK2 sequence.ICK sequence.CDK5 sequence.PKCbeta sequence.PAK1 sequence.GSK3beta
1 1 TPSIAsDISLP SVPSSsPGTSV IHGSDsVKSAE DIDESsPGTEW VDRLQsEPESI AQAPSsPRVTE GAQAPsSPRVT AQERPsQAAPA NIDNLsPKASH
2 2 IASDIsLPIAT EGCQGsPQRRG KLSGLsFKRNR EKKEEsEESDD DNRVPsPPPTG PAEVKsPEKAK DESTGsIAKRL RSRTPsASNDD FNYNPsPRKSS
5 3 DICEDsDIDGD TLNSGsPEKTC TALAPsTMKIK EESEEsDDDMG PDTKDsPVCPH QKPAAsPRPRR IVENLsSRCSW KQKVDsLLENL SSGAKsPSKSG
7 4 TFEDLsDVEGG HVAVSsPTPET VAKRLsLTMGG MNSSIsSGSGS LKVEGsPTEEA DFTCGsPTAAG YPVSPsDKVLI RALRAsESGI_ FPDDLsLDHSD
16 5 PRSGRsPTGNT TEVPRsPKHAH EKLVLsKLYEE RPTSIsWDGLD ESERGsGSQSS SDTVTsPQRAG EKKVVsLNGEL PGSPLsSQPVL YSDSIsPFNKS
29 6 MSDTGsPGMQR KYSPTsPTYSP EILNRsPRNRK KNRPTsISWDG <NA> GRGAEsPFEEK LVNSAsAQKRS SSKTAsLPGYG PSRTAsFSESR
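For anyone who wants to reproduce this on a small scale, here is a sketch of the same idea on just the six sample rows from the top of the question. The data frame `df` and the integer version of the `indx` counter are my own simplifications, not David's exact code:

```r
# Six sample rows from the question; `df` is a placeholder name
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# Running counter within each id, so that (id, indx) is unique per row
df$indx <- ave(seq_along(df$id), df$id, FUN = seq_along)

# indx identifies the output rows; id supplies the output columns
wide <- reshape(df, idvar = "indx", timevar = "id", direction = "wide")
print(wide)
```

Ids with fewer sequences than the longest one are padded with NA, which matches the <NA> in the full output above.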
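Edit #2
Is there a way to stop the reshape function from putting "sequence." in front of every name? Or do I have to resort to regular expressions to rename all of the columns?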
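I am not certain whether reshape itself can drop the prefix. One base R alternative that sidesteps the question is to split the sequences by id and pad each group to the same length. This is only a sketch on the six sample rows, and `df`, `by_id` and `n_max` are names I made up for illustration:

```r
# Six sample rows from the question; `df` is a placeholder name
df <- data.frame(
  id = c("CK1alpha", "CK1alpha", "CDK1", "CDK1", "CK1alpha", "PKCepsilon"),
  sequence = c("TPSIAsDISLP", "IASDIsLPIAT", "SVPSSsPGTSV",
               "EGCQGsPQRRG", "DICEDsDIDGD", "IHGSDsVKSAE"),
  stringsAsFactors = FALSE
)

# One character vector per id, keeping ids in order of first appearance
by_id <- split(df$sequence, factor(df$id, levels = unique(df$id)))

# Indexing past the end of a vector yields NA, which pads the short groups
n_max <- max(lengths(by_id))
padded <- lapply(by_id, function(x) x[seq_len(n_max)])

# Column names come straight from the ids, so there is no prefix to remove
wide <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
print(wide)
```

Here check.names = FALSE keeps the id values unchanged as column names, which matters if some ids are not syntactically valid R names.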
Edit #3
I used gsub to remove the "sequence." prefix from the column names and assigned the result to a variable:
new_col_names <- gsub("^sequence\\.", "", names(DF))
Then I applied new_col_names to the data frame:
colnames(DF) <- new_col_names
Thank you all for your help!