My data looks like this:
df <- read.table(header = TRUE, text =
"GeneID Gene_Name Species Paralogues Domains Functional_Diversity
1234 DDR1 hsapiens 14 2 8.597482
5678 CSNK1E celegans 70 4 8.154788
9104 FGF1 Chicken 3 0 5.455874
4575 FGF1 hsapiens 4 6 6.745845")
I need it to look like this:
Gene_Name hsapiens celegans ggalus
DDR1 8.597482 NA NA
CSNK1E NA 8.154788 NA
FGF1 6.745845 NA 5.455874
I tried using:
library(tidyverse)
df %>%
  select(Gene_Name, Species, Functional_Diversity) %>%
  spread(Species, Functional_Diversity)
My actual data has 130,000 rows (about 14,000 unique gene names, many of them repeated) covering 9 species.
When I apply this approach to my actual data, I get:
Error: Duplicate identifiers for rows (16691, 19988), (20938, 21033), (1232, 21150), (2763, 21465), (1911, 20844), (17274, 17657, 18293, 18652, 18726, 19006, 19025), (496, 22555), (17227, 17608, 18211, 18605, 18676, 18967, 19002), (13569, 21807), (10261, 21014, 21607), (20816, 21553), (2244, 22025), (6194, 21910), (12217, 21555), (2936, 21078), (16484, 20911), (12216, 21851), (9289, 21791), (10340, 21752), (1714, 22077), (13216, 22618), (6076, 22371), (14731, 21717), (160, 22472), (11553, 22635), (17183, 17583, 18510, 18608, 18661, 18896, 19108), (138, 20028), (17185, 17584, 18330, 18415, 18500, 18981, 19063), (9726, 22440), (17238, 17617, 18905, 18960, 18996, 19134), (1638, 21645), (4631, 20821), (9162, 22463), (319, 20900), (13600, 22227), (9312, 20011), (14825, 21711, 21764), (3381, 21134), (505, 21133), (5954, 20013), (5948, 21313), (17233, 17612, 18187, 18311, 18411, 18708, 18980), (16953, 20902, 21845), (20710, 22477), (20519, 20973), (10204, 21197, 21213), (2933, 20707), (4302,
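Answer 0 (score: 1)
To see which rows have "duplicate identifiers", i.e. the same combination of Gene_Name and Species appearing more than once, you can use...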
df %>%
  group_by(Gene_Name, Species) %>%
  mutate(n = n()) %>%
  filter(n > 1)
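If you only want a summary of which Gene_Name/Species pairs are duplicated, rather than every offending row, a count-based variant may be easier to read. This is a minimal sketch on hypothetical toy data (the `toy` data frame and its second FGF1/hsapiens value are made up for illustration and are not from the question):

```r
library(dplyr)

# Hypothetical toy data: FGF1/hsapiens appears twice, so it is a duplicate identifier
toy <- read.table(header = TRUE, text =
"Gene_Name Species Functional_Diversity
DDR1 hsapiens 8.597482
FGF1 hsapiens 6.745845
FGF1 hsapiens 6.100000
FGF1 Chicken 5.455874")

# Count rows per Gene_Name/Species pair and keep only pairs that occur more than once
dups <- toy %>%
  count(Gene_Name, Species) %>%
  filter(n > 1)

print(dups)
```

Here `dups` has one row per duplicated pair, with `n` giving the number of times it occurs.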
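To make sure spread works even when you have rows with duplicate identifiers, you can add a row-number column, which guarantees that every row is unique...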
df %>%
  select(Gene_Name, Species, Functional_Diversity) %>%
  mutate(row = row_number()) %>%
  spread(Species, Functional_Diversity)
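Note that with the row-number column the result keeps one row per original row, so a gene measured in several species is spread over several mostly-NA rows. If you would rather have one row per gene, a different technique is to collapse duplicates while reshaping. The sketch below assumes that averaging duplicate values is acceptable for your analysis (that is an assumption, not something stated in the question), uses `pivot_wider()` from tidyr 1.0.0 or later instead of `spread()`, and runs on hypothetical toy data:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: FGF1/hsapiens appears twice
toy <- read.table(header = TRUE, text =
"Gene_Name Species Functional_Diversity
DDR1 hsapiens 8.597482
FGF1 hsapiens 6.745845
FGF1 hsapiens 6.100000
FGF1 Chicken 5.455874")

# Reshape to one row per gene and one column per species;
# duplicated Gene_Name/Species pairs are averaged (assumed to be acceptable)
wide <- toy %>%
  select(Gene_Name, Species, Functional_Diversity) %>%
  pivot_wider(
    names_from  = Species,
    values_from = Functional_Diversity,
    values_fn   = list(Functional_Diversity = mean)
  )

print(wide)
```

Swapping `mean` for another summary function such as `max` or `first` changes how duplicates are resolved; which one is appropriate depends on why the duplicates exist in the first place.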