I have a data.frame
with 2 columns, in which the values in the second column are repeated. For example:
HUGO Cell
1 CD28 T cells
2 CD3D T cells
3 CD3G T cells
4 CD8A lymphocytes
5 EOMES lymphocytes
6 FGFBP2 lymphocytes
7 GNLY lymphocytes
8 NCR1 NK cells
9 PTGDR NK cells
10 SH2D1B NK cells
I want all values in column HUGO that correspond to each unique name in column Cell to be collected into a list following that name.
For example:
T cells: CD28 CD3D CD3G
lymphocytes: CD8A EOMES FGFBP2 GNLY
...
I tried
reshape(data.frame, timevar = "HUGO",idvar = "Cell",direction = "wide")
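but it only returns the number of values for each name in the Cell column.
Answer 0 (score: 3)
Here are some possibilities, depending on what you need. The first 5 use no packages.
1) aggregate / c. This gives a data frame whose second column holds character vectors of HUGO names.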
aggregate(HUGO ~ Cell, DF, c)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
2) aggregate / toString. This gives a data frame whose second column contains, for each Cell, a single string of its HUGO names separated by commas.
aggregate(HUGO ~ Cell, DF, toString)
giving:
Cell HUGO
1 lymphocytes CD8A, EOMES, FGFBP2, GNLY
2 NK cells NCR1, PTGDR, SH2D1B
3 T cells CD28, CD3D, CD3G
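To print the groups in the "name: values" layout shown in the question, one possible base R sketch is given below. It is not part of the original answer; the variable names `joined` and `out_lines` are chosen here for illustration, and the input is rebuilt inline so the snippet runs on its own.

```r
# Rebuild the example input so this snippet is self-contained.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c(rep("T cells", 3), rep("lymphocytes", 4), rep("NK cells", 3)),
  stringsAsFactors = FALSE
)

# Collapse the HUGO names of each Cell into one space-separated string.
joined <- tapply(DF$HUGO, DF$Cell, paste, collapse = " ")

# Prefix each string with its Cell name and print one group per line.
out_lines <- paste0(names(joined), ": ", joined)
cat(out_lines, sep = "\n")
```

The groups are printed in the sorted order of the Cell names, which can differ between locales.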
3) unstack. This gives a list with one component per Cell, each component being the HUGO names for that Cell.
unstack(DF)
giving:
$lymphocytes
[1] "CD8A" "EOMES" "FGFBP2" "GNLY"
$`NK cells`
[1] "NCR1" "PTGDR" "SH2D1B"
$`T cells`
[1] "CD28" "CD3D" "CD3G"
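A similar named list can probably also be built with base R's `split`, which takes the values and the grouping column explicitly rather than relying on column order. This is an alternative sketch, not something the original answer proposes; `by_cell` is a name chosen here, and the input is rebuilt inline so the snippet is self-contained.

```r
# Rebuild the example input so this snippet is self-contained.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c(rep("T cells", 3), rep("lymphocytes", 4), rep("NK cells", 3)),
  stringsAsFactors = FALSE
)

# Split the HUGO values into one character vector per distinct Cell.
by_cell <- split(DF$HUGO, DF$Cell)

# Look up the genes for a single cell type by name.
by_cell[["T cells"]]
```

Naming both columns in the call makes the grouping explicit, which may be easier to read when the data frame has more than two columns.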
4) tapply. This produces a matrix whose rows are the Cells and whose columns are the sequence numbers of the HUGO names within each Cell.
DF2 <- transform(DF, seq = ave(seq_along(HUGO), Cell, FUN = seq_along))
tapply(DF2$HUGO, DF2[-1], c)
giving:
seq
Cell 1 2 3 4
lymphocytes "CD8A" "EOMES" "FGFBP2" "GNLY"
NK cells "NCR1" "PTGDR" "SH2D1B" NA
T cells "CD28" "CD3D" "CD3G" NA
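The wide layouts in this answer depend on a within-group sequence number: each HUGO name is numbered 1, 2, 3, ... inside its Cell. The sketch below computes that column on its own so it can be inspected; `seq_in_group` is a name chosen here for illustration, and the input is rebuilt inline.

```r
# Rebuild the example input so this snippet is self-contained.
DF <- data.frame(
  HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
           "FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"),
  Cell = c(rep("T cells", 3), rep("lymphocytes", 4), rep("NK cells", 3)),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each Cell group and returns the results
# in the original row order, so numbering restarts at 1 for each Cell.
seq_in_group <- ave(seq_along(DF$HUGO), DF$Cell, FUN = seq_along)

# Show the sequence number next to the rows it belongs to.
cbind(DF, seq = seq_in_group)
```

Because the largest group (lymphocytes) has four names, the wide results have four value columns, and the shorter groups are padded with NA.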
5) reshape. This uses DF2 from the previous alternative
and reshape
to give a data frame:
reshape(DF2, timevar = "seq", idvar = "Cell", dir = "wide")
giving:
Cell HUGO.1 HUGO.2 HUGO.3 HUGO.4
1 T cells CD28 CD3D CD3G <NA>
4 lymphocytes CD8A EOMES FGFBP2 GNLY
8 NK cells NCR1 PTGDR SH2D1B <NA>
6) spread. This gives an object of class "tbl_df"
as output (it is a subclass of "data.frame"
).
library(dplyr)
library(tidyr)
DF %>%
group_by(Cell) %>%
mutate(seq = 1:n()) %>%
ungroup() %>%
spread(seq, HUGO)
giving:
Cell 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
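7) read.zoo. read.zoo
gives a zoo object whose times are the Cells.
Since the times are really character strings, we use FUN = identity
to avoid having them interpreted. fortify.zoo
converts the result to a data frame. DF2
is from above.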
library(zoo)
fortify.zoo(read.zoo(DF2, split = "seq", index = "Cell", FUN = identity))
giving:
Index 1 2 3 4
1 lymphocytes CD8A EOMES FGFBP2 GNLY
2 NK cells NCR1 PTGDR SH2D1B <NA>
3 T cells CD28 CD3D CD3G <NA>
8) dcast. This gives a data.table as output.
library(data.table)
DT <- data.table(DF)
DT[, seq:=1:.N, by = Cell]
dcast(DT, Cell ~ seq, value.var = "HUGO")
giving:
Cell 1 2 3 4
1: NK cells NCR1 PTGDR SH2D1B NA
2: T cells CD28 CD3D CD3G NA
3: lymphocytes CD8A EOMES FGFBP2 GNLY
Note: The input DF used above, in reproducible form:
DF <- structure(list(HUGO = c("CD28", "CD3D", "CD3G", "CD8A", "EOMES",
"FGFBP2", "GNLY", "NCR1", "PTGDR", "SH2D1B"), Cell = c("T cells",
"T cells", "T cells", "lymphocytes", "lymphocytes", "lymphocytes",
"lymphocytes", "NK cells", "NK cells", "NK cells")), .Names = c("HUGO",
"Cell"), class = "data.frame", row.names = c(NA, -10L))