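Splitting the contents of columns into two rows for conversion to STRUCTURE format

Time: 2016-01-18 19:53:04

Tags: r dataframe data.table

I am trying to split the contents of each column into two rows while duplicating the row name. Each variable holds only a two-digit number (11, 12, 13, 14, 21, 22, etc.) or NA. The goal is to convert the data to STRUCTURE format, a common population genetics format.

I have this: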

population      X354045  X430045   X995019
Crater          <NA>     11        22
Teton           11       31        11

I would like to have this:

population      X354045  X430045   X995019
Crater          <NA>     1         2
Crater          <NA>     1         2
Teton           1        3         1
Teton           1        1         1

3 answers:

Answer 0 (score: 2)

Since this is a data.table question, I would simply suggest the built-in tstrsplit function.

Reading your data:

library(data.table)
DT <- fread('population      X354045  X430045   X995019
Crater          NA     11        22
                 Teton           11       31        11')

Solution (if you have a data.frame, use setDT(DT) to convert it to a data.table):

DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ""))), by = population]
#    population X354045 X430045 X995019
# 1:     Crater      NA       1       2
# 2:     Crater      NA       1       2
# 3:      Teton       1       3       1
# 4:      Teton       1       1       1

Answer 1 (score: 1)

OK, here is how I would do it. Let's create some data:

vector <- c(10, 11, 12, NA, 13, 14, 15)

First, we need a function that splits each two-digit number into its two digits (and splits NA into two NAs):

as.numeric(sapply(vector, function(x) (x %% c(1e2,1e1)) %/% c(1e1,1e0)))
# 1  0  1  1  1  2 NA NA  1  3  1  4  1  5
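The modulo and integer-division trick above can also be spelled out step by step. The following is only a sketch of the same arithmetic; the helper name split_digits is hypothetical and not part of the original answer.

```r
# Hypothetical helper: split each two-digit number into its tens and units digits.
split_digits <- function(v) {
  tens  <- v %/% 10   # integer division keeps the first digit
  units <- v %% 10    # the remainder keeps the second digit
  # rbind() stacks the two digits per input; as.numeric() reads the matrix
  # column by column, so each input contributes (tens, units) in order.
  as.numeric(rbind(tens, units))
}

split_digits(c(10, 11, 12, NA, 13, 14, 15))
# 1  0  1  1  1  2 NA NA  1  3  1  4  1  5
```

NA propagates through both operations, so a missing genotype becomes two NAs without any special handling.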

Now all we have to do is apply it to each relevant column:

DF <- data.frame(population = c("Crater", "Teton"), X354045 = c(NA, 11), X430045 = c(11, 31), X995019 = c(22, 11))
DF2 <- apply(DF[-1], 2, function(y) as.numeric(sapply(y, function(x) (x %% c(1e2,1e1)) %/% c(1e1,1e0))))

Finally, we combine it with the new population column:

population <- as.character(rep(DF$population, each = 2))
DF3 <- cbind(population, data.frame(DF2))
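For reuse, these steps could be wrapped in a single function. This is a minimal sketch, not the answer's own code: the name to_two_rows is hypothetical, and it assumes the first column is the population label and every other column holds two-digit numbers or NA.

```r
# Hypothetical wrapper around the steps above.
to_two_rows <- function(df) {
  # Split every genotype column into interleaved (first digit, second digit) values
  digits <- lapply(df[-1], function(col) as.numeric(rbind(col %/% 10, col %% 10)))
  # Repeat each population label twice so it lines up with the split digits
  data.frame(population = rep(as.character(df[[1]]), each = 2), digits,
             stringsAsFactors = FALSE)
}

DF <- data.frame(population = c("Crater", "Teton"),
                 X354045 = c(NA, 11), X430045 = c(11, 31), X995019 = c(22, 11))
res <- to_two_rows(DF)
```

Here res has four rows, two per population, with numeric digit columns.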

Answer 2 (score: 1)

dd <- read.table(header = TRUE, text = 'population      X354045  X430045   X995019
Crater          NA     11        22
Teton           11       31        11')

nr <- nrow(dd)
# repeat every row twice; rep(1:2, each = nr) only gives this when nr is 2
dd <- dd[rep(seq_len(nr), each = 2), ]

#     population X354045 X430045 X995019
# 1       Crater      NA      11      22
# 1.1     Crater      NA      11      22
# 2        Teton      11      31      11
# 2.1      Teton      11      31      11


dd[, -1] <- lapply(dd[, -1], function(x) {
  idx <- (seq_along(x) %% 2 == 0) + 1L
  substr(x, idx, idx)
})

#     population X354045 X430045 X995019
# 1       Crater    <NA>       1       2
# 1.1     Crater    <NA>       1       2
# 2        Teton       1       3       1
# 2.1      Teton       1       1       1
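To check that the substr approach carries over to more than two populations, here is a small sketch with three rows. It assumes the duplication index is written as rep(seq_len(nr), each = 2); the third population "Lamar" and its values are made up for illustration.

```r
# Three populations; "Lamar" and its genotypes are invented example data.
dd3 <- data.frame(population = c("Crater", "Teton", "Lamar"),
                  X354045 = c(NA, 11, 34), X430045 = c(11, 31, 12))
nr <- nrow(dd3)

long <- dd3[rep(seq_len(nr), each = 2), ]  # rows 1,1,2,2,3,3
pos  <- rep(1:2, times = nr)               # character position 1,2,1,2,1,2

# substr() coerces the numbers to character and is vectorized over start/stop
long[-1] <- lapply(long[-1], function(x) substr(x, pos, pos))
```

Note that the resulting columns are character; wrapping the substr call in as.integer would turn them back into numbers if that is needed.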

Or simply

dd <- dd[rep(seq_len(nr), each = 2), ]
dd[, -1] <- lapply(dd[, -1], function(x)
  Vectorize(substr)(x, rep(1:2, nr), rep(1:2, nr)))

would work as well.

Thanks to @DavidArenburg

The same idea in data.table:
library('data.table')
dd <- read.table(header = TRUE, text = 'population      X354045  X430045   X995019
    Crater          NA     11        22
                 Teton           11       31        11')


setDT(dd)[rep(seq_len(.N), each = 2), lapply(.SD, substr, 1:2, 1:2), by = population]

#    population X354045 X430045 X995019
# 1:     Crater      NA       1       2
# 2:     Crater      NA       1       2
# 3:      Teton       1       3       1
# 4:      Teton       1       1       1

Or similarly, but avoiding the by part:

dd <- setDT(dd)[rep(seq_len(.N), each = 2)]
dd[, 2:4 := dd[, lapply(.SD, substr, 1:2, 1:2), .SDcols = -1]]

This should be very fast and efficient if you are working with a large data set.