How to replace characters in a large data frame in R

Time: 2014-07-14 13:17:50

Tags: r

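I have a large data frame with dimensions 42500 x 18500. I am trying to replace some characters, but the data is too large to process as a matrix. Could anyone help? Thanks. The following script only works on my computer for a smaller data frame (e.g. 42500 x 3000).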

   geno <- read.table("datatable.txt",sep="\t", head=TRUE)
   M <- as.matrix(geno[,c(3:dim(geno)[2])])        
   M[M == "U"] <- "N"                  ## Replace "U" with "N"
   H <- which(M == "H", arr.ind=TRUE)  ## Identify the Hs
   M[H] <- geno[cbind(H[, "row"], 2)]    ## Replace with H values from "type" column
   dat <- cbind(geno[1], M)

The data frame looks like this:

   SNP_ID Type    Line1   Line2   Line3   Line4   Line5   Line6
   SNP1   K   T   G   T   U   T   T
   SNP2   M   A   U   A   A   H   C
   SNP3   M   A   A   A   C   A   A
   SNP4   K   T   H   T   G   T   T
   SNP5   K   U   T   T   T   T   H
   SNP6   M   A   U   A   A   C   A

When I run the script above on the whole data frame, I get this error:

Error: cannot allocate vector of size 8.0 Gb
In addition: Warning messages:
1: In structure(.Call(C_objectSize, x), class = "object_size") :
  Reached total allocation of 24573Mb: see help(memory.size)...

2 answers:

Answer 0: (score: 2)

You can use mutate_each_q from the dplyr package to do this elegantly:

library(dplyr)
geno %>% mutate_each_q(funs(ifelse(.=="U","N",ifelse(.=="H",Type,.))),names(geno)[-(1:2)])
  SNP_ID Type Line1 Line2 Line3 Line4 Line5 Line6
1   SNP1    K     T     G     T     N     T     T
2   SNP2    M     A     N     A     A     M     C
3   SNP3    M     A     A     A     C     A     A
4   SNP4    K     T     K     T     G     T     T
5   SNP5    K     N     T     T     T     T     K
6   SNP6    M     A     N     A     A     C     A
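
For readers without dplyr, or on a newer dplyr release where (as far as I know) mutate_each_q is no longer available, the same nested ifelse can be applied column by column in base R. This is only a sketch: it assumes the columns were read as character (hence colClasses="character"), and it rebuilds the sample data inline so that it runs on its own.

```r
# Sample data from the question, read as character so that "T" is not
# interpreted as a logical value and no factors are created.
geno <- read.table(text = "
SNP_ID Type Line1 Line2 Line3 Line4 Line5 Line6
SNP1   K    T     G     T     U     T     T
SNP2   M    A     U     A     A     H     C
SNP3   M    A     A     A     C     A     A
SNP4   K    T     H     T     G     T     T
SNP5   K    U     T     T     T     T     H
SNP6   M    A     U     A     A     C     A",
  header = TRUE, colClasses = "character")

# Apply the nested ifelse to every column except SNP_ID and Type:
# "U" becomes "N", "H" becomes the Type value of the same row,
# and everything else is left unchanged.
geno[-(1:2)] <- lapply(geno[-(1:2)], function(x) {
  ifelse(x == "U", "N", ifelse(x == "H", geno$Type, x))
})

print(geno)
```

Because lapply works on one column at a time, each intermediate vector has one entry per row of the table rather than one entry per cell.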

答案 1 :(得分:1)

Try this.

d<-read.table(text="
SNP_ID  Type    Line1   Line2   Line3   Line4   Line5   Line6
    SNP1    K   T   G   T   U   T   T
    SNP2    M   A   U   A   A   H   C
    SNP3    M   A   A   A   C   A   A
    SNP4    K   T   H   T   G   T   T
    SNP5    K   U   T   T   T   T   H
    SNP6    M   A   U   A   A   C   A", header=TRUE, colClasses="character")

# d == "U" yields a logical matrix covering the whole data frame
d[d == "U"] <- "N"
# Row/column positions of every "H"
H <- which(d == "H", arr.ind=TRUE)
# Replace each "H" with the Type value from the same row
d[H] <- d[H[, "row"], "Type"]

Output:

  SNP_ID Type Line1 Line2 Line3 Line4 Line5 Line6
1   SNP1    K     T     G     T     N     T     T
2   SNP2    M     A     N     A     A     M     C
3   SNP3    M     A     A     A     C     A     A
4   SNP4    K     T     K     T     G     T     T
5   SNP5    K     N     T     T     T     T     K
6   SNP6    M     A     N     A     A     C     A

Edit: You may want to import your data as follows to make sure you are not reading in factors, which may use unnecessary memory:

d <- read.table("datatable.txt", sep="\t", header=TRUE, colClasses="character")
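
If the full-table comparison is still too large for memory, one option is to process the columns one at a time, so that the temporary logical vectors have one entry per row instead of one entry per cell. The loop below is a minimal sketch of that idea, not a tested solution for the 42500 x 18500 case: the data frame itself must still fit in memory, and the sample data is included inline only to make the example runnable.

```r
# Sample data from the question, read as character (no factors).
d <- read.table(text = "
SNP_ID Type Line1 Line2 Line3 Line4 Line5 Line6
SNP1   K    T     G     T     U     T     T
SNP2   M    A     U     A     A     H     C
SNP3   M    A     A     A     C     A     A
SNP4   K    T     H     T     G     T     T
SNP5   K    U     T     T     T     T     H
SNP6   M    A     U     A     A     C     A",
  header = TRUE, colClasses = "character")

# Visit every column except the first two (SNP_ID and Type).
for (j in seq_len(ncol(d))[-(1:2)]) {
  col <- d[[j]]                # one column as a character vector
  col[col == "U"] <- "N"       # replace "U" with "N"
  is_h <- col == "H"           # rows where this column holds "H"
  col[is_h] <- d$Type[is_h]    # take the Type value from the same row
  d[[j]] <- col                # write the column back
}

print(d)
```

For the real file, the read.table call with colClasses="character" shown above would replace the inline sample data.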