Question

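I know the 'big matrix problem' is a recurring topic here, but I would like to explain my specific problem with large matrices in detail.

Strictly speaking, I want to cbind several large matrices, stored in files that follow a specific name pattern, in R. The code below shows my best attempt so far.

First, let's generate files that mimic my real matrices: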

# The df1
df1 <- '######## infx infx infx
######## infx infx infx
probeset_id sample1 sample2 sample3
PR01           1       2       0
PR02           -1      2       0
PR03            2      1       1
PR04           1       2       1
PR05           2       0       1'
df1 <- read.table(text=df1, header=T, skip=2)
write.table(df1, "df1.txt", col.names=T, row.names=F, quote=F, sep="\t")

# The df2 
df2 <- '######## infx infx infx
######## infx infx infx
probeset_id sample4 sample5 sample6
PR01           2       2       1
PR02           2      -1       0
PR03            2      1       1
PR04           1       2       1
PR05           0       0       1'
df2 <- read.table(text=df2, header=T, skip=2)
write.table(df2, "df2.txt", col.names=T, row.names=F, quote=F, sep="\t")

# The dfn 
dfn <- '######## infx infx infx
######## infx infx infx
probeset_id samplen1 samplen2 samplen3
PR01           2       -1       1
PR02           1      -1       0
PR03            2      1       1
PR04           1       2       -1
PR05           0       2       1'
dfn <- read.table(text=dfn, header=T, skip=2)
write.table(dfn, "dfn.txt", col.names=T, row.names=F, quote=F, sep="\t")

Then import them into R and write them out as my expected output file:

### Importing and excluding duplicated 'probeset_id' column
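# pattern is a regular expression, not a glob; anchoring on "df" also keeps output.txt out on re-runs
calls = list.files(pattern="^df.*\\.txt$")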
library(data.table)
calls = lapply(calls, fread, header=T)
mycalls <- as.data.frame(calls)
probenc <- as.data.frame(mycalls[,1])
mycalls <- mycalls[, -grep("probe", colnames(mycalls))]
output <- cbind(probenc, mycalls)
names(output)[1] <- "probeset_id"
write.table(output, "output.txt", col.names=T, row.names=F, quote=F, sep="\t")

This is what the output looks like:

> head(output)
  probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3
1        PR01       1       2       0       2       2       1        2       -1        1
2        PR02      -1       2       0       2      -1       0        1       -1        0
3        PR03       2       1       1       2       1       1        2        1        1
4        PR04       1       2       1       1       2       1        1        2       -1
5        PR05       2       0       1       0       0       1        0        2        1

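This code does exactly what I want, but with my real data I run into the well-known R memory limits (more than 30 "df" objects of ~1.3 GB each and/or 600k rows by 100 columns each).

I read about a very interesting SQL approach ("R: how to rbind two huge data-frames without running out of memory"), but I am inexperienced with SQL and could not find a way to adapt it to my case.

Cheers,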

Answer 1

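I had misunderstood the question; the comments have now made it clear. What you need is a package like ff. It lets you work with files on the hard disk instead of loading them into RAM. Since you mention that your RAM is not enough to load all the files, this looks like the solution to your problem.

First load the files with read.table.ffdf, then merge them together as follows: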

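# Load the files in R
library(ff)

# The first argument of read.table.ffdf is x (an existing ffdf to append to),
# so the path is passed as file=. skip=2 assumes the real files still start
# with the two '########' lines. stringsAsFactors=TRUE because ff stores
# text columns as factors rather than character vectors.
df1 <- read.table.ffdf(file='df1.txt', header=TRUE, skip=2, stringsAsFactors=TRUE)
df2 <- read.table.ffdf(file='df2.txt', header=TRUE, skip=2, stringsAsFactors=TRUE)
dfn <- read.table.ffdf(file='dfn.txt', header=TRUE, skip=2, stringsAsFactors=TRUE)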

Then merge them like this:

mergedf <- do.call('ffdf', c(physical(df1), physical(df2), physical(dfn)))
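
Binding columns this way matches rows by position only, so it assumes every file lists the same probeset_id values in the same order. A minimal sketch of that check, written in base R on small in-memory stand-ins rather than on ffdf objects (`same_row_order` is a hypothetical helper, not part of ff or of the original code):

```r
# Toy stand-ins for two of the tables (same layout as df1/df2 in the question)
a <- data.frame(probeset_id = c("PR01", "PR02", "PR03"),
                sample1 = c(1, -1, 2))
b <- data.frame(probeset_id = c("PR01", "PR02", "PR03"),
                sample4 = c(2, 2, 2))

# Hypothetical helper: TRUE only if every table lists the same IDs in the same order
same_row_order <- function(...) {
  ids <- lapply(list(...), function(d) as.character(d$probeset_id))
  all(vapply(ids[-1], identical, logical(1), ids[[1]]))
}

stopifnot(same_row_order(a, b))            # safe to bind by position
merged <- cbind(a, b[, -1, drop = FALSE])  # drop the repeated ID column
```

If the check fails, the tables would have to be sorted or matched on probeset_id before binding.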

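Unfortunately I could not use your example, because read.table.ffdf does not support the text argument, but the above should work. The ff package has its own (not very complicated) syntax that you may need to get familiar with, since it works with files on the hard disk. For example, applying a function is done with ffapply, which works almost the same way as apply.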
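
One way around the missing text argument is to write the example text to a temporary file first. The sketch below does this in base R; `read.table` stands in for the file-based reader so the snippet runs without ff installed, and the variable names are made up for illustration:

```r
# Example text in the same layout as the question, including the two leading lines
example_text <- '######## infx infx infx
######## infx infx infx
probeset_id sample1 sample2 sample3
PR01           1       2       0
PR02           -1      2       0'

# Write the text to a temporary file so that a file-based reader can use it
tmp <- tempfile(fileext = ".txt")
writeLines(example_text, tmp)

# read.table() is a stand-in here; any reader that takes a file path
# could be pointed at tmp in the same way
from_file <- read.table(file = tmp, header = TRUE, skip = 2,
                        stringsAsFactors = TRUE)
unlink(tmp)  # remove the temporary file
```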

For some basic tutorials on the ff package, see here, here and here.

You can also list the functions in the package with ls("package:ff") and use the built-in help to find your way around.
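
As an alternative illustration of the same idea (working on disk instead of in RAM), the sketch below binds tab-separated files column-wise in base R by streaming them in chunks. It is not the ff approach described above. `cbind_files_chunked` is a hypothetical helper, and it assumes that all files have the same number of lines, list their rows in the same order, and keep the shared ID in the first column:

```r
# Hypothetical helper: bind tab-separated files column-wise, chunk by chunk,
# so that only chunk_lines lines per file are held in memory at a time.
cbind_files_chunked <- function(infiles, outfile, chunk_lines = 10000L) {
  ins <- lapply(infiles, file, open = "r")
  out <- file(outfile, open = "w")
  on.exit({ lapply(ins, close); close(out) })
  repeat {
    chunks <- lapply(ins, readLines, n = chunk_lines)
    n <- lengths(chunks)
    if (all(n == 0L)) break
    if (length(unique(n)) != 1L) stop("input files differ in number of rows")
    # Keep the ID column only from the first file
    chunks[-1] <- lapply(chunks[-1], function(x) sub("^[^\t]*\t", "", x))
    writeLines(do.call(paste, c(chunks, sep = "\t")), out)
  }
  invisible(outfile)
}

# Usage on two tiny files in a temporary directory
demo_dir <- tempfile("cbind_demo")
dir.create(demo_dir)
f1 <- file.path(demo_dir, "df1.txt")
f2 <- file.path(demo_dir, "df2.txt")
writeLines(c("probeset_id\tsample1", "PR01\t1", "PR02\t-1", "PR03\t2"), f1)
writeLines(c("probeset_id\tsample4", "PR01\t2", "PR02\t2", "PR03\t2"), f2)

out_path <- file.path(demo_dir, "output.txt")
cbind_files_chunked(c(f1, f2), out_path, chunk_lines = 2L)
result <- readLines(out_path)
```

Because the lines are pasted together as text, nothing is parsed into numeric columns, which keeps memory use proportional to the chunk size rather than to the file size.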

Column-binding several large matrices
