Question

I'm working with an ArrayExpress dataset to build a data frame that I can run in GenePattern.

My folder GSE11000 contains a bunch of files whose names follow this pattern:

GSM123445_samples_table.txt
GSM129995_samples_table.txt

Inside each file, the table looks like this:

Identifier     VALUE
     10001   0.12323
     10002   0.11535

I have a data frame, clinical_data, that lists all the files I want, in this pattern:

                     Data.File      Samples     OS.event
1  GSM123445_samples_table.txt    GSM123445            0
2  GSM129995_samples_table.txt    GSM129995            0
3  GSM129999_samples_table.txt    GSM129999            1
4  GSM130095_samples_table.txt    GSM130095            1

I want to create a data frame that looks like this:

     Identifier  GSM123445  GSM129995  GSM129999  GSM130095
 1       10001     0.12323    0.14523    0.22387    0.56233
 2       10002     0.11535    0.39048    0.23437   -0.12323
 3       10006     0.12323    0.35634    0.12237   -0.12889
 4       10008     0.11535    0.23454    0.21227    0.90098

This is my code:

library(dplyr)
setwd(.../GSE11000)
file_list <- clinical_data[, 1] # create a list that include Data.File
for (file in file_list){
  if (!exists("dataset")){     # if dataset not exists, create one
     dataset <- read.table(file, header=TRUE, sep="\t") #read txt file from folder
     x <- unlist(strsplit(file, "_"))[1] # extract the GSMxxxxxx from the name of files
     dataset <- rename(dataset, x = VALUE) # rename the column
  }     
  else {
     temp_dataset <- read.table(file, header=TRUE, sep="\t") # read file
     x <- unlist(strsplit(file, "_"))[1]
     temp_dataset <- rename(temp_dataset, x = VALUE)    
     dataset<-left_join(dataset, temp_dataset, "Reporter.Identifier")
     rm(temp_dataset)
  }
}

My result does not come out like that.

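This is because the rename step does not work.

Does anyone know how to fix this? Can anyone make my code more efficient?

I would also appreciate it if you could tell me how to handle this data with Bioconductor.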

Answer 1

Similar to @jdobres's answer, but using dplyr (and tidyr's spread):

First, create some sample data files:

set.seed(42)
for (fname in sprintf("GSM%s_samples_table.txt", sample(10000, size = 4))) {
  write.table(data.frame(Identifier = 10001:10004, VALUE = runif(4)),
              file = fname, row.names = FALSE)
}
file_list <- list.files(pattern = "GSM.*")
file_list
# [1] "GSM2861_samples_table.txt" "GSM8302_samples_table.txt"
# [3] "GSM9149_samples_table.txt" "GSM9370_samples_table.txt"
read.table(file_list[1], skip = 1, col.names = c("Identifier", "VALUE"))
#   Identifier     VALUE
# 1      10001 0.9346722
# 2      10002 0.2554288
# 3      10003 0.4622928
# 4      10004 0.9400145

Now process them:

library(dplyr)
library(tidyr)
mapply(function(fname, varname)
           cbind.data.frame(Samples = varname,
                            read.table(fname, skip = 1, col.names = c("Identifier", "VALUE")),
                            stringsAsFactors = FALSE),
       file_list, gsub("_.*", "", file_list), SIMPLIFY = FALSE) %>%
  bind_rows() %>%
  spread(Samples, VALUE)
#   Identifier   GSM2861   GSM8302   GSM9149   GSM9370
# 1      10001 0.9346722 0.9782264 0.6417455 0.6569923
# 2      10002 0.2554288 0.1174874 0.5190959 0.7050648
# 3      10003 0.4622928 0.4749971 0.7365883 0.4577418
# 4      10004 0.9400145 0.5603327 0.1346666 0.7191123
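
The approach above sidesteps the renaming problem by keeping the sample name in a column. If the loop from the question is kept instead, the likely reason the rename has no effect is that rename(dataset, x = VALUE) treats x as a literal column name rather than looking up the variable. A minimal sketch of two ways around this, using toy data modelled on the question's table (the names literal and renamed are illustrative only):

```r
library(dplyr)

# Toy stand-in for one sample file (values taken from the question's example table).
dataset <- data.frame(Identifier = c(10001, 10002),
                      VALUE = c(0.12323, 0.11535))
x <- "GSM123445"  # sample name held in a character variable

# Here x is taken as the literal new column name, not as a variable.
literal <- rename(dataset, x = VALUE)
names(literal)
# [1] "Identifier" "x"

# Unquoting with !! and assigning with := uses the value stored in x.
renamed <- rename(dataset, !!x := VALUE)
names(renamed)
# [1] "Identifier" "GSM123445"

# Base R alternative that needs no tidy evaluation.
names(dataset)[names(dataset) == "VALUE"] <- x
```

Either form could replace the two rename() calls in the question's loop; the base R version avoids any dependence on the dplyr version in use.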

Answer 2

It's hard to say whether this will work, since your example isn't reproducible, but this is how I would approach it.

First, read all the data files into one large data frame, creating an extra column named "sample" that holds your sample labels.

library(plyr)
df <- ddply(clinical_data, .(Data.File), function(x) {
  data.this <- read.table(x$Data.File, header=TRUE, sep="\t")
  data.this$sample <- x$Samples
  return(data.this)
})

Then use the tidyr::spread function to create a new column for each "sample", filled with the values from the "VALUE" column.
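
The spread call for this step is not shown; a minimal sketch of what it might look like, assuming the combined data frame has Identifier, sample and VALUE columns (the toy data and the name df_wide are illustrative, with values borrowed from the question's example table):

```r
library(tidyr)

# Toy long-format data standing in for the combined data frame.
df <- data.frame(
  Identifier = rep(c(10001, 10002), times = 2),
  sample     = rep(c("GSM123445", "GSM129995"), each = 2),
  VALUE      = c(0.12323, 0.11535, 0.14523, 0.39048),
  stringsAsFactors = FALSE
)

# One column per sample label, filled from VALUE.
df_wide <- spread(df, sample, VALUE)
df_wide
```

Since ddply keeps the Data.File grouping column in its result, that column would probably need to be dropped before spreading (for example with df$Data.File <- NULL), otherwise each sample's values end up on separate rows. In tidyr 1.0.0 and later, pivot_wider(df, names_from = sample, values_from = VALUE) is the recommended replacement for spread.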

How to use dplyr::rename() to put a character into a data frame?
