How to extract all information into an xls or txt file

Date: 2015-03-20 15:09:26

Tags: r bioinformatics bioconductor

I want to extract all the information for a given set of samples into an xls file. For example:

library(GEOquery)
gpl <- getGEO("GPL16791")
data <- gpl@header$sample_id
gps <- getGEO(data[1])
str(gps)

The output looks like this:

Formal class 'GSM' [package "GEOquery"] with 2 slots
  ..@ dataTable:Formal class 'GEODataTable' [package "GEOquery"] with 2 slots
  .. .. ..@ columns:'data.frame':   0 obs. of  0 variables
  .. .. ..@ table  :'data.frame':   0 obs. of  0 variables
  ..@ header   :List of 36
  .. ..$ channel_count          : chr "1"
  .. ..$ characteristics_ch1    : chr "cell type: Induced endothelial cells from cultured foreskin fibroblast cells (Stegment)"
  .. ..$ contact_address        : chr "3333 Burnet Ave"
  .. ..$ contact_city           : chr "Cincinnati"
  .. ..$ contact_country        : chr "USA"
  .. ..$ contact_department     : chr "Biomedical Informatics"
  .. ..$ contact_email          : chr "Rebekah.Karns@cchmc.org"
  .. ..$ contact_institute      : chr "Cincinnati Children's Hospital Medical Center"
  .. ..$ contact_laboratory     : chr "Bruce Aronow, PhD"
  .. ..$ contact_name           : chr "Rebekah,,Karns"
  .. ..$ contact_state          : chr "OH"
  .. ..$ contact_zip/postal_code: chr "45276"
  .. ..$ data_processing        : chr [1:4] "Trimmed sequences were generated as fastq outputs and analyzed based on the TopHat/Cufflinks pipeline based on reference annota"| __truncated__ "Gene-level expression was normalized and baselined to the 80th percentile of that sample's overall expression in GeneSpring v7."| __truncated__ "Genome_build: GRCh37/hg19" "Supplementary_files_format_and_content: Each sample has a corresponding .txt file with normalized FPKM"
  .. ..$ data_row_count         : chr "0"
  .. ..$ description            : chr "iECa"
  .. ..$ extract_protocol_ch1   : chr [1:2] "Using RNeasy Mini Kit (Qiagen), total RNA was extracted and quantitative polymerase chain reaction was performed using Taqman g"| __truncated__ "RNA-Seq–based expression analysis was carried out using RNA samples converted into individual cDNA libraries using Illumina (Sa"| __truncated__
  .. ..$ geo_accession          : chr "GSM1098572"
  .. ..$ growth_protocol_ch1    : chr "Fibroblasts were treated with Poly I:C (30ng/ml) and the medium changed to DMEM with 7.5% FBS and 7.5% knockout serum replaceme"| __truncated__
  .. ..$ instrument_model       : chr "Illumina HiSeq 2500"
  .. ..$ last_update_date       : chr "Apr 18 2013"
  .. ..$ library_selection      : chr "cDNA"
  .. ..$ library_source         : chr "transcriptomic"
  .. ..$ library_strategy       : chr "RNA-Seq"
  .. ..$ molecule_ch1           : chr "total RNA"
  .. ..$ organism_ch1           : chr "Homo sapiens"
  .. ..$ platform_id            : chr "GPL16791"
  .. ..$ relation               : chr [1:2] "SRA: http://www.ncbi.nlm.nih.gov/sra?term=SRX249507" "BioSample: http://www.ncbi.nlm.nih.gov/biosample/SAMN01978505"
  .. ..$ series_id              : chr "GSE45176"
  .. ..$ source_name_ch1        : chr "Induced endothelial cell"
  .. ..$ status                 : chr "Public on Apr 14 2013"
  .. ..$ submission_date        : chr "Mar 14 2013"
  .. ..$ supplementary_file_1   : chr "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1098nnn/GSM1098572/suppl/GSM1098572_iECa_Processed.txt.gz"
  .. ..$ supplementary_file_2   : chr "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRX/SRX249/SRX249507"
  .. ..$ taxid_ch1              : chr "9606"
  .. ..$ title                  : chr "iEC: Rep1"
  .. ..$ type                   : chr "SRA"

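I would like the output as a txt or xls file in which each row is one sample from "data" and the columns contain all of these fields, for example: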

   channel_count    characteristics_ch1                contact_address .....
1   "1"           "cell type: Induced endothelial cells       "3333 Burnet Ave"
2
.
.
.
until length of data

1 answer:

Answer 0 (score: 0)

This function now also works when variables are missing from a sample's header. I know the loop isn't very elegant, but it worked in my tests.

library(GEOquery)

gpl <- getGEO("GPL18448")
data <- gpl@header$sample_id

# Download one GSM record and return its header as a one-row character matrix
getGpsInfo <- function(x){
      gps <- getGEO(x)
      # flatten the header; multi-valued fields become name1, name2, ...
      gps <- unlist(gps@header)
      gps <- data.frame(gps, stringsAsFactors = FALSE)
      gps <- t(gps)
      # if gps has multiple rows keep only unique ones
      gps <- unique(gps)
      return(gps)
}
dat <- lapply(data, FUN = getGpsInfo)
# dat is a list with different numbers of elements per entry
varnames <- unique(unlist(lapply(dat, colnames)))
# one row per sample, one column per field seen in any sample; NA where absent
dat2 <- data.frame(matrix(NA, nrow = length(dat), ncol = length(varnames)))
colnames(dat2) <- varnames
for(i in seq_along(dat)){
      for(j in seq_along(varnames)){
            element <- which(colnames(dat[[i]]) == varnames[j])
            replacement <- dat[[i]][element]
            if (length(replacement) > 0){
                  dat2[i,j] <- replacement
            }
      }
}
# semicolon-separated file
write.table(dat2, file = "dat2.csv", row.names = TRUE, sep = ";")
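
As a possible variation on the loop above (a sketch, not the answerer's method), the alignment step can also be done by indexing each sample's fields by name, which yields NA for any field a sample lacks. The sketch below additionally collapses multi-valued fields such as relation into a single string, so that each header field maps to exactly one column as the question asks. The headers list, the flattenHeader helper, the "; " separator and the output file name are all assumptions made for illustration; in real use each element of headers would be a gps@header list obtained with getGEO().

```r
# Hypothetical stand-ins for the gps@header lists of two samples;
# in real use these would come from getGEO() as in the code above.
headers <- list(
  list(geo_accession = "GSM_A", title = "rep1",
       relation = c("SRA: x", "BioSample: y")),
  list(geo_accession = "GSM_B", title = "rep2",
       description = "field missing from the first sample")
)

# Collapse multi-valued fields so every field is exactly one string
# (the "; " separator is an arbitrary choice)
flattenHeader <- function(h) {
  vapply(h, paste, character(1), collapse = "; ")
}
flat <- lapply(headers, flattenHeader)

# Union of all field names, in order of first appearance
varnames <- unique(unlist(lapply(flat, names)))

# Indexing a named vector by name returns NA for fields that are absent
rows <- lapply(flat, function(v) unname(v[varnames]))
dat2 <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
colnames(dat2) <- varnames

# Write a tab-delimited text file, which spreadsheet programs can open
outFile <- file.path(tempdir(), "dat2.txt")
write.table(dat2, file = outFile, sep = "\t", row.names = FALSE, na = "")
```

Collapsing keeps the column names identical to the header field names, at the cost of having to split the string again if the individual values are needed later.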