Difficulty downloading and parsing a text file from FTP

Date: 2015-11-08 20:49:19

Tags: r text import ftp tab-delimited-text

t2=url("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE1nnn/GSE1000/matrix/", open = "", blocking = TRUE, encoding = getOption("encoding"))
t2
t2=t2[-2]
isOpen(t2)
t2= readLines(t2, n = 4200)
t2[4010]
summary(t2)

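With the code above I can fetch the FTP file and see the data, but I cannot do any further plotting with it.

I also cannot arrange the data into a table. Can anyone help?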

2 Answers:

Answer 0: (score: 1)

The following code reads the data without any problems:

dta <- read.csv("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid225/U00096.ptt", 
header = TRUE, skip = 2, sep = "\t")

I guess you are after a data frame like this:

> head(dta)
    Location Strand Length     PID Gene Synonym Code COG                                                Product
1   190..255      +     21 1786182 thrL   b0001    -   -                              thr operon leader peptide
2  337..2799      +    820 1786183 thrA   b0002    -   -  Bifunctional aspartokinase/homoserine dehydrogenase 1
3 2801..3733      +    310 1786184 thrB   b0003    -   -                                      homoserine kinase
4 3734..5020      +    428 1786185 thrC   b0004    -   -                                   L-threonine synthase
5 5234..5530      +     98 1786186 yaaX   b0005    -   -            DUF2502 family putative periplasmic protein
6 5683..6459      -    258 1786187 yaaA   b0006    -   - peroxide resistance protein, lowers intracellular iron

To simplify the import, I skip the first two lines of the file, which begins like this:

Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652
4140 proteins
Location    Strand  Length  PID Gene    Synonym Code    COG Product
190..255    +   21  1786182 thrL    b0001   -   -   thr operon leader peptide

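If you want to read the whole file, I suggest looking at this post. You could read the entire file, access the first two lines separately, and then import the rest into a data frame.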
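A minimal sketch of that last idea, assuming the file has the layout shown above (two free-text lines, then a tab-delimited header and rows). To keep it self-contained, the character vector `lines` stands in for the result of `readLines()` on the FTP URL; the names `lines`, `meta`, and `dta` are only illustrative.

```r
# Stand-in for readLines(<ftp url>): the first four lines of the .ptt file.
lines <- c(
  "Escherichia coli str. K-12 substr. MG1655, complete genome. - 1..4641652",
  "4140 proteins",
  "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
  "190..255\t+\t21\t1786182\tthrL\tb0001\t-\t-\tthr operon leader peptide"
)

# Keep the two free-text lines separately as metadata.
meta <- lines[1:2]

# Parse everything from the header line onward as tab-delimited text.
dta <- read.delim(text = lines[-(1:2)], header = TRUE,
                  stringsAsFactors = FALSE)

print(meta)
print(dta)
```

With the real file, `lines` would come from `readLines()` on the URL used in the `read.csv()` call above; the rest of the sketch should stay the same as long as the layout assumption holds.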

Answer 1: (score: 0)

Testing my comment:

read.delim( text=c("4350031..4351662\t-\t543\t1790567\tdcuS\tb4125\t-\t-\tsensory histidine kinase in two-component regulatory system with DcuR, regulator of anaerobic fumarate respiration"   ,                                                                                               
"4351843..4352073\t+\t76\t1790568\tyjdI\tb4126\t-\t-\tputative 4Fe-4S mono-cluster protein" ), header=FALSE)
#---------
                V1 V2  V3      V4   V5    V6 V7 V8
1 4350031..4351662  - 543 1790567 dcuS b4125  -  -
2 4351843..4352073  +  76 1790568 yjdI b4126  -  -
                                                                                                                  V9
1 sensory histidine kinase in two-component regulatory system with DcuR, regulator of anaerobic fumarate respiration
2                                                                               putative 4Fe-4S mono-cluster protein

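I suspect the first line of the file is actually a header, since that appears to be the pattern described in the README file I looked at on that FTP site, so you would probably want to drop header=FALSE. These are just lines [3883-3884].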
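As a rough illustration of dropping `header=FALSE`: the sketch below puts a header line in front of the same two rows and lets `read.delim()` use its default `header = TRUE`. The header text is copied from the file excerpt in the other answer; that it sits directly above the data rows in the real file is an assumption, and the names `rows` and `dta2` are only illustrative.

```r
# Header line (assumed, taken from the file excerpt in the other answer)
# followed by the two tab-delimited rows tested above.
rows <- c(
  "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct",
  "4350031..4351662\t-\t543\t1790567\tdcuS\tb4125\t-\t-\tsensory histidine kinase in two-component regulatory system with DcuR, regulator of anaerobic fumarate respiration",
  "4351843..4352073\t+\t76\t1790568\tyjdI\tb4126\t-\t-\tputative 4Fe-4S mono-cluster protein"
)

# read.delim() defaults to header = TRUE and sep = "\t",
# so the first element supplies the column names.
dta2 <- read.delim(text = rows, stringsAsFactors = FALSE)

print(names(dta2))
print(dta2[, c("Location", "Strand", "Length", "Gene")])
```

The columns then carry the names from the file instead of V1 to V9, which makes later subsetting and plotting easier to read.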