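I want to download the table from the TargetScan results web page. The expected output looks like this (created by copying and pasting the content into Excel, typing in the colnames by hand, and exporting as txt :( ):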
# > head(results)
# Target.gene Representative.3UTR X3UTR.expression.profile sites.total sites.8mer sites.7mer.m8
# 1 si:ch73-269m14.4 ENSDARG00000086612.1 72h,Adult,Brain,Testis 7 0 7
# 2 eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2
# 3 WFIKKN2 (2 of 2) ENSDARG00000059139.1 72h,Adult,Brain,Testis 3 1 2
# 4 si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2
# 5 RSF1 (3 of 3) ENSDARG00000074737.1 24h,Adult 6 1 4
# 6 wnt2ba ENSDARG00000005050.1 72h,Adult,Ovary,Testis 3 1 2
#
# sites.7mer1A rep.mirna context..score link
# 1 0 dre-miR-430b -1.63 Sites in UTR
# 2 0 dre-miR-430b -0.8 Sites in UTR
# 3 0 dre-miR-430b -0.76 Sites in UTR
# 4 0 dre-miR-430b -0.68 Sites in UTR
# 5 1 dre-miR-430b -0.67 Sites in UTR
# 6 0 dre-miR-430a -0.66 Sites in UTR
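I tried to import this table from HTML into R using rvest or XML, but failed.
rvest attempt: I got the XPath of the table node via right-click -> Inspect in Chrome. I then tried to scrape the table with the code below, but all I got was a table containing only the headers: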
library(rvest)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
ts.page <- read_html(ts.url)
results <- html_table(html_node(ts.page, xpath='//*[@id="restable"]'), fill = T)
# > results
# Target gene Representative 3' UTR 3' UTR expression profile All sites All sites All sites All sites
# 1 Target gene Representative 3' UTR 3' UTR expression profile total All sites All sites All sites
# Repre- sentative miRNA Total context+ score Links to sites in UTRs
# 1 8mer 7mer-m8 7mer-1A
XML attempt:
library(XML)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
tables <- readHTMLTable(ts.url)
# > tables[3]
# $restable
# V1 V2 V3 V4
# 1 total 8mer 7mer-m8 7mer-1A
# another solution from a post on SO:
library(RCurl)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
webpage <- getURL(ts.url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
tablehead <- xpathSApply(pagetree, '//*/table[@id="restable"]/tr/th', xmlValue)
results <- xpathSApply(pagetree, '//*/table[@id="restable"]/tr/td', xmlValue)
# > tablehead
# [1] "Target gene" "Representative 3' UTR" "3' UTR expression profile" "All sites"
# [5] "Repre- sentative miRNA" "Total context+ score" "Links to sites in UTRs" "total"
# [9] "8mer" "7mer-m8" "7mer-1A"
# > results
# list()
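My question is: how do I correctly import the table from this web page with rvest or XML? (As long as the table contents are extracted, the headers do not matter.)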
Answer 0 (score: 2)
These rows have a closing </tr> but no opening <tr> tag, so you can add the opening tags, after which readHTMLTable should work:
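library(XML)
# ts.url is the TargetScan URL defined in the question
x <- readLines(ts.url)
# add the missing opening <tr> to every line that starts with <td>
x <- gsub("^<td>", "<tr><td>", x)
y <- readHTMLTable(x, which=3, skip.rows=1)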
dim(y)
[1] 4184 10
head(y)
V1 V2 V3 V4 V5 V6 V7
1 si:ch73-269m14.4 ENSDARG00000086612.1 72h,Adult,Brain,Testis 7 0 7 0
2 eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2 0
3 WFIKKN2 (2 of 2) ENSDARG00000059139.1 72h,Adult,Brain,Testis 3 1 2 0
4 si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis 3 1 2 0
5 RSF1 (3 of 3) ENSDARG00000074737.1 24h,Adult 6 1 4 1
6 wnt2ba ENSDARG00000005050.1 72h,Adult,Ovary,Testis 3 1 2 0
V8 V9 V10
1 dre-miR-430b -1.63 Sites in UTR
2 dre-miR-430b -0.80 Sites in UTR
3 dre-miR-430b -0.76 Sites in UTR
4 dre-miR-430b -0.68 Sites in UTR
5 dre-miR-430b -0.67 Sites in UTR
6 dre-miR-430a -0.66 Sites in UTR
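To make the repair step easier to see in isolation, here is a minimal base-R sketch on a few made-up lines. The `html_lines` vector and its gene names are hypothetical stand-ins for the real page, which is assumed to put each malformed row on its own line starting with `<td>`.

```r
# Toy HTML lines mimicking the malformed rows (hypothetical data, not the real page)
html_lines <- c(
  "<table id=\"restable\">",
  "<tr><th>Target gene</th><th>total</th></tr>",
  "<td>geneA</td><td>7</td></tr>",
  "<td>geneB</td><td>3</td></tr>",
  "</table>"
)

# Lines that start with <td> are missing their opening <tr>
missing_open <- grepl("^<td>", html_lines)

# Prepend the missing <tr>; lines that already start with <tr> are left alone
repaired <- gsub("^<td>", "<tr><td>", html_lines)

# After the repair, the opening and closing <tr> counts should agree
n_open  <- sum(grepl("<tr>", repaired, fixed = TRUE))
n_close <- sum(grepl("</tr>", repaired, fixed = TRUE))
```

Comparing `n_open` with `n_close` before passing the text to a parser is a cheap sanity check. If the page ever emitted several rows on one line, the `^` anchor would miss them and the pattern would need adjusting.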
Answer 1 (score: 1)
I think this gets you very close to what you want. Nobody likes tables with uneven headers, haha.
library(rvest)
url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
dat <- read_html(url)
# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
  html_nodes("td") %>%
  html_text()
# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = T)
# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = F)
library(rvest)
url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
dat <- read_html(url)
# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
  html_nodes("td") %>%
  html_text()
# extract the href of every link in the table
links <- html_node(dat, xpath = '//*[@id="restable"]') %>%
  html_nodes("a") %>%
  html_attr('href')
my_links_matrix <- matrix(links, ncol = 2, byrow = T)
my_links_df <- data.frame(my_links_matrix)
# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = T)
# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = F)
my_df <- cbind(my_df, my_links_df)
You may want to check that the links line up with the other values in the table / data frame.
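Both cautions above (the hard-coded column count and the link alignment) can be turned into explicit checks. The following is only a sketch on toy vectors: `cells_to_df` is a hypothetical helper, not part of rvest, and the two-links-per-row layout is the same assumption the `ncol = 2` call makes.

```r
# Hypothetical helper: reshape a flat vector of cell texts into a data frame,
# refusing to proceed when the cell count is not a multiple of the column count
cells_to_df <- function(cells, n_col) {
  if (length(cells) %% n_col != 0) {
    stop("cell count ", length(cells), " is not a multiple of ", n_col)
  }
  data.frame(matrix(cells, ncol = n_col, byrow = TRUE), stringsAsFactors = FALSE)
}

# Toy input standing in for the html_text() output (not real TargetScan data)
cells <- c("geneA", "7", "-1.63",
           "geneB", "3", "-0.80")
toy_df <- cells_to_df(cells, n_col = 3)

# Toy links: two per table row, as the ncol = 2 call above assumes
links <- c("a1", "a2", "b1", "b2")
toy_links <- cells_to_df(links, n_col = 2)

# Only bind the pieces when they describe the same number of rows
stopifnot(nrow(toy_df) == nrow(toy_links))
combined <- cbind(toy_df, toy_links)
```

A matching row count does not prove that every link sits next to the right gene, so spot-checking a few rows against the web page is still worthwhile.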