How do I correctly extract a table with two header rows from a web page?

Asked: 2017-03-15 14:22:53

Tags: r xml rvest

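I want to download the table from the targetScan results web page. The expected output looks like this (created, sadly, by copying and pasting the contents into Excel, typing in the colnames by hand, and exporting as txt):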

# > head(results)
#        Target.gene  Representative.3UTR         X3UTR.expression.profile sites.total sites.8mer sites.7mer.m8
# 1 si:ch73-269m14.4 ENSDARG00000086612.1           72h,Adult,Brain,Testis           7          0             7
# 2          eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis           3          1             2
# 3 WFIKKN2 (2 of 2) ENSDARG00000059139.1           72h,Adult,Brain,Testis           3          1             2
# 4  si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis           3          1             2
# 5    RSF1 (3 of 3) ENSDARG00000074737.1                        24h,Adult           6          1             4
# 6           wnt2ba ENSDARG00000005050.1           72h,Adult,Ovary,Testis           3          1             2
#
#   sites.7mer1A    rep.mirna context..score         link
# 1            0 dre-miR-430b          -1.63 Sites in UTR
# 2            0 dre-miR-430b           -0.8 Sites in UTR
# 3            0 dre-miR-430b          -0.76 Sites in UTR
# 4            0 dre-miR-430b          -0.68 Sites in UTR
# 5            1 dre-miR-430b          -0.67 Sites in UTR
# 6            0 dre-miR-430a          -0.66 Sites in UTR

I tried to import this table from the HTML into R using rvest and XML, but both attempts failed.

rvest attempt:

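I got the XPath of the table node via right-click -> Inspect in Chrome. I then tried to scrape the table with the code below, but all I got back was a table containing only the headers: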

library(rvest)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
ts.page <- read_html(ts.url)
results <- html_table(html_node(ts.page, xpath='//*[@id="restable"]'), fill = TRUE)

# > results
#   Target gene Representative 3' UTR 3' UTR expression profile All sites All sites All sites All sites
# 1 Target gene Representative 3' UTR 3' UTR expression profile     total All sites All sites All sites
#   Repre- sentative miRNA Total context+ score Links to sites in UTRs
# 1                   8mer              7mer-m8                7mer-1A

XML attempt:

library(XML)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
tables <- readHTMLTable(ts.url)

# > tables[3]
# $restable
#      V1   V2      V3      V4
# 1 total 8mer 7mer-m8 7mer-1A

# another solution from a post on SO:
library(RCurl)
ts.url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="
webpage <- getURL(ts.url)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

tablehead <- xpathSApply(pagetree, '//*/table[@id="restable"]/tr/th', xmlValue)
results <- xpathSApply(pagetree, '//*/table[@id="restable"]/tr/td', xmlValue)

# > tablehead
#  [1] "Target gene"               "Representative 3' UTR"     "3' UTR expression profile" "All sites"                
#  [5] "Repre- sentative miRNA"    "Total context+ score"      "Links to sites in UTRs"    "total"                    
#  [9] "8mer"                      "7mer-m8"                   "7mer-1A"                  
# > results
# list()

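My question: how can I correctly import the table from this web page with rvest or XML? (As long as the table contents are extracted, the headers do not matter.)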

2 answers:

Answer 0 (score: 2)

The data rows have a closing </tr> tag but no opening <tr> tag, so you can add the missing opening tags, and then readHTMLTable should work:

x <- readLines(ts.url)
x <- gsub("^<td>", "<tr><td>", x)
y <- readHTMLTable(x, which=3, skip.rows=1)
dim(y)
# [1] 4184   10
head(y)
#                 V1                   V2                               V3 V4 V5 V6 V7
# 1 si:ch73-269m14.4 ENSDARG00000086612.1           72h,Adult,Brain,Testis  7  0  7  0
# 2          eef2a.1 ENSDARG00000042094.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
# 3 WFIKKN2 (2 of 2) ENSDARG00000059139.1           72h,Adult,Brain,Testis  3  1  2  0
# 4  si:ch211-59h6.1 ENSDARG00000013022.1 24h,72h,Adult,Brain,Ovary,Testis  3  1  2  0
# 5    RSF1 (3 of 3) ENSDARG00000074737.1                        24h,Adult  6  1  4  1
# 6           wnt2ba ENSDARG00000005050.1           72h,Adult,Ovary,Testis  3  1  2  0
#             V8    V9          V10
# 1 dre-miR-430b -1.63 Sites in UTR
# 2 dre-miR-430b -0.80 Sites in UTR
# 3 dre-miR-430b -0.76 Sites in UTR
# 4 dre-miR-430b -0.68 Sites in UTR
# 5 dre-miR-430b -0.67 Sites in UTR
# 6 dre-miR-430a -0.66 Sites in UTR
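
To see what the gsub() step does without hitting the site, here is a minimal base-R sketch on a few made-up lines. The `toy_html` vector is hypothetical and only imitates the pattern described above (rows that close with </tr> but never open with <tr>); it is not the real TargetScan markup.

```r
# Hypothetical lines imitating rows that end with </tr> but lack an opening <tr>
toy_html <- c(
  "<table id=\"restable\">",
  "<tr><th>gene</th><th>sites</th></tr>",
  "<td>eef2a.1</td><td>3</td></tr>",
  "<td>wnt2ba</td><td>3</td></tr>",
  "</table>"
)

# Prepend the missing <tr> to every line that starts directly with <td>
fixed_html <- gsub("^<td>", "<tr><td>", toy_html)

# Helper (illustrative only): count literal occurrences of a tag across lines
count_tag <- function(lines, tag) {
  sum(lengths(regmatches(lines, gregexpr(tag, lines, fixed = TRUE))))
}

# After the fix, opening and closing row tags should balance
n_open  <- count_tag(fixed_html, "<tr>")
n_close <- count_tag(fixed_html, "</tr>")
```

The `^<td>` anchor matters: it only touches lines that begin with a cell, so a header row that already starts with <tr> is left alone. This relies on each row starting on its own line, which is an assumption about the page source.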

Answer 1 (score: 1)

I think this gets you pretty close to what you want. Nobody likes tables with uneven headers, lol.

library(rvest)

url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="


dat <- read_html(url)

# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
    html_nodes("td") %>%
    html_text()


# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = TRUE)

# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = FALSE)
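
If column names are wanted after all, one possible approach is to flatten the two header rows into a single vector of ten names. The sketch below uses the header cells printed as `tablehead` in the question; `flatten_header` is a hypothetical helper, and it assumes that "All sites" is the only top-level cell spanning several columns.

```r
# Header cells as printed by the xpathSApply() call in the question
top <- c("Target gene", "Representative 3' UTR", "3' UTR expression profile",
         "All sites", "Repre- sentative miRNA", "Total context+ score",
         "Links to sites in UTRs")
sub <- c("total", "8mer", "7mer-m8", "7mer-1A")

# Hypothetical helper: replace the spanning cell with one name per sub-header
flatten_header <- function(top, sub, spanning = "All sites") {
  pos <- match(spanning, top)
  c(head(top, pos - 1), paste(spanning, sub), tail(top, length(top) - pos))
}

# make.names() turns the labels into syntactically valid R column names
col_names <- make.names(flatten_header(top, sub))
```

With a 10-column data frame such as `my_df`, the names could then be applied with `names(my_df) <- col_names`.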

Edit: as a bonus, here is how to capture the URLs of the links in the table

library(rvest)

url <- "http://www.targetscan.org/cgi-bin/targetscan/fish_62/targetscan.cgi?gid=&mir_sc=miR-430&mir_nc=&mirg="


dat <- read_html(url)

# extract the "td" elements from table
x <- html_node(dat, xpath = '//*[@id="restable"]') %>%
    html_nodes("td") %>%
    html_text()

# extract the href attribute of every link in the table
links <- html_node(dat, xpath = '//*[@id="restable"]') %>%
    html_nodes("a") %>%
    html_attr('href')

my_links_matrix <- matrix(links, ncol = 2, byrow = TRUE)
my_links_df <- data.frame(my_links_matrix)

# put these in a character matrix -- be careful manually setting number of columns
my_matrix <- matrix(x, ncol = 10, byrow = TRUE)

# put these in a dataframe if you prefer that
my_df <- data.frame(my_matrix, stringsAsFactors = FALSE)


my_df <- cbind(my_df, my_links_df)

You may want to check that the links line up with the other values in the table/data frame.
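
One way to make that check explicit is sketched below in base R. matrix() recycles its input (with at most a warning) when the number of cells does not fit the requested shape, so the hypothetical helper `cells_to_df` stops instead. The toy vectors stand in for the html_text() and html_attr() results, and two links per row is an assumption carried over from the code above.

```r
# Hypothetical helper: reshape a flat vector of cells into rows, refusing to
# recycle when the length is not a multiple of the column count
cells_to_df <- function(cells, n_col) {
  if (length(cells) %% n_col != 0) {
    stop("number of cells (", length(cells), ") is not a multiple of ", n_col)
  }
  data.frame(matrix(cells, ncol = n_col, byrow = TRUE), stringsAsFactors = FALSE)
}

# Toy data standing in for the scraped cell text and link targets
toy_cells <- c("geneA", "ENSDARG1", "geneB", "ENSDARG2")
toy_links <- c("/a1", "/a2", "/b1", "/b2")

toy_df      <- cells_to_df(toy_cells, n_col = 2)
toy_link_df <- cells_to_df(toy_links, n_col = 2)
names(toy_link_df) <- c("link1", "link2")

# Only bind when both pieces describe the same number of rows
stopifnot(nrow(toy_df) == nrow(toy_link_df))
combined <- cbind(toy_df, toy_link_df)
```

Equal row counts do not prove that every link sits next to the right gene (a row with a missing link would shift everything after it), so spot-checking a few rows against the page is still worthwhile.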