R

Asked: 2018-02-27 18:04:02

Tags: r web-scraping rvest httr

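I need to scrape the following web page. I have to submit a form with specific values and then import the resulting data into R as a data.frame (the results page also offers the results through an "as text file" link). I tried to submit the form with the following code, but I get no results: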

library(rvest)
library(httr)

POST(
  url = "http://tempest.wellesley.edu/~btjaden/TargetRNA2/advanced.html",
  encode = "form",
  body=list(
    `text` = "Escherichia coli str. K-12 substr. MG1655",
    `sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
    `sRNA_subregions` = "on",
    `window` = "13",
    `before` = "80",
    `after` = "20",
    `seed` = "7",
    `interaction_region` = "20",
    `candidate_targets` = "",
    `mRNA_accessibility` = "on",
    `sigle_target` = "",
    `pvalue`= "0.05",
    `max_interactions`="400"
  ),
  verbose()
) -> res
content(res, as="parsed")

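I think there is an intermediate page (http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi) that is shown before the results load, and I don't know which parameters that page expects, so I can't get the results. I want to get this table (http://tempest.wellesley.edu/~btjaden/cgi-bin/targetRNA2.cgi?t1519754493.26):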

Rank    Gene    Synonym Energy  Pvalue  sRNA_start  sRNA_stop   mRNA_start  mRNA_stop
1   sdhD    b0722   -12.98  0.004       28          42          -34         -20
2   ascG    b2714   -12.65  0.005       52          65          8           20
3   ygjH    b3074   -12.24  0.006       45          59          -8          6
4   sodB    b1656   -11.43  0.011       37          50          -7          6
5   acnA    b1276   -11.14  0.013       33          48          -6          9
6   srlQ    b2708   -10.79  0.015       34          48          -6          8
7   cirA    b2155   -10.71  0.016       40          57          -58         -40
8   nirB    b3365   -10.51  0.018       37          55          -6          13
9   djlB    b0646   -10.41  0.019       53          63          9           19
10  shiA    b1981   -9.96   0.024       43          58          -63         -47
11  yhhN    b3468   -9.78   0.026       50          62          -61         -49
12  ybbP    b0496   -9.45   0.030       48          59          -7          4
13  ssuD    b0935   -9.43   0.031       50          62          -19         -7
14  cysE    b3607   -8.99   0.037       33          49          -8          10
15  insH1   b2030   -8.86   0.039       29          39          -75         -65
16  hscA    b2526   -8.82   0.040       52          66          -20         -5
17  yciS    b1279   -8.69   0.043       45          59          -10         5
18  dhaL    b1199   -8.63   0.044       37          50          -8          6
19  nuoA    b2288   -8.6    0.044       42          59          -8          8
20  narG    b1224   -8.47   0.047       36          47          -51         -40
21  yraK    b3145   -8.37   0.049       27          41          -80         -68

1 answer:

Answer 0 (score: 0)

The POST should go to the processRequest2.cgi endpoint, not to advanced.html, which is only the page that holds the form:

library(rvest)
library(httr)

POST(
  url = "http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi",
  encode = "form",
  body=list(
    `text` = "Escherichia coli str. K-12 substr. MG1655",
    `sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
    `sRNA_subregions` = "on",
    `window` = "13",
    `before` = "80",
    `after` = "20",
    `seed` = "7",
    `interaction_region` = "20",
    `candidate_targets` = "",
    `mRNA_accessibility` = "on",
    `sigle_target` = "",
    `pvalue`= "0.05",
    `max_interactions`="400"
  ),
  verbose()
) -> res

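After that, you can extract the URL that the response eventually redirects you to (it is in the page's meta refresh tag):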

content(res, as="parsed") %>% 
  html_node(xpath=".//meta[@http-equiv]") %>% 
  html_attr("content") %>% 
  strsplit("=") %>% 
  .[[1]] %>% 
  .[2] %>% 
  sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", .) -> target_url
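The pipeline above splits the refresh value on every "=", so it would cut the target short if the relative URL ever contained a second "=". A minimal sketch of a slightly more defensive variant follows. It runs on an inline HTML string rather than the live site, and the markup of that string (in particular the `6; url=...` layout of the `content` attribute) is an assumption, not a copy of the real interim page:

```r
library(rvest)

# Hypothetical stand-in for the interim page; the real markup may differ.
interim_html <- paste0(
  '<html><head>',
  '<meta http-equiv="refresh" content="6; url=targetRNA2.cgi?t1519754493.26">',
  '</head><body>Please wait</body></html>'
)

# Pull the content attribute of the meta refresh tag.
refresh <- read_html(interim_html) %>%
  html_node(xpath = ".//meta[@http-equiv]") %>%
  html_attr("content")

# Drop everything up to and including the first "url=" (case-insensitive),
# keeping the rest intact even if it contains further "=" characters.
rel_url <- sub("^.*?url=", "", refresh, ignore.case = TRUE, perl = TRUE)

target_url <- paste0("http://tempest.wellesley.edu/~btjaden/cgi-bin/", rel_url)
```

With the real response, `read_html(interim_html)` would be replaced by `content(res, as="parsed")`.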

The site says to wait 6 seconds:

Sys.sleep(6)
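A fixed pause may be too short if the job takes longer than advertised. One possible alternative is to retry until the page looks complete. The helper below, `poll_until`, is a hypothetical name introduced here for illustration, and the readiness test suggested in the comment (more than one table on the page) is a guess about this site, not documented behaviour:

```r
# Call fetch() up to `tries` times, pausing `wait` seconds between attempts,
# and return the first result for which ready() is TRUE.
poll_until <- function(fetch, ready, tries = 5, wait = 6) {
  for (i in seq_len(tries)) {
    result <- fetch()
    if (isTRUE(ready(result))) {
      return(result)
    }
    # No point in sleeping after the final attempt.
    if (i < tries) Sys.sleep(wait)
  }
  stop("result not ready after ", tries, " attempts")
}

# Possible use with the target_url built earlier (needs rvest and network access):
# pg <- poll_until(
#   fetch = function() read_html(target_url),
#   ready = function(p) length(html_nodes(p, "table")) > 1
# )
```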

Then you can fetch the results page and list its tables:

pg <- read_html(target_url)

html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table><tr>\n<td align="left"><code>GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGAC< ...
##  [2] <table width="800">\n<tr>\n<th align="center">Rank</th>\n  <th align="center" ...
##  [3] <table width="355"><tr>\n<td align="left">1</td>\n      <td width="90%">\n    ...
##  [4] <table width="100%">\n<tr><td></td></tr>\n<tr><td width="100%" bgcolor="white ...
##  [5] <table width="355"><tr>\n<td width="32%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [6] <table width="355"><tr>\n<td width="56%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [7] <table width="355"><tr>\n<td width="49%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [8] <table width="355"><tr>\n<td width="41%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [9] <table width="355"><tr>\n<td width="37%"> </td>\n      <td bgcolor="1E90FF">  ...
## [10] <table width="355"><tr>\n<td width="38%"> </td>\n      <td bgcolor="1E90FF">  ...
## [11] <table width="355"><tr>\n<td width="44%"> </td>\n      <td bgcolor="1E90FF">  ...
## [12] <table width="355"><tr>\n<td width="41%"> </td>\n      <td bgcolor="1E90FF">  ...
## [13] <table width="355"><tr>\n<td width="57%"> </td>\n      <td bgcolor="1E90FF">  ...
## [14] <table width="355"><tr>\n<td width="47%"> </td>\n      <td bgcolor="1E90FF">  ...
## [15] <table width="355"><tr>\n<td width="54%"> </td>\n      <td bgcolor="1E90FF">  ...
## [16] <table width="355"><tr>\n<td width="52%"> </td>\n      <td bgcolor="1E90FF">  ...
## [17] <table width="355"><tr>\n<td width="54%"> </td>\n      <td bgcolor="1E90FF">  ...
## [18] <table width="355"><tr>\n<td width="37%"> </td>\n      <td bgcolor="1E90FF">  ...
## [19] <table width="355"><tr>\n<td width="33%"> </td>\n      <td bgcolor="1E90FF">  ...
## [20] <table width="355"><tr>\n<td width="56%"> </td>\n      <td bgcolor="1E90FF">  ...
## ...
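In the listing above, the second table is the one whose header starts with Rank, so it is the likely candidate for the data.frame the question asks for. A minimal sketch of that last step, run on an inline sample instead of the live page (the sample markup is an assumption and keeps only five of the columns; its two data rows are taken from the table in the question):

```r
library(rvest)

# Hypothetical, trimmed-down stand-in for the results page.
sample_html <- paste0(
  '<html><body>',
  '<table><tr><td>GCGATCAGGAAGACC</td></tr></table>',
  '<table width="800">',
  '<tr><th>Rank</th><th>Gene</th><th>Synonym</th><th>Energy</th><th>Pvalue</th></tr>',
  '<tr><td>1</td><td>sdhD</td><td>b0722</td><td>-12.98</td><td>0.004</td></tr>',
  '<tr><td>2</td><td>ascG</td><td>b2714</td><td>-12.65</td><td>0.005</td></tr>',
  '</table>',
  '</body></html>'
)

pg_sample <- read_html(sample_html)

# Pick the second table and convert it; as.data.frame() gives a plain
# data.frame whether html_table() returns a data.frame or a tibble.
targets <- as.data.frame(
  html_table(html_nodes(pg_sample, "table")[[2]], header = TRUE)
)
```

On the real page the same call would be `html_table(html_nodes(pg, "table")[[2]], header = TRUE)`, assuming the table order shown in the listing holds.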