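I need to scrape the following web page. To do so, I have to submit a form with specific values. After submitting the form, I need to import the results into R as a data.frame (the "as text file" link on the results page shows them). I tried submitting the form with the code below, but I don't get the results: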
library(rvest)
library(httr)
POST(
url = "http://tempest.wellesley.edu/~btjaden/TargetRNA2/advanced.html",
encode = "form",
body=list(
`text` = "Escherichia coli str. K-12 substr. MG1655",
`sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
`sRNA_subregions` = "on",
`window` = "13",
`before` = "80",
`after` = "20",
`seed` = "7",
`interaction_region` = "20",
`candidate_targets` = "",
`mRNA_accessibility` = "on",
`sigle_target` = "",
`pvalue`= "0.05",
`max_interactions`="400"
),
verbose()
) -> res
content(res, as="parsed")
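I think there is an intermediate page (http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi) that is served before the results load, and I don't know which parameters it expects, so I can't get the results. This is the table I want (http://tempest.wellesley.edu/~btjaden/cgi-bin/targetRNA2.cgi?t1519754493.26):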
Rank Gene Synonym Energy Pvalue sRNA_start sRNA_stop mRNA_start mRNA_stop
1 sdhD b0722 -12.98 0.004 28 42 -34 -20
2 ascG b2714 -12.65 0.005 52 65 8 20
3 ygjH b3074 -12.24 0.006 45 59 -8 6
4 sodB b1656 -11.43 0.011 37 50 -7 6
5 acnA b1276 -11.14 0.013 33 48 -6 9
6 srlQ b2708 -10.79 0.015 34 48 -6 8
7 cirA b2155 -10.71 0.016 40 57 -58 -40
8 nirB b3365 -10.51 0.018 37 55 -6 13
9 djlB b0646 -10.41 0.019 53 63 9 19
10 shiA b1981 -9.96 0.024 43 58 -63 -47
11 yhhN b3468 -9.78 0.026 50 62 -61 -49
12 ybbP b0496 -9.45 0.030 48 59 -7 4
13 ssuD b0935 -9.43 0.031 50 62 -19 -7
14 cysE b3607 -8.99 0.037 33 49 -8 10
15 insH1 b2030 -8.86 0.039 29 39 -75 -65
16 hscA b2526 -8.82 0.040 52 66 -20 -5
17 yciS b1279 -8.69 0.043 45 59 -10 5
18 dhaL b1199 -8.63 0.044 37 50 -8 6
19 nuoA b2288 -8.6 0.044 42 59 -8 8
20 narG b1224 -8.47 0.047 36 47 -51 -40
21 yraK b3145 -8.37 0.049 27 41 -80 -68
Answer 0 (score: 0)
The POST should go to the processRequest2.cgi endpoint rather than to advanced.html:
library(rvest)
library(httr)
POST(
url = "http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi",
encode = "form",
body=list(
`text` = "Escherichia coli str. K-12 substr. MG1655",
`sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
`sRNA_subregions` = "on",
`window` = "13",
`before` = "80",
`after` = "20",
`seed` = "7",
`interaction_region` = "20",
`candidate_targets` = "",
`mRNA_accessibility` = "on",
`sigle_target` = "",
`pvalue`= "0.05",
`max_interactions`="400"
),
verbose()
) -> res
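The response is an intermediate page with a meta refresh tag. From it you can extract the URL you will eventually be redirected to: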
content(res, as="parsed") %>%
html_node(xpath=".//meta[@http-equiv]") %>%
html_attr("content") %>%
strsplit("=") %>%
.[[1]] %>%
.[2] %>%
sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", .) -> target_url
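To make the string handling in that pipeline easier to follow, here is a minimal base R sketch of the same steps applied to a literal string. The value of `refresh` is an assumption about what the `content` attribute of the meta tag looks like (a delay followed by `URL=` and a relative path, modeled on the results URL in the question); the real page may format it differently.

```r
# Assumed example of the meta refresh "content" attribute (hypothetical value).
refresh <- "6; URL=targetRNA2.cgi?t1519754493.26"

# Split on "=" and keep the second piece, as the pipeline above does.
parts <- strsplit(refresh, "=")[[1]]
relative <- parts[2]

# Prepend the cgi-bin directory to get an absolute URL.
target_url <- sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", relative)

# A slightly more tolerant variant: drop everything up to the first "=",
# so a path that itself contains "=" would not be truncated.
relative_alt <- sub("^[^=]*=", "", refresh)
```

Both variants give the same relative path for this input; the `sub()` form only matters if the redirect target ever carries `key=value` query parameters.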
The site says to wait 6 seconds before the results are ready:
Sys.sleep(6)
Then you can fetch the results page and list its tables:
pg <- read_html(target_url)
html_nodes(pg, "table")
## {xml_nodeset (89)}
## [1] <table><tr>\n<td align="left"><code>GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGAC< ...
## [2] <table width="800">\n<tr>\n<th align="center">Rank</th>\n <th align="center" ...
## [3] <table width="355"><tr>\n<td align="left">1</td>\n <td width="90%">\n ...
## [4] <table width="100%">\n<tr><td></td></tr>\n<tr><td width="100%" bgcolor="white ...
## [5] <table width="355"><tr>\n<td width="32%"> </td>\n <td bgcolor="1E90FF"> ...
## [6] <table width="355"><tr>\n<td width="56%"> </td>\n <td bgcolor="1E90FF"> ...
## [7] <table width="355"><tr>\n<td width="49%"> </td>\n <td bgcolor="1E90FF"> ...
## [8] <table width="355"><tr>\n<td width="41%"> </td>\n <td bgcolor="1E90FF"> ...
## [9] <table width="355"><tr>\n<td width="37%"> </td>\n <td bgcolor="1E90FF"> ...
## [10] <table width="355"><tr>\n<td width="38%"> </td>\n <td bgcolor="1E90FF"> ...
## [11] <table width="355"><tr>\n<td width="44%"> </td>\n <td bgcolor="1E90FF"> ...
## [12] <table width="355"><tr>\n<td width="41%"> </td>\n <td bgcolor="1E90FF"> ...
## [13] <table width="355"><tr>\n<td width="57%"> </td>\n <td bgcolor="1E90FF"> ...
## [14] <table width="355"><tr>\n<td width="47%"> </td>\n <td bgcolor="1E90FF"> ...
## [15] <table width="355"><tr>\n<td width="54%"> </td>\n <td bgcolor="1E90FF"> ...
## [16] <table width="355"><tr>\n<td width="52%"> </td>\n <td bgcolor="1E90FF"> ...
## [17] <table width="355"><tr>\n<td width="54%"> </td>\n <td bgcolor="1E90FF"> ...
## [18] <table width="355"><tr>\n<td width="37%"> </td>\n <td bgcolor="1E90FF"> ...
## [19] <table width="355"><tr>\n<td width="33%"> </td>\n <td bgcolor="1E90FF"> ...
## [20] <table width="355"><tr>\n<td width="56%"> </td>\n <td bgcolor="1E90FF"> ...
## ...
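The second node in that listing looks like the ranked results table, so one way to get a data.frame is to pass it to `html_table()`. The sketch below runs on a small inline HTML string that stands in for the results page (the markup and the two rows are a simplified assumption based on the output above and the table in the question), so it does not depend on the site being reachable. On the real page the result table appears to contain nested tables, so the output may need extra cleaning.

```r
library(rvest)

# Simplified stand-in for the results page (assumed structure, not the real markup).
html <- '<html><body>
<table><tr><td>GCGATCAGGAAGACC</td></tr></table>
<table width="800">
<tr><th>Rank</th><th>Gene</th><th>Synonym</th><th>Energy</th><th>Pvalue</th></tr>
<tr><td>1</td><td>sdhD</td><td>b0722</td><td>-12.98</td><td>0.004</td></tr>
<tr><td>2</td><td>ascG</td><td>b2714</td><td>-12.65</td><td>0.005</td></tr>
</table>
</body></html>'

pg <- read_html(html)
tables <- html_nodes(pg, "table")

# The results table is the second <table>; html_table() uses the <th> row as header.
targets <- as.data.frame(html_table(tables[[2]]))
```

With the real page you would replace `read_html(html)` with `read_html(target_url)` and inspect `targets` before relying on the column types.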