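I would like to get drug information from the Swiss government for a university research project:
http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue=
The site does provide a robots.txt file, but its content is freely available to the public, so I assume that scraping this data is not prohibited.
This is an update to an earlier question, as I have made some progress.
What I have achieved so far: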
# opens the first results page
# opens the first link as a table at the end of the page
library("rvest")
library("dplyr")
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 1,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
Next step: get the basic data
# makes a table of all results of the first page
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill = TRUE) %>%
  bind_rows() %>%
  tibble()
Next step: get the additional data
# gives the desired information (= additional data) for the first drug (not yet very structured)
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text()
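Since the text returned by html_text is described as not yet very structured, here is a minimal sketch of one way to tidy it, assuming the raw string consists of values separated by line breaks, tabs and padding spaces. The helper name tidy_text and the sample string are hypothetical and only serve as an illustration; the real detail text may need different handling.

```r
# Hypothetical helper (not part of rvest): split raw node text into clean fields.
tidy_text <- function(raw) {
  parts <- unlist(strsplit(raw, "[\r\n\t]+"))  # split on runs of line breaks and tabs
  parts <- trimws(parts)                        # strip surrounding spaces
  parts[nzchar(parts)]                          # drop fragments that are now empty
}

# Made-up sample string imitating whitespace-heavy html_text output
raw <- "\r\n   Accolate\r\n  \r\n   Tabl 20 mg \r\n\t\t\r\n   60 Stk\r\n"
fields <- tidy_text(raw)
print(fields)
```

The result is a plain character vector, which could then be matched against field labels once the layout of the detail view is known.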
My problem:
# if I open the second search page
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 2,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
Next step: get the new basic data
# I easily get a table with the new results
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill = TRUE) %>%
  bind_rows() %>%
  tibble()
However, if I try to get the new additional data, I get the results from page 1 again:
# does not give the desired output:
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_fvwPreparation"]') %>%
  html_text()
What I am looking for: the detailed data of the first drug on page 2.
Question: might __VIEWSTATE change during the new request_POST?
Answer 0 (score: 3)
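I think you are just overthinking this. The issue lies in the xpath. Essentially, the xpath you use for data extraction is the same for every page, namely //*[@id="ctl00_cphContent_gvwPreparations"], and the only component of the code that changes is txtPageNumber. In the code below I changed txtPageNumber to 3, i.e. txtPageNumber=3. I suggest focusing on how to automate the page numbering for extracting the data. That way, you would not have to change txtPageNumber by hand for each page.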
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 3,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
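The page-number automation suggested above could be sketched roughly as follows. The helper build_page_body is hypothetical (it is not part of rvest), and fake_fields is a stand-in for pgform$fields so that the sketch runs without a network connection. It only builds the request body; the POST call itself stays as in the snippet above.

```r
# Hypothetical helper: build the POST body for a given results page.
# `fields` is assumed to look like pgform$fields: a named list whose
# elements each carry a `value`.
build_page_body <- function(fields, page_number, page_size = "10") {
  prefix <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$"
  hidden <- c("__VIEWSTATE", "__VIEWSTATEGENERATOR",
              "__VIEWSTATEENCRYPTED", "__EVENTVALIDATION")
  # copy the hidden ASP.NET state values from the parsed form
  body <- lapply(fields[hidden], function(f) f$value)
  # the only part that differs between pages
  body[[paste0(prefix, "txtPageNumber")]] <- page_number
  body[[paste0(prefix, "ddlPageSize")]] <- page_size
  body[["__EVENTTARGET"]] <- "ctl00$cphContent$gvwPreparations$ctl02$ctl00"
  body[["__EVENTARGUMENT"]] <- ""
  body
}

# Stand-in for pgform$fields (made-up values, for offline illustration only)
fake_fields <- list(
  `__VIEWSTATE` = list(value = "state-token"),
  `__VIEWSTATEGENERATOR` = list(value = "gen-token"),
  `__VIEWSTATEENCRYPTED` = list(value = ""),
  `__EVENTVALIDATION` = list(value = "validation-token")
)

# One body per page; each could be passed as `body =` to the POST call
bodies <- lapply(1:3, function(n) build_page_body(fake_fields, n))
page_key <- "ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber"
print(sapply(bodies, function(b) b[[page_key]]))
```

With the real form, build_page_body(pgform$fields, n) would replace the hand-written list, so looping over n avoids editing txtPageNumber manually.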
The following code works for me:
library(rvest)
library(dplyr)
url <- "http://www.spezialitaetenliste.ch/ShowPreparations.aspx?searchType=Substance&searchValue="
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
page <- rvest:::request_POST(pgsession, url,
  body = list(
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$txtPageNumber` = 3,
    `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
    `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
    `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
    `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value,
    `ctl00$cphContent$gvwPreparations$ctl13$gvwpPreparations$ddlPageSize` = "10",
    `__EVENTTARGET` = "ctl00$cphContent$gvwPreparations$ctl02$ctl00",
    `__EVENTARGUMENT` = ""
  ),
  encode = "form")
# makes a table of all results of the requested page (page 3 here)
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_table(fill = TRUE) %>%
  bind_rows() %>%
  tibble()
# A tibble: 11 x 1
.$`` $Präparat $`Galen. Form /~ $Packung $FAP $PP $SB $`Lim-Pkt` $Lim
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 21. Accolate Tabl 20 mg 60 Stk 29.75 50.55 "" "" ""
2 22. Accupaque Inj Lös 300 mg Plast F~ 32.00 53.10 "" "" ""
3 23. Accupaque Inj Lös 300 mg Plast F~ 61.15 86.60 "" "" ""
4 24. Accupaque Inj Lös 300 mg Plast F~ 120.~ 154.~ "" "" ""
5 25. Accupaque Inj Lös 350 mg Plast F~ 33.97 55.35 "" "" ""
6 26. Accupaque Inj Lös 350 mg Plast F~ 66.88 93.20 "" "" ""
7 27. Accupaque Inj Lös 350 mg Plast F~ 129.~ 164.~ "" "" ""
8 28. Accupro ~ Filmtabl 10 mg 30 Stk 8.56 18.00 "" "" ""
9 29. Accupro ~ Filmtabl 10 mg 100 Stk 26.60 46.90 "" "" ""
10 30. Accupro ~ Filmtabl 20 mg 30 Stk 14.02 28.35 "" "" ""
11 "Ein~ "Einträg~ "Einträge pro S~ "Einträ~ "Ein~ "Ein~ "Ein~ "Einträge~ "Ein~
# ... with 9 more variables: $`Swissmedic-Code` <chr>, $Zulassungsinhaberin <chr>,
# $Wirkstoff <chr>, $`BAG-Dossier` <chr>, $Aufnahme <chr>, $`Befr. AufnahmeBefr.
# Limitation` <chr>, $`O/G` <chr>, $`IT-Code` <chr>, $`ATC-Code` <chr>
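The last row of the printed tibble (row 11, whose cells all start with "Ein~") looks like the pager footer of the grid rather than a drug entry. A minimal sketch of one way to drop such rows, assuming that genuine result rows carry a running number such as "21." in the first column. The helper drop_pager_rows and the sample data are hypothetical, and the footer text in the sample is only a stand-in because the printed value is truncated.

```r
# Hypothetical helper: keep only rows whose first column is a running number like "21."
drop_pager_rows <- function(tbl) {
  is_result <- grepl("^[0-9]+\\.$", trimws(tbl[[1]]))
  tbl[is_result, , drop = FALSE]
}

# Small made-up table imitating the structure of the output above
sample_tbl <- data.frame(
  nr = c("21.", "22.", "pager footer"),
  Praeparat = c("Accolate", "Accupaque", "pager footer"),
  stringsAsFactors = FALSE
)

cleaned <- drop_pager_rows(sample_tbl)
print(cleaned)
```

The same function can be applied to the tibble produced by the pipeline above, since it only relies on positional access to the first column.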
# gives the desired information for the first drug (not yet very structured)
read_html(page) %>%
  html_nodes(xpath = '//*[@id="ctl00_cphContent_gvwPreparations"]') %>%
  html_text() %>%
  head(10)
[1] " PräparatGalen. Form / DosierungPackungFAPPPSBLim-PktLimSwissmedic-CodeZulassungsinhaberinWirkstoffBAG-DossierAufnahmeBefr. AufnahmeBefr. LimitationO/GIT-CodeATC-Code\r\n\t\t\t\t\r\n 21.\r\n \r\n Accolate\r\n \r\n Tabl 20 mg \r\n \r\n 60 Stk\r\n \r\n 29.75\r\n \r\n 50.55\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n 53750036\r\n \r\n AstraZeneca AG\r\n \r\n Zafirlukastum\r\n \r\n 17053\r\n \r\n 15.03.1998\r\n \r\n \r\n \r\n \r\n \r\n \r\n 03.04.50.\r\n \r\n R03DC01\r\n \r\n\t\t\t\t\r\n 22.\r\n \r\n Accupaque\r\n \r\n
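On the __VIEWSTATE question raised above: ASP.NET WebForms pages generally re-issue hidden fields such as __VIEWSTATE and __EVENTVALIDATION with every response, so it may be worth reading these values from the page-2 response before sending a further POST that selects a row, instead of reusing the values from the initial form. Whether this resolves the detail-view problem on this particular site is untested. The sketch below only shows how the hidden fields could be collected from a parsed page; the helper extract_hidden_fields and the sample HTML are hypothetical.

```r
library(rvest)

# Hypothetical helper: collect name/value pairs of all hidden <input> elements.
extract_hidden_fields <- function(doc) {
  inputs <- html_nodes(doc, xpath = '//input[@type="hidden"]')
  values <- as.list(html_attr(inputs, "value"))
  names(values) <- html_attr(inputs, "name")
  values
}

# Made-up HTML standing in for the response of the page-2 request
sample_html <- '<html><body><form>
  <input type="hidden" name="__VIEWSTATE" value="page-2-state" />
  <input type="hidden" name="__EVENTVALIDATION" value="page-2-validation" />
  <input type="text" name="q" value="ignored" />
</form></body></html>'

fresh <- extract_hidden_fields(read_html(sample_html))
print(names(fresh))
```

With a live response, extract_hidden_fields(read_html(page)) would be the equivalent call, and the returned values could replace the pgform$fields entries in the next request body.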