Scraping many pages in R

Time: 2018-06-14 16:43:22

Tags: r web-scraping

I have been trying to scrape a patent data set from https://pdki-indonesia.dgip.go.id/index.php/paten?type=2&q8=1993&q27=ID&skip=0. I want to collect the following information: application number, status, and title. I have managed to scrape the first page, but I need to save all 6 pages. Any idea what the simplest way to do this in R would be? Here is my code:

library(rvest)
url <- 'https://pdki-indonesia.dgip.go.id/index.php/paten?type=2&q8=1993&q27=ID&skip=0'
webpage <- read_html(url)
no_html <- html_nodes(webpage,'.number')
no_data <- html_text(no_html)
status_html <- html_nodes(webpage,'.approved')
status_data <- html_text(status_html)
title_html <- html_nodes(webpage,'.title')
title_data <- html_text(title_html)
DF1 <- as.data.frame(cbind(no_data,status_data, title_data))
write.csv(DF1,"ptn.csv")

Many thanks in advance!

2 Answers:

Answer 0 (score: 0)

It appears there are only 10 entries per page. You can change the url so that it ends in ...=ID&skip=10 to get to the second page. The third page would be skip=20, and so on.

Here is one way to do this:

# this is the url we keep modifying
base_url <- 'https://pdki-indonesia.dgip.go.id/index.php/paten?type=2&q8=1993&q27=ID&skip='

# get list of all urls we need to access
urls <- paste0(base_url, seq(from = 0, by = 10, length.out = 6))

library(rvest)

# Using your current code
readUrl <- function(url){
  webpage <- read_html(url)
  no_html <- html_nodes(webpage,'.number')
  no_data <- html_text(no_html)
  status_html <- html_nodes(webpage,'.approved')
  status_data <- html_text(status_html)
  title_html <- html_nodes(webpage,'.title')
  title_data <- html_text(title_html)
  DF1 <- as.data.frame(cbind(no_data,status_data, title_data))
  return(DF1)
}

output_list <- lapply(urls, readUrl)
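
If you want a single CSV at the end, one option (a minimal sketch, assuming the three columns line up on every page; see the caveat below about mismatched lengths) is to row-bind the per-page data frames before writing, reusing the file name from the OP:

# stack the per-page data frames into one table and save it
all_pages <- do.call(rbind, output_list)
write.csv(all_pages, "ptn.csv")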

I think there is a problem with the HTML tags etc. in your code, because no_data, status_data and title_data do not all have the same length, so the shorter vectors get recycled.

> x <- readUrl(urls[1])
Warning message:
In cbind(no_data, status_data, title_data) :
  number of rows of result is not a multiple of vector length (arg 1)
> str(x)
'data.frame':   13 obs. of  3 variables:
 $ no_data    : Factor w/ 10 levels "P00199305441",..: 10 1 2 3 5 6 8 9 4 7 ...
 $ status_data: Factor w/ 5 levels "Berakhir","Dalam Proses",..: 4 1 4 4 4 4 5 2 5 5 ...
 $ title_data : Factor w/ 13 levels "ADISI YODIUM PADA PEANUT OIL TANPA MELALUI ESTERIFIKASI",..: 8 2 5 9 6 7 12 11 10 1 ...

For urls[1] (the one in the OP), no_data, status_data and title_data have lengths 10, 11 and 13 respectively.
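
One way around the recycling problem is to scrape each record as a unit rather than the three columns separately. The sketch below is only an assumption-laden variant: the hypothetical readUrlByRecord() takes the .content, .number, .approved and .title selectors from the Python answer below, and relies on html_node() returning NA when a field is missing inside a record, so the three columns stay aligned:

# scrape each patent record as a unit so missing fields become NA instead of shifting rows
readUrlByRecord <- function(url){
  webpage <- read_html(url)
  records <- html_nodes(webpage, '.content')   # assumed per-record container
  data.frame(
    no_data     = html_text(html_node(records, '.number')),
    status_data = html_text(html_node(records, '.approved')),
    title_data  = html_text(html_node(records, '.title')),
    stringsAsFactors = FALSE
  )
}

output_list <- lapply(urls, readUrlByRecord)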

Answer 1 (score: 0)

Thank you very much for the help. I have since solved my problem in Python, using the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup 
file = "paten.csv"
f = open(file, "w")
Headers = "Nomor, Status, Judul\n"
f.write(Headers)
for skip in range(0, 60, 10):  # skip = 0, 10, ..., 50 covers all 6 pages of 10 entries
    url = "https://pdki-indonesia.dgip.go.id/index.php/paten?type=2&q8=1993&q27=ID&skip={}".format(skip)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    # each patent record sits in its own div.content container
    Title = soup.find_all("div", {"class":"content"})
    for i in Title:
        try:
            Nomor = i.find("span", {"class":"number"}).get_text()
            Status = i.find("span", {"class":"approved"}).get_text()
            Judul = i.find('a', {"class":"title"}).get_text()
            print(Nomor, Status, Judul)
            f.write("{}".format(Nomor).replace(",","|")+ ",{}".format(Status).replace(",", " ")+ ",{}".format(Judul).replace(",", " ")+ "\n")
        except: AttributeError
f.close()