Web scraping: values are not being entered into the data frame

Time: 2018-10-18 04:40:15

Tags: r web-scraping rvest removing-whitespace

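My main goal is to extract content from a website and save it locally. When the content is updated on the website, the local data should reflect the change as well.
 I am able to read the data from the web page used in the code, and now I want to save the result into a data frame so that I can export it. I want the values of x6 to be entered into the data frame df, so that I can export the data frame to a text file or an Excel file. Alternatively, please suggest any other way to extract the data from the web page used in the code (web scraping). I suspect my for loop is not working correctly, so any help would be appreciated.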

library(rvest)
library(dplyr)
library(qdapRegex) # install.packages("qdapRegex")

google <- read_html("https://bidplus.gem.gov.in/bidresultlists")

(x <- google %>%
  html_nodes(".block") %>%
  html_text())

class(x)

(x1 <- gsub("                                                            ", "", x))
(x2 <- gsub("                                                        ", "", x1))
(x3 <- gsub("            ", "", x2))
(x4 <- gsub("    ", "", x3))
(x5 <- gsub("  ", "", x4))
(x6 <- gsub("\n", "", x5))

class(x6)
length(x6[i])
typeof(x6)

for (i in x6) {

  BIDNO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)
  Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)
  Quantity_Required <- rm_between(x6[i], "Quantity Required:", "Department Name And Address", extract = TRUE)
  Department_Name_And_Address <- rm_between(x6[i], "Department Name And Address:", "Start Date", extract = TRUE)
  Start_Date <- rm_between(x6[i], "Start Date:", "End Date", extract = TRUE)
  # End_Date <- rm_between(x6[i], "End Date: ", "Technical Evaluation", extract=TRUE)

  df <- data.frame("BID_NO", "Status", "Quantity_Required", "Department_Name_Address", "Start_Date")
}

df

View(df)

2 Answers:

Answer 0 (score: 1)

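The problem seems to be that you are building the data frame from the quoted strings "BID_NO" and so on. If you want to save the values into the data frame, you need to pass the variables that hold the values, without quotes:

df <- data.frame(BID_NO, Status, Quantity_Required, Department_Name_Address, Start_Date)

Even if all the code above that creates each field is correct and the values are saved into those variables, you will get a one-row data frame, because it is created inside the for loop and each iteration overwrites the previous version.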

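If you want to keep multiple rows, create an empty data frame, final_df, before the loop. Then, inside the loop, build the row with

df<-data.frame(BID_NO,Status,Quantity_Required,Department_Name_Address,Start_Date)

and bind it to final_df with rbind(). On the first pass this binds the row of data to the empty frame, and on each later pass it adds a new row.

Any data frame created inside the loop is recreated and overwritten on every pass, so the accumulated values have to be kept in a variable that lives outside the loop, such as final_df.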
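
The accumulate-with-rbind() pattern described above might look like the following minimal sketch. The sample strings and the extract_between() helper are hypothetical stand-ins (base R is used in place of qdapRegex so that the snippet runs on its own), and only two fields are extracted. Note that the loop iterates over indices with seq_along(x6); with for (i in x6), i is the string itself, so x6[i] returns NA.

```r
# Hypothetical sample strings standing in for the cleaned x6 values.
x6 <- c(
  "BID NO: GEM/2018/B/1 Status: Not Evaluated Quantity Required: 2",
  "BID NO: GEM/2018/B/2 Status: Evaluated Quantity Required: 5"
)

# Hypothetical helper: return the text between two literal markers, trimmed.
# The markers used here contain no regex metacharacters.
extract_between <- function(text, left, right) {
  pattern <- paste0(".*", left, "(.*?)", right, ".*")
  trimws(sub(pattern, "\\1", text, perl = TRUE))
}

final_df <- data.frame()  # empty frame created before the loop

for (i in seq_along(x6)) {
  BID_NO <- extract_between(x6[i], "BID NO:", "Status:")
  Status <- extract_between(x6[i], "Status:", "Quantity Required:")

  # One-row data frame built from the variables, not from quoted names.
  df <- data.frame(BID_NO, Status, stringsAsFactors = FALSE)

  # Append the new row to the accumulator that lives outside the loop.
  final_df <- rbind(final_df, df)
}

print(final_df)
```

Growing a data frame with rbind() is fine for a handful of rows; for larger inputs, building each column as a vector and calling data.frame() once is usually faster.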

Answer 1 (score: 1)

Targeting the elements you want with XPath is likely to be less frustrating and less error-prone:

library(rvest)
library(dplyr)

pg <- read_html("https://bidplus.gem.gov.in/bidresultlists")

Get all the bid blocks:

blocks <- html_nodes(pg, ".block")

Target the items and quantity div:

items_and_quantity <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Item(s)')]")

Pull out the items and quantity:

items <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Item(s)')]/following-sibling::span") %>% html_text(trim=TRUE)
quantity <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Quantity')]/following-sibling::span") %>% html_text(trim=TRUE) %>% as.numeric()
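
To try the following-sibling pattern without hitting the live site, here is a small offline sketch. The HTML fragment is a hypothetical stand-in, loosely modelled on the selectors used above; the real page's markup may differ.

```r
library(rvest)

# Hypothetical markup; an assumption, not a copy of the real page.
html <- '<div class="block">
  <div class="col-block">
    <p><strong>Item(s): </strong><span> Storage System </span></p>
    <p><strong>Quantity Required: </strong><span>2</span></p>
  </div>
</div>'

pg <- read_html(html)
blocks <- html_nodes(pg, ".block")

# Find the <strong> label, then step sideways to the <span> holding the value.
items <- html_nodes(
  blocks,
  xpath = ".//strong[contains(., 'Item(s)')]/following-sibling::span"
) %>% html_text(trim = TRUE)

quantity <- html_nodes(
  blocks,
  xpath = ".//strong[contains(., 'Quantity')]/following-sibling::span"
) %>% html_text(trim = TRUE) %>% as.numeric()

print(items)
print(quantity)
```

Anchoring on the label text rather than on position means the extraction does not depend on how much whitespace surrounds each value.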

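Get the department name and address, and modify it so that the three lines are separated by a pipe (|), which makes them easy to split apart later. The pipe symbol is a pain in regular expressions because it has to be escaped, but it is very unlikely to appear in the text, whereas a separator such as a tab often causes confusion later on.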

department_name_and_address <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Department Name And Address')]") %>% 
  html_text(trim=TRUE) %>% 
  gsub("\n", "|", .) %>% 
  gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", .)
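
Splitting on the pipe later could be sketched as follows. The input string is copied from the output shown further down; the variable names are illustrative only. Passing fixed = TRUE treats the pipe as a literal character, so no escaping is needed.

```r
# One value in the pipe-separated form produced above.
addr <- "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a"

# fixed = TRUE matches "|" literally instead of as regex alternation.
parts <- strsplit(addr, "|", fixed = TRUE)[[1]]

# Consecutive pipes leave empty strings behind; drop them.
parts <- parts[nzchar(parts)]

print(parts)
```

The regex form, strsplit(addr, "\\|"), gives the same pieces but requires the escape.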

Target the block header, which holds the bid number and status:

block_header <- html_nodes(blocks, "div.block_header")

Pull out the bid # (see the note at the end of the answer):

html_nodes(block_header, xpath=".//p[contains(@class, 'bid_no')]") %>%
  html_text(trim=TRUE) %>% 
  gsub("^.*: ", "", .) -> bid_no

Pull out the status:

html_nodes(block_header, xpath=".//p/b[contains(., 'Status')]/following-sibling::span") %>% 
  html_text(trim=TRUE) -> status

Target and extract the start and end dates:

html_nodes(blocks, xpath=".//strong[contains(., 'Start Date')]/following-sibling::span") %>%
  html_text(trim=TRUE) -> start_date

html_nodes(blocks, xpath=".//strong[contains(., 'End Date')]/following-sibling::span") %>%
  html_text(trim=TRUE) -> end_date

Make a data frame:

data.frame(
  bid_no,
  status,
  start_date,
  end_date,
  items,
  quantity,
  department_name_and_address,
  stringsAsFactors=FALSE
) -> xdf

Some bids are "RA" bids, so we can also create a column that tells us which ones they are:

xdf$is_ra <- grepl("/RA/", bid_no)

The resulting data frame:

str(xdf)
## 'data.frame': 10 obs. of  8 variables:
##  $ bid_no                     : chr  "GEM/2018/B/93066" "GEM/2018/B/93082" "GEM/2018/B/93105" "GEM/2018/B/93999" ...
##  $ status                     : chr  "Not Evaluated" "Not Evaluated" "Not Evaluated" "Not Evaluated" ...
##  $ start_date                 : chr  "25-09-2018 03:53:pm" "27-09-2018 09:16:am" "25-09-2018 05:08:pm" "26-09-2018 05:21:pm" ...
##  $ end_date                   : chr  "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" ...
##  $ items                      : chr  "automotive chassis fitted with engine" "automotive chassis fitted with engine" "automotive chassis fitted with engine" "Storage System" ...
##  $ quantity                   : num  1 1 1 2 90 1 981 6 4 376
##  $ department_name_and_address: chr  "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a" ...
##  $ is_ra                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
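
Since the question asks about exporting the result, here is a minimal sketch using base R's write.csv(). The two-row xdf below is a hypothetical stand-in with values taken from the output above, and the temporary file path is chosen purely for illustration.

```r
# Hypothetical two-row stand-in for the scraped data frame.
xdf <- data.frame(
  bid_no   = c("GEM/2018/B/93066", "GEM/2018/B/93999"),
  status   = c("Not Evaluated", "Not Evaluated"),
  quantity = c(1, 2),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file; replace with a real path as needed.
out_file <- tempfile(fileext = ".csv")
write.csv(xdf, out_file, row.names = FALSE)

# Read it back to confirm the round trip.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

For an Excel file, a package such as writexl (its write_xlsx() function) is one option. Re-running the scrape and overwriting the file is the simplest way to keep the local copy in step with the site.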

I'll leave turning the dates into POSIXct elements to you.
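
One way the date conversion might go is sketched below, using two of the date strings from the output above. The time zone is an assumption (the page does not state one here), and the LC_TIME locale is set to "C" because parsing the AM/PM marker with %p depends on the locale.

```r
# Make AM/PM parsing predictable regardless of the system locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Sample values copied from the str() output above.
start_date <- c("25-09-2018 03:53:pm", "27-09-2018 09:16:am")

# %I is the 12-hour clock and %p the AM/PM marker; note the colon before it.
# tz = "UTC" is an assumption; substitute the zone the site actually uses.
start_time <- as.POSIXct(
  toupper(start_date),
  format = "%d-%m-%Y %I:%M:%p",
  tz = "UTC"
)

print(start_time)
```

Any string that does not match the format comes back as NA, so checking anyNA(start_time) after the conversion is a cheap sanity test.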

The contiguous code, without the explanatory text, is here.

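Also, this isn't Java. A for loop is rarely the solution to a problem in R. And you should read up on regular expressions, because counting spaces for replacement is also a path fraught with peril and frustration.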
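
As an illustration of that last point, the chain of gsub() calls in the question, each matching a fixed number of spaces, can be replaced by a single vectorised regular expression. The input below is a hypothetical sample, not text from the real page.

```r
# Hypothetical raw text with irregular runs of spaces and newlines.
x <- c(
  "  BID NO:   GEM/2018/B/1 \n\n      Status:    Not Evaluated  ",
  "BID NO: GEM/2018/B/2\n   Status:  Evaluated"
)

# Collapse every run of whitespace (spaces, tabs, newlines) to one space,
# then trim the ends. gsub() and trimws() work on the whole vector at once,
# so no loop is needed.
clean <- trimws(gsub("[[:space:]]+", " ", x))

print(clean)
```

Because the pattern matches runs of any length, it keeps working if the site changes its indentation.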