Using R - collecting data from multiple URLs

Time: 2017-04-27 10:12:17

Tags: r xml loops web-scraping

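I have a data frame with multiple columns and rows. Some columns contain information; others are filled with NA and should be replaced with data.

Each row represents a specific instrument, and the columns hold various details of that instrument. The last column of the data frame holds each instrument's URL, which is then used to fetch the data for the empty columns: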

 Issuer  NIN or ISIN           Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes            NA         NA           NA         NA       NA
2 NBRK KZW1KD079146 discount notes            NA         NA           NA         NA       NA
3 NBRK KZW1KD079153 discount notes            NA         NA           NA         NA       NA
4 NBRK KZW1KD089137 discount notes            NA         NA           NA         NA       NA

 URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913

For example, with the following code I get the details for the first instrument, the row NBRK KZW1KD079112:

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

which contains the following:

                                                V1                                                                  V2
1                                     Trading code                                                         NTK007_1911
2                               List of securities                                                            official
3                              System of quotation                                                               price
4                                Unit of quotation                                   nominal value percentage fraction
5                               Quotation currency                                                                 KZT
6                               Quotation accuracy                                                        4 characters
7                       Trade lists admission date                                                            04/21/17
8                               Trade opening date                                                            04/24/17
9                       Trade lists exclusion date                                                            04/28/17
10                                        Security                                                                <NA>
11                                     Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12                                            NSIN                                                        KZW1KD079112
13                   Currency of issue and service                                                                 KZT
14               Nominal value in issue's currency                                                              100.00
15                      Number of registered bonds                                                       1,929,319,196
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
18 Settlement basis (days in month / days in year)                                                        actual / 365
19                       Date of circulation start                                                            04/21/17
20                          Circulation term, days                                                                   7
21              Register fixation date at maturity                                                            04/27/17
22                        Principal repayment date                                                            04/28/17
23                                    Paying agent                          Central securities depository JSC (Almaty)
24                                       Registrar                          Central securities depository JSC (Almaty)

From this, I keep only:

14               Nominal value in issue's currency                                                              100.00
16                     Number of bonds outstanding                                                       1,929,319,196
17                               Issue volume, KZT                                                     192,931,919,600
19                       Date of circulation start                                                            04/21/17
22                        Principal repayment date                                                            04/28/17
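A minimal sketch of that selection step, assuming the details table is a two-column data frame of labels and values as printed above. The `details` data frame is a hand-copied stand-in for `sp[[4]]`, and `get_field`, `to_number`, and `to_date` are hypothetical helper names, not part of the XML package. Looking rows up by label rather than by row number should keep working if the page adds or reorders rows.

```r
# Hypothetical stand-in for sp[[4]]: labels and values copied from the
# rows shown above, plus one row that is not wanted.
details <- data.frame(
  V1 = c("Nominal value in issue's currency",
         "Number of bonds outstanding",
         "Issue volume, KZT",
         "Date of circulation start",
         "Principal repayment date",
         "Paying agent"),
  V2 = c("100.00", "1,929,319,196", "192,931,919,600",
         "04/21/17", "04/28/17",
         "Central securities depository JSC (Almaty)"),
  stringsAsFactors = FALSE
)

# Look up one value by its label; returns NA if the label is absent.
get_field <- function(tbl, label) {
  idx <- which(tbl[[1]] == label)
  if (length(idx) == 0) NA_character_ else as.character(tbl[[2]][idx[1]])
}

# Strip thousands separators before converting to numeric.
to_number <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Dates on the page are month/day/two-digit year.
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

parsed <- list(
  nominal_value = to_number(get_field(details, "Nominal value in issue's currency")),
  n_bonds       = to_number(get_field(details, "Number of bonds outstanding")),
  issue_volume  = to_number(get_field(details, "Issue volume, KZT")),
  start_date    = to_date(get_field(details, "Date of circulation start")),
  end_date      = to_date(get_field(details, "Principal repayment date"))
)
```

The values come back typed (numeric and `Date`) instead of as strings, so they can be assigned directly into numeric or date columns.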

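Then I copy the required data into the initial data frame and move on to the next row. The data frame has more than 100 rows and will keep changing.

I would appreciate any help.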

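Update

It looks like the data I need is not always in sp[[4]]. Sometimes it is in sp[[7]], and in the future it may be in a completely different table. Is there a way to search the scraped tables for this information and identify which table to collect the data from? Currently the index is hard-coded: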

sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
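One possible approach, sketched under assumptions: instead of relying on a fixed index, scan the list returned by `readHTMLTable` for the first table whose first column contains a label known to appear in the details table. `find_details_table` is a hypothetical helper name, and the `tables` list below is an invented stand-in for `sp` so the example runs without network access.

```r
# Return the first table in `tables` whose first column contains `label`,
# or NULL if no table matches.
find_details_table <- function(tables,
                               label = "Nominal value in issue's currency") {
  for (tbl in tables) {
    # Skip entries that are NULL or too narrow to hold label/value pairs
    if (is.null(tbl) || !is.data.frame(tbl) || ncol(tbl) < 2) next
    if (label %in% as.character(tbl[[1]])) return(tbl)
  }
  NULL
}

# Invented stand-in for `sp`: a NULL entry, an unrelated table, and a
# table that contains the wanted label.
tables <- list(
  NULL,
  data.frame(V1 = "Some menu item", V2 = "Some link text",
             stringsAsFactors = FALSE),
  data.frame(V1 = c("NSIN", "Nominal value in issue's currency"),
             V2 = c("KZW1KD079112", "100.00"),
             stringsAsFactors = FALSE)
)

details <- find_details_table(tables)
```

With a real page, `find_details_table(sp)` would take the place of `sp[[4]]`. A `NULL` result signals that no table on the page carried the label, which is worth checking before reading values from it.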

1 answer:

Answer 0 (score: 1)

library(XML)
library(reshape2)
library(dplyr)

name = c(
"NBRK KZW1KD079112 discount notes",                                           
"NBRK KZW1KD079146 discount notes",                                        
"NBRK KZW1KD079153 discount notes",                                         
"NBRK KZW1KD089137 discount notes")                                           

URL = c(
"http://www.kase.kz/en/gsecs/show/NTK007_1911",
"http://www.kase.kz/en/gsecs/show/NTK007_1914",
"http://www.kase.kz/en/gsecs/show/NTK007_1915",
"http://www.kase.kz/en/gsecs/show/NTK008_1913")

# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)

# define the columns wanted and the mapping to desired name
# extend to all wanted columns
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding" = "# of Bonds Issue")

# function returns a data frame of wanted columns for given URL
getValues <- function (name, url) {
  # get the table and rename columns
  sp = readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[4]]
  names(df) <- c("full_name", "value")

  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted),]
  result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})

  # add the identifier to every row
  result$name <- name
  return (result[,c("name", "column_name", "value")])
}

# invoke function for each name/URL pair - returns list of data frames
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})

# bind using dplyr::bind_rows to make a tall data frame
tall <- bind_rows(columns)

# make wide using dcast from reshape2; value.var names the column holding the values
wide <- dcast(tall, name ~ column_name, value.var = "value")

wide

#                               name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes    1,929,319,196        100.00
# 2 NBRK KZW1KD079146 discount notes    1,575,000,000        100.00
# 3 NBRK KZW1KD079153 discount notes      701,390,693        100.00
# 4 NBRK KZW1KD089137 discount notes    1,380,368,000        100.00
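With more than 100 URLs that keep changing, a single page that fails to load or parse would stop the whole `apply` call. A minimal sketch of guarding each URL with `tryCatch` and then joining the results back onto the original data frame follows. `fetch_all` and `fake_fetch` are hypothetical names, the example.com URLs are placeholders, and `fake_fetch` only simulates scraping; in practice `fetch_fun` would be whatever function scrapes one page.

```r
# Hypothetical stand-in for scraping one page: returns a one-row data frame
# and fails for one URL to simulate a page that cannot be parsed.
fake_fetch <- function(url) {
  if (grepl("broken", url)) stop("page could not be parsed")
  data.frame(nominal_value = "100.00", stringsAsFactors = FALSE)
}

# Call fetch_fun on every URL. A failure prints a message and yields NULL
# instead of aborting the whole run.
fetch_all <- function(urls, fetch_fun) {
  results <- lapply(urls, function(u) {
    tryCatch({
      out <- fetch_fun(u)
      out$URL <- u   # keep the URL as the join key
      out
    }, error = function(e) {
      message("skipping ", u, ": ", conditionMessage(e))
      NULL
    })
  })
  # Drop the failures and stack the remaining one-row data frames
  do.call(rbind, Filter(Negate(is.null), results))
}

# Placeholder instruments; the second URL triggers the simulated failure.
instruments <- data.frame(
  name = c("A", "B"),
  URL  = c("http://example.com/ok", "http://example.com/broken"),
  stringsAsFactors = FALSE
)

fetched <- fetch_all(instruments$URL, fake_fetch)

# Left join: instruments whose page failed keep NA in the new column
filled <- merge(instruments, fetched, by = "URL", all.x = TRUE)
```

Note that `merge` sorts the result by the join key, so row order may differ from `instruments`. If every URL fails, `fetch_all` returns `NULL` and the merge step needs its own guard.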
