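I have a data frame with multiple columns and rows. Some columns contain information; others are filled with NA and should be replaced with real data.
Each row represents a specific instrument, and the columns hold various details of the instrument in that row. The last column of the data frame contains a URL for each instrument, which is then used to fetch the data for the empty columns: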
Issuer NIN or ISIN Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes NA NA NA NA NA
2 NBRK KZW1KD079146 discount notes NA NA NA NA NA
3 NBRK KZW1KD079153 discount notes NA NA NA NA NA
4 NBRK KZW1KD089137 discount notes NA NA NA NA NA
URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913
For example, with the following code I get the details of the first instrument, the one in the row NBRK KZW1KD079112:
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
which contains the following:
V1 V2
1 Trading code NTK007_1911
2 List of securities official
3 System of quotation price
4 Unit of quotation nominal value percentage fraction
5 Quotation currency KZT
6 Quotation accuracy 4 characters
7 Trade lists admission date 04/21/17
8 Trade opening date 04/24/17
9 Trade lists exclusion date 04/28/17
10 Security <NA>
11 Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12 NSIN KZW1KD079112
13 Currency of issue and service KZT
14 Nominal value in issue's currency 100.00
15 Number of registered bonds 1,929,319,196
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
18 Settlement basis (days in month / days in year) actual / 365
19 Date of circulation start 04/21/17
20 Circulation term, days 7
21 Register fixation date at maturity 04/27/17
22 Principal repayment date 04/28/17
23 Paying agent Central securities depository JSC (Almaty)
24 Registrar Central securities depository JSC (Almaty)
From this, I only want to keep:
14 Nominal value in issue's currency 100.00
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
19 Date of circulation start 04/21/17
22 Principal repayment date 04/28/17
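I then copy the required data into the initial data frame and move on to the next row... The data frame has more than 100 rows and will keep changing.
Any help would be appreciated.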
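The copy step described above can be sketched as follows. This is only an illustration under stated assumptions: `details` stands in for one scraped table (its values are typed in from the listing above rather than fetched), the dates are assumed to be in month/day/two-digit-year form, and `newd_demo`, `field_map` and `fill_row` are hypothetical names whose columns are R-safe stand-ins for the headers shown earlier.

```r
# Stand-in for one scraped table (values copied from the listing above)
details <- data.frame(
  V1 = c("Nominal value in issue's currency", "Number of bonds outstanding",
         "Issue volume, KZT", "Date of circulation start",
         "Principal repayment date"),
  V2 = c("100.00", "1,929,319,196", "192,931,919,600", "04/21/17", "04/28/17"),
  stringsAsFactors = FALSE
)

# Hypothetical one-row version of the initial data frame
newd_demo <- data.frame(
  NIN = "KZW1KD079112",
  Nominal.Value = NA_real_, Bonds = NA_real_, Issue.Volume = NA_real_,
  Start.Date = as.Date(NA), End.Date = as.Date(NA),
  stringsAsFactors = FALSE
)

# Map scraped labels to data frame columns
field_map <- c(
  "Nominal value in issue's currency" = "Nominal.Value",
  "Number of bonds outstanding"       = "Bonds",
  "Issue volume, KZT"                 = "Issue.Volume",
  "Date of circulation start"         = "Start.Date",
  "Principal repayment date"          = "End.Date"
)

# Fill row i of df from a two-column label/value table
fill_row <- function(df, i, details, field_map) {
  for (label in names(field_map)) {
    col <- field_map[[label]]
    raw <- details$V2[match(label, details$V1)]
    if (is.na(raw)) next  # label not found: leave the NA in place
    if (inherits(df[[col]], "Date")) {
      # assumed format: month/day/two-digit year, e.g. 04/21/17
      df[i, col] <- as.Date(raw, format = "%m/%d/%y")
    } else {
      # strip thousands separators before converting to numeric
      df[i, col] <- as.numeric(gsub(",", "", raw))
    }
  }
  df
}

newd_demo <- fill_row(newd_demo, 1, details, field_map)
```

Looking values up by label rather than by row number (14, 16, 17, ...) means the sketch does not depend on the table always having the same row order.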
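Update
It looks like the data I need is not always in sp[[4]]. Sometimes it is in sp[[7]], and in the future it may be in a completely different table. Is there a way to search the scraped tables for this information and identify the specific table that can then be used to collect the data?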
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
Answer 0 (score: 1)
library(XML)
library(reshape2)
library(dplyr)
name = c(
"NBRK KZW1KD079112 discount notes",
"NBRK KZW1KD079146 discount notes",
"NBRK KZW1KD079153 discount notes",
"NBRK KZW1KD089137 discount notes")
URL = c(
"http://www.kase.kz/en/gsecs/show/NTK007_1911",
"http://www.kase.kz/en/gsecs/show/NTK007_1914",
"http://www.kase.kz/en/gsecs/show/NTK007_1915",
"http://www.kase.kz/en/gsecs/show/NTK008_1913")
# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)
# define the columns wanted and the mapping to desired name
# extend to all wanted columns
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
"Number of bonds outstanding" = "# of Bonds Issue")
# function returns a data frame of wanted columns for given URL
getValues <- function (name, url) {
# get the table and rename columns
sp = readHTMLTable(url, stringsAsFactors = FALSE)
df <- sp[[4]]
names(df) <- c("full_name", "value")
# filter and remap wanted columns
result <- df[df$full_name %in% names(wanted),]
result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
# add the identifier to every row
result$name <- name
return (result[,c("name", "column_name", "value")])
}
# invoke function for each name/URL pair - returns list of data frames
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})
# bind using dplyr:bind_rows to make a tall data frame
tall <- bind_rows(columns)
# make wide using dcast from reshape2
wide <- dcast(tall, name ~ column_name, value.var = "value")
wide
# name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes 1,929,319,196 100.00
# 2 NBRK KZW1KD079146 discount notes 1,575,000,000 100.00
# 3 NBRK KZW1KD079153 discount notes 701,390,693 100.00
# 4 NBRK KZW1KD089137 discount notes 1,380,368,000 100.00
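The code above reads the details from sp[[4]], while the update to the question notes that they are sometimes in sp[[7]]. One possible way to cope is to pick the table by its contents rather than by its position. The following is a minimal sketch, assuming the details table always carries the labels in its first column; `find_details_table` is a hypothetical helper, and `sp_demo` is a made-up stand-in for the list returned by `readHTMLTable`, so the example runs without network access.

```r
# Return the first table whose first column contains all wanted labels,
# or NULL if no table does. `tables` is a list such as readHTMLTable() returns.
find_details_table <- function(tables, labels) {
  for (tbl in tables) {
    # skip anything that is not a data frame with at least two columns
    if (!is.data.frame(tbl) || ncol(tbl) < 2) next
    first_col <- trimws(as.character(tbl[[1]]))
    if (all(labels %in% first_col)) return(tbl)
  }
  NULL
}

# Made-up stand-in for the result of readHTMLTable(url)
sp_demo <- list(
  NULL,
  data.frame(V1 = c("Trading code", "Quotation currency"),
             V2 = c("NTK007_1911", "KZT"), stringsAsFactors = FALSE),
  data.frame(V1 = c("Nominal value in issue's currency",
                    "Number of bonds outstanding"),
             V2 = c("100.00", "1,929,319,196"), stringsAsFactors = FALSE)
)

labels <- c("Nominal value in issue's currency", "Number of bonds outstanding")
df <- find_details_table(sp_demo, labels)
```

Inside `getValues`, the line `df <- sp[[4]]` could then become `df <- find_details_table(sp, names(wanted))`, followed by a check for `NULL` in case a page has no matching table.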