我需要(非手动)下载此文件并将内容转换为data.frame,忽略几行的功能将非常有用。 我专门研究R或Python中的解决方案。
文件本身可以从以下网址获取:
http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en
这是我到目前为止所做的事情:
Error: IllegalArgumentException (Java): Your InputStream was neither an OLE2 stream, nor an OOXML stream
)Error in odbcConnectExcel("xl.file") :
odbcConnectExcel is only usable with 32-bit Windows
)XLRDError: Unsupported format or corrupt file
)Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, : Intermediate file '...' missing!
)如果您在记事本中打开文件,则它是一个xml文件,在Excel中打开时,会收到一条警告消息"格式和扩展名不匹配"。
我可以探索自己的想法也很有用,如果你没有答案,请发表评论。
到目前为止,我使用XML / regex进行了尝试:
library(XML)
library(stringr)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", destfile = "horizons.xls")
doc <- readLines(con = "horizons.xls")
doc <- str_extract(doc,"<Table[^>]*>(.*?)</Table>")
doc <- xmlParse(doc)
listing <- xpathApply(doc, "//Row", xmlToDataFrame)
listing <- listing[4:length(listing)]
listing <- do.call(rbind,lapply(listing, t))[,6:16]
listing[,3:11] <- gsub("[^-.0-9]", "", listing[,3:11])
listing <- as.data.frame(listing, row.names = NULL,stringsAsFactors = FALSE,)
listing$V1 <- str_replace_all(listing$V1, "[^a-zA-Z0-9]", " ")
listing[5:11] <- lapply(listing[5:11],as.numeric)
names(listing) <- c(
"Product Name",
"Ticker",
"Class",
"Price",
"Price % Change",
"Volume",
"NAV/unit",
"NAV % Change",
"% Prem/Disc",
"Outst. Shares"
)
答案 0 :(得分:2)
也许可以在R中做到这一点:
library(XML)
download.file("http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en", file.path(tempdir(), "xls.xml"))
doc <- xmlParse(file.path(tempdir(), "xls.xml"))
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//ss:Row", "ss")[-(1:2)], stringsAsFactors = FALSE)
names(df) <- unlist(df[1, ], use.names = F); df <- df[-1, ] # put first row as col header and delete it
head(df)
# # # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit NAV % Change % Prem/Disc Outst. Shares
# 2 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302 1.9621 -0.11 5675671
# 3 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539 1.2312 -0.20 5675671
# 4 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144 0.9095 0.16 22019328
# 5 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842 0.1864 0.21 22019328
# 6 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441 0.6578 -0.30 902485
# 7 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804 0.1395 -0.23 500440
答案 1 :(得分:1)
这可能不是最好的方法,但有助于提出一些建议。
require("XML")
myfile1 <- download.file(http://horizons.prod.transmissionmedia.ca/GetDailyFundSummaryExcel.aspx?lang=en)
doc <- xmlParse(myfile1)
root_doc <- xmlRoot(doc)
identify_worksheet <- c()
for (i in 1:xmlSize(root_doc)){
identify_worksheet <- c(identify_worksheet, xmlName(root_doc[[i]]) == "Worksheet")
}
worksheet_index = which(identify_worksheet == TRUE)
name1 <- xmlSApply(root_doc[[worksheet_index]], xmlName)
row_size <- xmlSize(root_doc[[worksheet_index]][[name1]])
col_size = max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize))
row_index = which(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize) == max(xmlSApply(root_doc[[worksheet_index]][[name1]], xmlSize)))
df1 <- data.frame(matrix(nrow = length(row_index)-1, ncol = col_size), stringsAsFactors = FALSE)
colnames(df1) <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[1]]])
for(i in 2:length(row_index)){
df_index = i-1
df1[df_index,] <- getChildrenStrings(root_doc[[worksheet_index]][[name1]][[row_index[i]]])
}
View(df1)
df2 <- df1[4:ncol(df1)]
View(df2)
以xml格式识别Excel工作表中的名称。我想知道,以下名称是xml格式的Excel工作表的标准,并且当有多个工作表时,工作表的名称是否以数字作为后缀递增(例如:Worksheet1,Worksheet2等等。 )。
xmlName(root_doc)
[1] "Workbook"
xmlName(root_doc[[1]])
[1] "DocumentProperties"
xmlName(root_doc[[2]])
[1] "Styles"
xmlName(root_doc[[3]])
[1] "Worksheet"
输出
head(df1)
# # Language ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume
1 1 1 en INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223
2 2 2 en Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769
3 3 3 en Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656
4 4 4 en Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0
5 5 5 en Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200
6 6 6 en Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900
NAV/unit NAV % Change % Prem/Disc Outst. Shares
1 47.4302 1.9621 -0.11 5675671
2 37.3539 1.2312 -0.20 5675671
3 27.9144 0.9095 0.16 22019328
4 21.9842 0.1864 0.21 22019328
5 21.5441 0.6578 -0.30 902485
6 30.0804 0.1395 -0.23 500440
head(df2)
ETF Type Subtype Product Name Ticker Class Closing Date Price Price % Change Volume NAV/unit
1 INDEX AND BENCHMARK Equities — Large Cap Horizons S&P 500® Index ETF HXS 2015-03-30 47.3800 2.09 314223 47.4302
2 Horizons S&P 500® Index ETF (US$) HXS.U 2015-03-30 37.2800 -0.19 52769 37.3539
3 Horizons S&P/TSX 60™ Index ETF HXT 2015-03-30 27.9600 0.98 372656 27.9144
4 Horizons S&P/TSX 60™ Index ETF (US$) HXT.U 2015-03-30 22.0300 -0.56 0 21.9842
5 Horizons S&P/TSX Capped Energy Index ETF HXE 2015-03-30 21.4800 0.00 1200 21.5441
6 Horizons S&P/TSX Capped Financials Index ETF HXF 2015-03-30 30.0100 0.00 900 30.0804
NAV % Change % Prem/Disc Outst. Shares
1 1.9621 -0.11 5675671
2 1.2312 -0.20 5675671
3 0.9095 0.16 22019328
4 0.1864 0.21 22019328
5 0.6578 -0.30 902485
6 0.1395 -0.23 500440