我正在尝试使用R中的正则表达式提取一些财务数据。
我使用了一个RegEx测试器http://regexr.com/来制作一个正则表达式,该表达式应该捕获我需要的信息 - 问题只是它没有......
我从此网址中提取了数据:http://finance.yahoo.com/q/cp?s=%5EOMXC20+Components
我想匹配公司名称(DANSKE.CO,DSV.CO等),我创建了以下正则表达式,在regexr.com上匹配它:
#blue {
width: 100%;
}
.twitter-iframe {
width: 30%;
}
但它在R中不起作用。有人可以帮我弄清楚如何解决这个问题吗?
答案 0 :(得分:2)
不要乱用正则表达式,而是使用XPath来获取HTML内容:
library("XML")
f <- tempfile()
download.file("https://finance.yahoo.com/q/cp?s=^OMXC20+Components", f)
doc <- htmlParse(f)
xpathSApply(doc, "//b/a", xmlValue)
# [1] "CARL-B.CO" "CHR.CO" "COLO-B.CO" "DANSKE.CO" "DSV.CO"
# [6] "FLS.CO" "GEN.CO" "GN.CO" "ISS.CO" "JYSK.CO"
# [11] "MAERSK-A.CO" "MAERSK-B.CO" "NDA-DKK.CO" "NOVO-B.CO" "NZYM-B.CO"
# [16] "PNDORA.CO" "TDC.CO" "TRYG.CO" "VWS.CO" "WDH.CO"
答案 1 :(得分:0)
这有帮助吗?如果没有,请回复,我会提供另一个建议。
library(XML)
stocks <- c("AXP","BA","CAT","CSCO")
for (s in stocks) {
url <- paste0("http://finviz.com/quote.ashx?t=", s)
webpage <- readLines(url)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
# ASSIGN TO STOCK NAMED DFS
assign(s, readHTMLTable(tableNodes[[9]],
header= c("data1", "data2", "data3", "data4", "data5", "data6",
"data7", "data8", "data9", "data10", "data11", "data12")))
# ADD COLUMN TO IDENTIFY STOCK
df <- get(s)
df['stock'] <- s
assign(s, df)
}
# COMBINE ALL STOCK DATA
stockdatalist <- cbind(mget(stocks))
stockdata <- do.call(rbind, stockdatalist)
# MOVE STOCK ID TO FIRST COLUMN
stockdata <- stockdata[, c(ncol(stockdata), 1:ncol(stockdata)-1)]
# SAVE TO CSV
write.table(stockdata, "C:/Users/rshuell001/Desktop/MyData.csv", sep=",",
row.names=FALSE, col.names=FALSE)
# REMOVE TEMP OBJECTS
rm(df, stockdatalist)