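I have a data frame called prices that I obtained by web scraping. The goal is to track the daily share prices on the Zimbabwe Stock Exchange.
Scraping the page from the website: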
library(rvest)
library(stringr)
library(reshape2)
# Data from African Financials
url <- "https://africanfinancials.com/zimbabwe-stock-exchange-share-prices/"
prices <- url %>%
read_html() %>%
html_table(fill = T)
prices <- prices[[1]]
The prices data frame:
> prices
Counter PriceRTGS cents Volume ChangeRTGS cents ChangePercent YTDPercent
1 AFDS.zw Afdis 169.75 4 Apr 19 0 0.00 0.00% 10.95%
2 ARIS.zw Ariston 2.90 4 Apr 19 572 -0.03 -1.02% 20.83%
3 ARTD.zw ART Holdings 9.20 4 Apr 19 0 0.00 0.00% 4.55%
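I want to split the "PriceRTGS cents" column into two columns, "Price RTGS Cents" and "Date".
I tried the code below, but it leaves the day of the month (the 4) in the price column.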
str_split_fixed(prices$`PriceRTGS cents`," ", 2)
colsplit(prices$`PriceRTGS cents`," ",c("Price RTGS Cents", "Date"))
I would like the output to look like this:
Counter Price RTGS Cents Date Volume ChangeRTGS cents ChangePercent YTDPercent
1 AFDS.zw Afdis 169.75 4/04/2019 0 0.00 0.00% 10.95%
2 ARIS.zw Ariston 2.90 4/04/2019 572 -0.03 -1.02% 20.83%
3 ARTD.zw ART Holdings 9.20 4/04/2019 0 0.00 0.00% 4.55%
Data (dput output):
structure(list(Counter = c("AFDS.zw Afdis", "ARIS.zw Ariston",
"ARTD.zw ART Holdings", "ASUN.zw Africansun", "AXIA.zw Axia",
"BAT.zw BAT"), `PriceRTGS cents` = c("169.75 4 Apr 19", "2.90 4 Apr 19",
"9.20 4 Apr 19", "15.00 4 Apr 19", "35.05 4 Apr 19", "3,000.00 4 Apr 19"
), Volume = c("0", "572", "0", "0", "8,557", "0"), `ChangeRTGS cents` = c(0,
-0.03, 0, 0, 0, 0), ChangePercent = c("0.00%", "-1.02%", "0.00%",
"0.00%", "0.00%", "0.00%"), YTDPercent = c("10.95%", "20.83%",
"4.55%", "50.00%", "-22.11%", "-9.09%")), row.names = c(NA, 6L
), class = "data.frame")
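Answer 0 (score: 1)
I simply copied and pasted your first price data into a text editor and replaced the separating spaces with ";". (I had not yet seen the dput version of your data.)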
prices <- read.table("dat.txt", sep=";", header=T)
It is fairly quick-and-dirty code, but it works:
str_split_fixed(prices$PriceRTGS.cents," ", 2)
new_prices <- data.frame(prices$Counter, str_split_fixed(prices$PriceRTGS.cents," ", 2), prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date", "Volume", "ChangeRTGS cents", "ChangePercent", "YTDPercent")
new_prices$Date <- gsub("Apr", "04", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
new_prices <- data.frame(prices$Counter, new_prices$PriceRTGS_cents, new_prices$Date, prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date", "Volume", "ChangeRTGS cents", "ChangePercent", "YTDPercent")
new_prices
If you have months other than "Apr", just add another line for each one before replacing the spaces (for example, for "Nov"):
new_prices$Date <- gsub("Nov", "11", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
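As an alternative to one gsub call per month, the date part can be parsed in a single step with as.Date. This is only a sketch: dates_raw is a hypothetical vector standing in for the split-off date strings, and %b depends on the session locale, so it assumes English month abbreviations.

```r
# Hypothetical sample of the date part after the price has been split off
dates_raw <- c("4 Apr 19", "15 Nov 19", "1 Jan 20")

# %d = day of month, %b = abbreviated month name (locale-dependent),
# %y = two-digit year
dates <- as.Date(dates_raw, format = "%d %b %y")

# Reformat as day/month/four-digit year
dates_fmt <- format(dates, "%d/%m/%Y")
dates_fmt
```

Note that format() pads the day with a leading zero, so "4 Apr 19" becomes "04/04/2019" rather than "4/04/2019".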
Answer 1 (score: 1)
Another option. The separator (-), the date format, and the column names can all be changed:
library(dplyr)  # provides %>%, select() and mutate() used below
prices$Prices<-stringr::str_extract_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,}",simplify=T)
prices$Dates<-stringr::str_remove_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,} ")
prices %>%
select(-`PriceRTGS cents`) %>%
mutate(Dates=lubridate::dmy(Dates))
Result:
Counter Volume ChangeRTGS cents ChangePercent YTDPercent Prices Dates
1 AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 2019-04-04
2 ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 2019-04-04
3 ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 2019-04-04
4 ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 2019-04-04
5 AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 2019-04-04
6 BAT.zw BAT 0 0.00 0.00% -9.09% 3,000.00 2019-04-04
Answer 2 (score: 0)
Here is something you can do:
library(data.table)
setDT(dt)
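# Drop the leading price and the whitespace after it, leaving the date
dt[,Date:=sub("^\\S+\\s+", "", `PriceRTGS cents`)]
# Keep only the first run of non-whitespace characters, i.e. the price
dt[,cents:=sub("^\\s*(\\S+).*", "\\1", `PriceRTGS cents`)]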
Note: afterwards, drop the PriceRTGS cents column from dt:
> dt <- subset(dt,select = -c(`PriceRTGS cents`))
> dt
Counter Volume ChangeRTGS cents ChangePercent YTDPercent cents Date
1: AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 4 Apr 19
2: ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 4 Apr 19
3: ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 4 Apr 19
4: ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 4 Apr 19
5: AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 4 Apr 19
6: BAT.zw BAT 0 0.00 0.00% -9.09% 3,000.00 4 Apr 19
Edit: if you want the column as a formatted Date, do this (%y, not %Y, because the year has two digits):
dt[,Date:=format(as.Date(sub("^\\S+\\s+", "", `PriceRTGS cents`),format='%d %b %y'),"%d/%m/%Y")]
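Answer 3 (score: 0)
A base R option is to split on whitespace and build the two parts of the string: the first element is the price, and the remaining elements together form the date.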
t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x)
c(x[1], format(as.Date(paste0(x[-1], collapse = "-"), "%d-%b-%y"), "%d/%m/%Y"))))
# [,1] [,2]
#[1,] "169.75" "04/04/2019"
#[2,] "2.90" "04/04/2019"
#[3,] "9.20" "04/04/2019"
#[4,] "15.00" "04/04/2019"
#[5,] "35.05" "04/04/2019"
#[6,] "3,000.00" "04/04/2019"
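You can then cbind these two columns to the original data frame.
If you are happy to keep the date column without any formatting, you can drop as.Date and format and shorten the code.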
t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x)
c(x[1], paste0(x[-1], collapse = "-"))))
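The cbind step mentioned above could look roughly like the sketch below. The two-row prices data frame is a hypothetical subset of the question's data, and the column names Price and Date are my own choice, not something the answer specifies.

```r
# Small hypothetical subset of the question's data
prices <- data.frame(
  Counter = c("AFDS.zw Afdis", "BAT.zw BAT"),
  `PriceRTGS cents` = c("169.75 4 Apr 19", "3,000.00 4 Apr 19"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Split on whitespace: the first piece is the price, the rest is the date
parts <- t(sapply(strsplit(prices$`PriceRTGS cents`, "\\s+"), function(x)
  c(x[1], paste(x[-1], collapse = " "))))

# Turn the character matrix into a data frame with named columns
parts_df <- data.frame(Price = parts[, 1], Date = parts[, 2],
                       stringsAsFactors = FALSE)

# Drop the original combined column and bind the two new columns on
keep <- setdiff(names(prices), "PriceRTGS cents")
result <- cbind(prices[keep], parts_df)
result
```

Both new columns are still character vectors here; the price would need its comma removed before as.numeric can convert it.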
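Answer 4 (score: 0)
Here you go: this is similar to your str_split_fixed solution. It also removes the commas from the price variable so that it can be coerced to numeric, and it formats the date column.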
split_string <- str_split(prices$`PriceRTGS cents`, regex("\\s"), 2, simplify = T)
prices$Price <- as.numeric(gsub(",", "", split_string[,1], fixed = T))
prices$Date <- as.Date(split_string[,2], format = "%d %b %y")
head(prices[-2])
Counter Volume ChangeRTGS cents ChangePercent YTDPercent Price Date
1 AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 2019-04-04
2 ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 2019-04-04
3 ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 2019-04-04
4 ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 2019-04-04
5 AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 2019-04-04
6 BAT.zw BAT 0 0.00 0.00% -9.09% 3000.00 2019-04-04
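The problem with the fixed-string solution is that the whitespace character after the price is not a plain space, so a fixed " " pattern does not match it, i.e.: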
table(str_count(prices$`PriceRTGS cents`, fixed(" ")))
2
55
The regular expression \\s, however, does match that whitespace character, i.e.:
table(str_count(prices$`PriceRTGS cents`, regex("\\s")))
3
55
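One plausible explanation for the differing counts is a non-breaking space (U+00A0) after the price, which is common in scraped HTML. That is an assumption, not something the data above confirms. A minimal base R sketch for checking and normalising it, using a hypothetical string x:

```r
# Hypothetical value with a non-breaking space (U+00A0) after the price
x <- "169.75\u00a04 Apr 19"

# Code point of the 7th character: 160 is U+00A0, 32 would be a plain space
code_point <- utf8ToInt(substr(x, 7, 7))

# Replace non-breaking spaces with plain spaces
clean <- gsub("\u00a0", " ", x, fixed = TRUE)

# Everything before the first space is the price, everything after is the date
price <- sub(" .*$", "", clean)
date  <- sub("^[^ ]* ", "", clean)
c(price, date)
```

After this normalisation, splitting on a plain " " as in the question's original str_split_fixed call should separate the price from the date.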