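I'm using a web scraper to pull some data from FinViz. Here is an example. The problem is that the data frame is messy: the first column contains what I would ideally like as headers, and the second column contains the corresponding data. This is the output: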
data1 data2 data3 data4 data5 data6 data7 data8 data9 data10
1 Index S&P 500 P/E 36.13 EPS (ttm) 4.60 Insider Own 0.10% Shs Outstand 2.93B
2 Market Cap 487.15B Forward P/E 25.65 EPS next Y 6.48 Insider Trans -86.95% Shs Float 2.33B
3 Income 13.58B PEG 1.36 EPS next Q 1.27 Inst Own 72.50% Short Float 0.87%
4 Sales 33.17B P/S 14.69 EPS this Y 170.20% Inst Trans -0.22% Short Ratio 1.13
5 Book/sh 22.92 P/B 7.26 EPS next Y 21.63% ROA 20.30% Target Price 192.62
6 Cash/sh 12.10 P/C 13.74 EPS next 5Y 26.57% ROE 22.50% 52W Range 113.55 - 175.49
7 Dividend - P/FCF 34.05 EPS past 5Y 62.10% ROI 17.10% 52W High -5.23%
8 Dividend % - Quick Ratio 12.30 Sales past 5Y 49.40% Gross Margin 86.60% 52W Low 46.47%
9 Employees 20658 Current Ratio 12.30 Sales Q/Q 44.80% Oper. Margin 46.40% RSI (14) 49.05
10 Optionable Yes Debt/Eq 0.00 EPS Q/Q 68.80% Profit Margin 40.90% Rel Volume 0.70
11 Shortable Yes LT Debt/Eq 0.00 Earnings Jul 26 AMC Payout 0.00% Avg Volume 17.87M
12 Recom 1.70 SMA20 -1.84% SMA50 2.85% SMA200 17.52% Volume 12,583,873
As you can see, data1 contains the categories and data2 contains the corresponding values.
Ideally, I would like it in this structure:
Index | Market Cap | Income | Sales | Book sh | ...
------------------------------------------------
S&P500 | 487.15B | 13.58B | 33.17B | 22.92 |
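So data1, 3, 5, 7 would all be headers, and data2, 4, 6, 8 would all sit in a single row.
Can anyone offer any input? I'm trying to avoid compiling them into 2 different vectors and then binding the frames together.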
Cheerio!
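Answer 0 (score: 0)
You could try:
library(data.table); library(dplyr)
# Cast each header/value pair of columns to a single wide row.
# Note: the ". ~" formula also yields a placeholder column named "."
table1 <- df[, 1:2] %>% as.data.table() %>% dcast(. ~ data1, value.var = "data2")
table2 <- df[, 3:4] %>% as.data.table() %>% dcast(. ~ data3, value.var = "data4")
cbind(table1, table2)
and so on for the remaining column pairs.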
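The "and so on" step can be written as a loop instead of one line per pair. The following is only a sketch: df here is a small hypothetical stand-in with the same layout as the scraped frame, and the names header_cols and pieces are made up for illustration. One caveat: if a header is repeated within a single column (as "EPS next Y" is in data5 of the question), dcast falls back to counting rows, so duplicates would need renaming first.

```r
library(data.table)

# Hypothetical stand-in for the scraped frame (same layout, fewer rows)
df <- data.frame(
  data1 = c("Index", "Market Cap"), data2 = c("S&P 500", "487.15B"),
  data3 = c("P/E", "Forward P/E"),  data4 = c("36.13", "25.65"),
  stringsAsFactors = FALSE
)

header_cols <- seq(1, ncol(df), by = 2)  # odd positions hold the headers

# Cast every (header, value) pair of columns to one wide row
pieces <- lapply(header_cols, function(i) {
  pair <- as.data.table(df[, c(i, i + 1)])
  setnames(pair, c("header", "value"))
  cast <- as.data.frame(dcast(pair, . ~ header, value.var = "value"))
  cast[, -1, drop = FALSE]  # drop the "." placeholder column added by dcast
})

wide <- do.call(cbind, pieces)
```

Within each pair, dcast orders the new columns alphabetically by header, so the column order of wide will not match the row order of the original table.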
Answer 1 (score: 0)
Would this work?
data <- data.frame(data1 = letters[1:10], data2 = LETTERS[1:10], data3 = letters[11:20], data4 = LETTERS[11:20], stringsAsFactors = FALSE)
# data1 data2 data3 data4
# 1 a A k K
# 2 b B l L
# 3 c C m M
# 4 d D n N
# 5 e E o O
# 6 f F p P
# 7 g G q Q
# 8 h H r R
# 9 i I s S
# 10 j J t T
# Even-numbered columns supply the values, odd-numbered columns the names
output <- setNames(
  data.frame(t(unlist(data[!as.logical(seq_along(data) %% 2)]))),
  unlist(data[as.logical(seq_along(data) %% 2)]))
# a b c d e f g h i j k l m n o p q r s t
# 1 A B C D E F G H I J K L M N O P Q R S T
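If this is needed for more than one table, the same idea could be wrapped in a small function. This is a sketch only; pairs_to_row is a hypothetical name, not something from a package, and it assumes the columns strictly alternate header, value, header, value.

```r
# Hypothetical helper: odd columns become the names, even columns the values
pairs_to_row <- function(data) {
  is_header <- seq_along(data) %% 2 == 1
  headers <- unlist(data[is_header], use.names = FALSE)
  values  <- unlist(data[!is_header], use.names = FALSE)
  stopifnot(length(headers) == length(values))
  # Build a one-row data frame, then overwrite the auto-generated names
  setNames(as.data.frame(as.list(values), stringsAsFactors = FALSE), headers)
}

example <- data.frame(
  data1 = c("a", "b"), data2 = c("A", "B"),
  data3 = c("k", "l"), data4 = c("K", "L"),
  stringsAsFactors = FALSE
)
out <- pairs_to_row(example)
```

Because the names are set with setNames rather than through data.frame(), headers such as "Market Cap" or "P/E" are kept as they are instead of being converted to syntactic names.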
Answer 2 (score: 0)
Here is a solution using some tidyverse packages and your dataset.
library(rvest) # for scraping the data
#> Loading required package: xml2
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr, warn.conflicts = FALSE)
First, we fetch your data directly from your example URL.
tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
  html_node("table.snapshot-table2") %>%
  html_table(header = FALSE) %>%
  as_tibble() # as_data_frame() is deprecated in favor of as_tibble()
tab
#> # A tibble: 12 x 12
#> X1 X2 X3 X4 X5 X6
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Index DJIA S&P500 P/E 20.77 EPS (ttm) 11.42
#> 2 Market Cap 141.89B Forward P/E 22.14 EPS next Y 10.71
#> 3 Income 7.12B PEG 1.13 EPS next Q 2.62
#> 4 Sales 90.90B P/S 1.56 EPS this Y 2.30%
#> 5 Book/sh -3.34 P/B - EPS next Y 7.28%
#> 6 Cash/sh 17.26 P/C 13.74 EPS next 5Y 18.36%
#> 7 Dividend 5.68 P/FCF 17.94 EPS past 5Y 7.40%
#> 8 Dividend % 2.39% Quick Ratio 0.40 Sales past 5Y 6.60%
#> 9 Employees 150500 Current Ratio 1.20 Sales Q/Q -8.10%
#> 10 Optionable Yes Debt/Eq - EPS Q/Q 885.50%
#> 11 Shortable Yes LT Debt/Eq - Earnings Jul 26 BMO
#> 12 Recom 2.20 SMA20 -0.16% SMA50 8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> # X11 <chr>, X12 <chr>
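Since the headers are in every odd column and the data in every even column, we create a tidy two-column tibble by row-binding subsets of the table. For that, we generate the odd and even column indices. Then purrr::map2_dfr lets us iterate over these two index vectors in parallel, apply a function to each pair, and row-bind the results. The function selects 2 columns of tab with [ ] and renames those two columns with set_names.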
col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)
tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#> header value
#> <chr> <chr>
#> 1 Index DJIA S&P500
#> 2 Market Cap 141.89B
#> 3 Income 7.12B
#> 4 Sales 90.90B
#> 5 Book/sh -3.34
#> 6 Cash/sh 17.26
#> 7 Dividend 5.68
#> 8 Dividend % 2.39%
#> 9 Employees 150500
#> 10 Optionable Yes
#> # ... with 62 more rows
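To see the pairing on something smaller, here is a sketch on a made-up four-column tibble (toy and long are hypothetical names and values). Note that map2_dfr relies on dplyr being installed for the row-binding step.

```r
library(purrr)
library(tibble)

# Made-up table with the same header/value alternation
toy <- tibble(
  X1 = c("a", "b"), X2 = c("1", "2"),
  X3 = c("c", "d"), X4 = c("3", "4")
)

odd  <- seq(1, ncol(toy), by = 2)  # header columns: 1, 3
even <- seq(2, ncol(toy), by = 2)  # value columns:  2, 4

# Pair column 1 with 2 and column 3 with 4, then stack the pieces
long <- map2_dfr(odd, even, ~ set_names(toy[, c(.x, .y)], c("header", "value")))
```

The result has one row per header/value cell: 2 rows times 2 pairs here, which is the same logic that gives 12 times 6 = 72 rows for the scraped table.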
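You now have a nice two-column long table containing all your data. If you want the table in wide format instead of long format, you have to transpose it. But first, we must deal with some duplicated names in the header column, because column names cannot be repeated.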
tab2 %>%
  filter(header %in% header[duplicated(header)]) # %in% also works if several headers are duplicated
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%
We simply rename the duplicate by appending _2.
tab3 <- tab2 %>%
  mutate(header = case_when(
    duplicated(header) ~ paste(header, 2, sep = "_"),
    TRUE ~ header)
  )
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#> header value
#> <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
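If a header could ever appear more than twice, appending a fixed _2 would create new duplicates. As an alternative sketch (not what the answer above uses), base R's make.unique numbers every repeat; the headers vector below is made up for illustration.

```r
# Made-up header vector with one name repeated three times
headers <- c("EPS next Y", "Sales", "EPS next Y", "EPS next Y")

# The first occurrence keeps its name; later ones get _1, _2, ...
unique_headers <- make.unique(headers, sep = "_")
unique_headers
#> [1] "EPS next Y"   "Sales"        "EPS next Y_1" "EPS next Y_2"
```

Note the numbering differs from the approach above: the second occurrence gets _1 rather than _2.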
You can now switch to wide format and get 72 columns instead of 72 rows.
tab3 %>%
  spread(header, value)
#> # A tibble: 1 x 72
#> `52W High` `52W Low` `52W Range` ATR `Avg Volume` Beta `Book/sh`
#> * <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 -3.78% 87.78% 126.31 - 246.49 3.77 3.46M 1.18 -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> # Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> # Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> # 5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> # Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> # `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> # `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> # Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> # Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> # `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> # `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> # Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> # Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> # `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> # Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> # Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> # `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> # Price` <chr>, Volatility <chr>, Volume <chr>
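In tidyr 1.0.0 and later, pivot_wider is the recommended successor to spread. A minimal sketch on a made-up three-row long table (the long and wide names are hypothetical):

```r
library(tidyr)
library(tibble)

# Made-up long table in the same header/value shape
long <- tibble(
  header = c("Index", "Market Cap", "Income"),
  value  = c("DJIA S&P500", "141.89B", "7.12B")
)

# One column per header, values filled from the value column
wide <- pivot_wider(long, names_from = header, values_from = value)
```

Unlike spread, which sorts the new columns alphabetically, pivot_wider keeps them in order of first appearance, so the wide table follows the original header order.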
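Idea: you could also replace all spaces in the header column with _ so that the column names contain no spaces. That is usually easier to work with.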
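A sketch of that replacement with base R's gsub, shown on a made-up vector of headers (in the pipeline above it would presumably go inside a mutate on the header column):

```r
# Made-up header values
headers <- c("Market Cap", "EPS next Y", "Index")

# fixed = TRUE treats " " as a literal string rather than a regex
clean_headers <- gsub(" ", "_", headers, fixed = TRUE)
clean_headers
#> [1] "Market_Cap" "EPS_next_Y" "Index"
```

Headers such as P/E or RSI (14) would still need backticks as column names, since this only removes the spaces.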