String replacement in a table

Time: 2017-06-16 15:28:33

Tags: r

I am trying to scrape the CoinMarketCap price table. I did:

url_cmp <- "https://coinmarketcap.com/currencies/views/all/"
library(rvest)

url_cmp %>%
  read_html() %>%
  html_nodes(css = "table") %>%
  html_table() -> tbl_cmp

Now I have the whole table and want to clean it. I want to remove all $, %, comma and \n characters from the table. I tried:

stringr::str_replace_all(string = tbl_cmp, pattern = "\\\n|\\s|[%*$,]", replacement = "")

gsub(pattern = "\\\n|\\s|[%*$,]", replacement = "", x = tbl_cmp)

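Both do the replacement, but the table format is not preserved; I get one long string. I know that str_replace() and gsub() both take strings as input. Is there a workaround for tables?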
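A small offline sketch of why the table shape is lost, assuming the cause is that gsub() coerces a non-character x with as.character(). The toy data frame and the names toy, tbl and out are made up for illustration; they stand in for the scraped list of data frames.

```r
# Toy stand-in for the scraped result: html_table() on a node set returns a
# list of data frames, so the object passed to gsub() is a list.
toy <- data.frame(Price = c("$1", "$2"), stringsAsFactors = FALSE)
tbl <- list(toy)

# gsub() coerces x with as.character() before matching, which collapses
# each list element (here: the whole data frame) into a single string.
out <- gsub(pattern = "[$]", replacement = "", x = tbl)

class(out)   # a plain character vector, no longer a data frame
length(out)  # one element per list entry, not one per table cell
```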

1 Answer:

Answer 0: (score: 1)

new_df <- tbl_cmp[[1]] %>% sapply(gsub,pattern = "\\\n|\\s|[%*$,]", replacement = "") %>% as.data.frame(stringsAsFactors = FALSE)
num_cols <- names(new_df)[-(2:3)]
conv_col_to_num <- function(x){if(x %in% num_cols) new_df[[x]] %>% as.numeric %>% data.frame else new_df[[x]] %>% data.frame}
new_df_num <-
  new_df %>% 
  names %>% 
  lapply(conv_col_to_num) %>%
  do.call(cbind,.) %>%
  setNames(names(new_df))

# > head(new_df_num)
#   #            Name Symbol  Market Cap       Price Circulating Supply Volume (24h)  % 1h % 24h   % 7d
# 1 1         Bitcoin    BTC 40690438752 2481.970000           16394412   1406060000  0.03  8.06 -12.62
# 2 2        Ethereum    ETH 33960266690  367.000000           92535795   1554170000 -0.02 15.09  35.91
# 3 3          Ripple    XRP 10036645930    0.262120        38290271363    109902000 -0.04  4.58  -9.87
# 4 4             NEM    XEM  1784268000    0.198252         8999999999      7966520  0.60 13.42 -10.20
# 5 5 EthereumClassic    ETC  1687441250   18.210000           92656987    108333000 -0.28  8.45   4.40
# 6 6        Litecoin    LTC  1649935702   31.990000           51575157    365949000  2.24 14.14   6.57

# > str(new_df_num)
# 'data.frame':  754 obs. of  10 variables:
# $ #                 : num  1 2 3 4 5 6 7 8 9 10 ...
# $ Name              : Factor w/ 751 levels "1337","1CRedit",..: 74 245 558 446 246 395 179 358 90 616 ...
# $ Symbol            : Factor w/ 751 levels "","1337","1CR",..: 98 236 718 695 235 377 169 402 106 579 ...
# $ Market Cap        : num  4.07e+10 3.40e+10 1.00e+10 1.78e+09 1.69e+09 ...
# $ Price             : num  2481.97 367 0.262 0.198 18.21 ...
# $ Circulating Supply: num  1.64e+07 9.25e+07 3.83e+10 9.00e+09 9.27e+07 ...
# $ Volume (24h)      : num  1.41e+09 1.55e+09 1.10e+08 7.97e+06 1.08e+08 ...
# $ % 1h              : num  0.03 -0.02 -0.04 0.6 -0.28 2.24 0.09 1.5 -0.29 -0.71 ...
# $ % 24h             : num  8.06 15.09 4.58 13.42 8.45 ...
# $ % 7d              : num  -12.62 35.91 -9.87 -10.2 4.4 ...

Note: I added code so that you end up with a properly formatted data.frame (with numeric columns).
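For comparison, here is a shorter variant of the same idea, not the code used above: assigning the result of lapply() back into the data frame with [] keeps the data.frame shape, so no rebuild with cbind is needed. This is a sketch on a made-up toy table (the values and the names toy and num_cols are illustrative only), assuming the numeric columns are known in advance.

```r
# Toy stand-in for tbl_cmp[[1]]; the values are made up for illustration.
toy <- data.frame(
  Name  = c("Bit coin", "Ethereum"),
  Price = c("$2,481.97", "$367.00"),
  Pct   = c("8.06%", "-12.62%"),
  stringsAsFactors = FALSE
)

# lapply() visits the data frame column by column; assigning into toy[]
# replaces the column contents while keeping the data.frame structure.
toy[] <- lapply(toy, function(col) {
  gsub(pattern = "\\s|[%*$,]", replacement = "", x = col)
})

# Convert only the columns that are expected to hold numbers.
num_cols <- c("Price", "Pct")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

str(toy)
```

The text columns stay character here instead of becoming factors, which may or may not be what you want.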

I tried to be more consistent with the pipes, and replaced apply with sapply (see comments).

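I think (I am not sure) that apply converts to a matrix at the input, while sapply only converts at the output, so apply would fail if my function had to work with numbers (it does not).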
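The difference can be checked on a small mixed-type data frame. This is a minimal sketch; the names mixed, apply_classes and sapply_classes are made up for the example.

```r
# One numeric and one character column.
mixed <- data.frame(n = c(1.5, 2.5), s = c("a", "b"), stringsAsFactors = FALSE)

# apply() calls as.matrix() on a data frame first. With mixed column types
# the matrix is character, so the function sees strings for every column.
apply_classes <- apply(mixed, 2, class)

# sapply() iterates over the columns as they are, so each column keeps its
# own type; only the collected result is simplified afterwards.
sapply_classes <- sapply(mixed, class)

print(apply_classes)
print(sapply_classes)
```

So a function that needs real numbers would receive strings under apply, while under sapply it receives the numeric column unchanged.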

The "?" values are converted to NAs, hence the warnings.
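If the warnings are unwanted, one option is to mark the placeholder as missing before converting. A minimal sketch on a made-up vector (price_raw and the other names are illustrative, not part of the code above):

```r
# Made-up sample of a cleaned column that still contains the "?" placeholder.
price_raw <- c("2481.97", "?", "0.262120")

# Direct conversion turns "?" into NA and warns about coercion;
# the warning is silenced here only to keep the example quiet.
price_warn <- suppressWarnings(as.numeric(price_raw))

# Replacing the placeholder with NA first makes the intent explicit,
# and as.numeric() then converts without a warning.
price_raw[price_raw == "?"] <- NA
price_num <- as.numeric(price_raw)

print(price_num)
```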