I have the following data frame:
library(rvest)
library(XML)
library(tidyr)
library(zoo)
library(chron)
library(lubridate)
library(stringr)
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
colnames(pbp.201702050atl) = c('Quarter', 'Time', 'Down', 'ToGo', 'Location', 'Detail', 'Away.Score', 'Home.Score', 'EPB', 'EPA', 'Win.pct')
pbp.201702050atl.a = pbp.201702050atl[-union(which(pbp.201702050atl$Quarter == '1st Quarter'), which(pbp.201702050atl$Quarter == 'Quarter')), ]
pbp.201702050atl.b = pbp.201702050atl.a[-union(which(pbp.201702050atl.a$Quarter == '2nd Quarter'), which(pbp.201702050atl.a$Quarter == '3rd Quarter')), ]
pbp.201702050atl.c = pbp.201702050atl.b[-union(which(pbp.201702050atl.b$Quarter == '4th Quarter'), which(pbp.201702050atl.b$Quarter == 'Overtime')), ]
pbp.201702050atl.d = pbp.201702050atl.c[-which(pbp.201702050atl.c$Quarter == 'End of Overtime'), ]
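I want to create a new data frame that splits pbp.201702050atl.d$Location into two columns, so that the alphabetic characters go into one column and the numeric characters into the other, like this: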
V1 V2
1 "ATL" "35"
2 "NWE" "25"
3 "NWE" "34"
4 "NWE" "34"
5 "NWE" "34"
6 "NWE" "34"
7 "ATL" "34"
8 "ATL" "34"
9 "ATL" "34"
10 "" "50"
...
To do this, I wrote:
Location.201702050atl = as.data.frame(str_split_fixed(as.character(pbp.201702050atl.d$Location), boundary("word"), n = 2))
This comes close to what I want, but it produces:
V1 V2
1 "ATL" "35"
2 "NWE" "25"
3 "NWE" "34"
4 "NWE" "34"
5 "NWE" "34"
6 "NWE" "34"
7 "ATL" "34"
8 "ATL" "34"
9 "ATL" "34"
10 "50" ""
...
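Notice Location.201702050atl[10, ]. The function splits purely by position: when the original value contains two groups of characters separated by a space, the first group goes to Location.201702050atl$V1 and the second to Location.201702050atl$V2, but when the value is a single group (here "50"), it lands in V1 and V2 stays empty. Instead, I'd like the text characters to always go into Location.201702050atl$V1 and the numeric characters into Location.201702050atl$V2. How can I split the elements of a column according to the type of their characters, when not every element of the column has the same format? Any help is much appreciated, thank you.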
Answer 0 (score: 1)
If I understand correctly, maybe this can help:
library(data.table)
DT <- data.table(C1=replicate(10, paste0(sample(99,1), paste0(sample(LETTERS,2), collapse = "")) ) )
# Simulating a white space
DT$C1[10] <- "84 ME"
DT
C1
1: 38XT
2: 29XL
3: 24XH
4: 14SC
5: 34SY
6: 80WB
7: 23VB
8: 23WR
9: 19KJ
10: 84 ME
# C1_1 keeps the text (digits and whitespace removed); C1_2 keeps only the digits
DT[, `:=` (C1_1 = gsub("[\\d\\s]", "", C1, perl = TRUE), C1_2 = gsub("[^\\d]", "", C1, perl = TRUE)) ]
DT
C1 C1_1 C1_2
1: 38XT XT 38
2: 29XL XL 29
3: 24XH XH 24
4: 14SC SC 14
5: 34SY SY 34
6: 80WB WB 80
7: 23VB VB 23
8: 23WR WR 23
9: 19KJ KJ 19
10: 84 ME ME 84
If you need to drop the original column, you can do:
DT[, C1:=NULL]
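The same gsub idea can be sketched in base R on values shaped like the question's Location column. The vector `location` below is a small hypothetical sample, not the scraped data, and the text part also has whitespace stripped so that no stray space is left behind:

```r
# Hypothetical sample mimicking the Location column from the question
location <- c("ATL 35", "NWE 25", "NWE 34", "50")

# V1: remove digits and whitespace, leaving only the text part
# V2: remove everything that is not a digit, leaving only the number
location_split <- data.frame(
  V1 = gsub("[\\d\\s]", "", location, perl = TRUE),
  V2 = gsub("[^\\d]", "", location, perl = TRUE),
  stringsAsFactors = FALSE
)

location_split
```

With this sample, the last row ends up with an empty V1 and "50" in V2, which is the layout the question asks for.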
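Note that this regex removes all digits for the first column and all non-digits for the second. It does not take order into account. For example, D7M8 would return DM and 78.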
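A minimal sketch of that caveat, using the D7M8 example. The last step is not part of the answer above: it is one possible base R alternative (an assumption on my part) for cases where the order of the pieces matters, extracting each run of letters or digits separately:

```r
x <- "D7M8"

# Deleting a character class concatenates whatever is left, so order is lost
letters_only <- gsub("[\\d]", "", x, perl = TRUE)   # "DM"
digits_only  <- gsub("[^\\d]", "", x, perl = TRUE)  # "78"

# Order-preserving alternative: pull out each run of letters or digits in turn
runs <- regmatches(x, gregexpr("[A-Za-z]+|[0-9]+", x))[[1]]
runs  # "D" "7" "M" "8"
```

For values like "ATL 35" or "50" the order issue does not arise, so the simpler gsub version is usually enough.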