Splitting a column by character type in R

Time: 2017-03-10 09:40:17

Tags: r split strsplit

I have the following data frame:

library(rvest)
library(XML)
library(tidyr)
library(zoo)
library(chron)
library(lubridate)
library(stringr)
# Download the box-score page; the play-by-play table sits inside an HTML comment
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
# Re-parse comment 45 as HTML and read the table with id "pbp"
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
colnames(pbp.201702050atl) = c('Quarter', 'Time', 'Down', 'ToGo', 'Location', 'Detail', 'Away.Score', 'Home.Score', 'EPB', 'EPA', 'Win.pct')
# Drop the rows that are section markers (quarter and overtime labels) rather than plays
pbp.201702050atl.a = pbp.201702050atl[-union(which(pbp.201702050atl$Quarter == '1st Quarter'), which(pbp.201702050atl$Quarter == 'Quarter')), ]
pbp.201702050atl.b = pbp.201702050atl.a[-union(which(pbp.201702050atl.a$Quarter == '2nd Quarter'), which(pbp.201702050atl.a$Quarter == '3rd Quarter')), ]
pbp.201702050atl.c = pbp.201702050atl.b[-union(which(pbp.201702050atl.b$Quarter == '4th Quarter'), which(pbp.201702050atl.b$Quarter == 'Overtime')), ]
pbp.201702050atl.d = pbp.201702050atl.c[-which(pbp.201702050atl.c$Quarter == 'End of Overtime'), ]

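I want to create a new data frame that splits pbp.201702050atl.d$Location into two columns, so that the letter elements form one column and the numeric elements form the other, like this: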

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   ""    "50"
...

To do this, I wrote:

Location.201702050atl = as.data.frame(str_split_fixed(as.character(pbp.201702050atl.d$Location), boundary("word"), n = 2))

Although this is close to what I want, the call produces:

     V1    V2
1    "ATL" "35"
2    "NWE" "25"
3    "NWE" "34"
4    "NWE" "34"
5    "NWE" "34"
6    "NWE" "34"
7    "ATL" "34"
8    "ATL" "34"
9    "ATL" "34"
10   "50"  ""
...

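Notice Location.201702050atl[10, ]. This function only puts characters in Location.201702050atl$V2 if, for that row, the original column contains two groups of characters separated by a space. Instead, I would like the text characters to go in Location.201702050atl$V1 and the numeric characters to go in Location.201702050atl$V2. How can I split the elements of a column according to the natural format of their characters, when the whole column otherwise has to be handled with a single format? Thank you very much for your help.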

1 Answer:

Answer 0: (score: 1)

If I understand correctly, maybe this can help:

library(data.table)
DT <- data.table(C1=replicate(10, paste0(sample(99,1), paste0(sample(LETTERS,2), collapse = "")) ) )
# Simulating a white space
DT$C1[10] <- "84 ME"
DT
    C1
 1:  38XT
 2:  29XL
 3:  24XH
 4:  14SC
 5:  34SY
 6:  80WB
 7:  23VB
 8:  23WR
 9:  19KJ
10: 84 ME
# C1_1: delete every digit (letters and any spaces remain); C1_2: delete every non-digit
DT[, `:=` (C1_1 = gsub("[\\d]", "", C1, perl = TRUE), C1_2 = gsub("[^\\d]", "", C1, perl = TRUE)) ]
DT
       C1 C1_1 C1_2
 1:  38XT   XT   38
 2:  29XL   XL   29
 3:  24XH   XH   24
 4:  14SC   SC   14
 5:  34SY   SY   34
 6:  80WB   WB   80
 7:  23VB   VB   23
 8:  23WR   WR   23
 9:  19KJ   KJ   19
10: 84 ME   ME   84
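
The same idea can be sketched in base R on values shaped like the question's Location column. The vector `loc` below is made up to mimic entries such as "ATL 35" and "50"; it is not the scraped data, and the column names V1 and V2 simply follow the question's desired output. Deleting everything that is not a letter, rather than only the digits, also removes the separating space, which a digits-only deletion would leave behind (for example "ATL " with a trailing space).

```r
# Hypothetical sample mimicking pbp.201702050atl.d$Location (not the real data)
loc <- c("ATL 35", "NWE 25", "NWE 34", "50")

# V1: keep letters only (team abbreviation, "" when there is none)
# V2: keep digits only (yard line)
Location <- data.frame(
  V1 = gsub("[^A-Za-z]", "", loc),
  V2 = gsub("[^0-9]", "", loc),
  stringsAsFactors = FALSE
)
Location
```

With this sample, the last row comes out with an empty V1 and "50" in V2, which is the layout the question asks for.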

If you need to remove the original column, you can do:

DT[, C1:=NULL]

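Note that this regex deletes all digits for the first column and all non-digits for the second. It does not take order into account. For example, D7M8 would yield "DM" and "78".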
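
If order does matter, one possible alternative is to extract the first run of letters and the first run of digits instead of deleting characters. The following is a minimal sketch in base R using regexpr() and regmatches(); the helper name `first_match` and the sample vector `x` are hypothetical and only serve as illustration.

```r
# Hypothetical helper: first match of `pattern` in each element of `x`,
# or "" for elements that contain no match
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x)           # position of the first match, -1 if none
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)   # regmatches() returns matched elements only
  out
}

x <- c("ATL 35", "50", "D7M8")
letters_part <- first_match("[A-Za-z]+", x)  # "ATL" ""   "D"
digits_part  <- first_match("[0-9]+", x)     # "35"  "50" "7"
```

Here "D7M8" gives "D" and "7" rather than "DM" and "78", because only the first contiguous run of each kind is taken.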