How can I conditionally split a string in R?

Asked: 2014-03-20 21:15:22

Tags: r

I want to split a string into multiple columns based on a number of conditions.

A sample of my data:

Col1<- c("01/05/2004 02:59", "01/05/2004 05:04", "01/06/2004 07:19", "01/07/2004 02:55", "01/07/2004 04:32", "01/07/2004 04:38", "01/07/2004 17:13", "01/07/2004 18:40", "01/07/2004 20:58", "01/07/2004 23:39", "01/09/2004 13:28")

Col2<- c("Wabamun #4 off line.", "Keephills #2 on line.", "Wabamun #1 on line.", "North Red Deer T217s bus lock out.  Under investigation.",  "T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)", "Keephills #2 off line.", "Wabamun #4 on line.", "Sundance #1 off line.", "Keephills #2 on line", "Homeland security event lowered to yellow ( elevated)")

df<- data.frame(Col1,Col2)

I would like to be able to split the second column (Col2) conditionally,

to get something like this:

Col3<- c("Wabamun #4", "Keephills #2", "Wabamun #1", "General Asset", "General Asset", "General Asset", "Keephills #2", "Wabamun #4", "Sundance #1", "Keephills #2", "General Asset") 

Col4<- c("off line.", "on line.", "on line.", "North Red Deer T217s bus lock out.  Under investigation.",  "T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)", "off line.", "on line.", "off line.", "on line", "Homeland security event lowered to yellow ( elevated)")

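After that, I plan to find the time between an asset going off line and coming back on line. These are usually power plants, so I will also look at each plant's capacity. For example, Keephills #2 has a capacity of 300 MW.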

2 answers:

Answer 0: (score: 1)

Thankfully, regular expressions save the day.

# This line prevents character strings turning into factors
df<- data.frame(Col1,Col2, stringsAsFactors=FALSE)

# This pattern matches the power plant names, which all consist of
# one or more letters followed by a space, a hash, and a single digit.
pwrmatch <- regexpr("^[[:alpha:]]+ #[[:digit:]]", df$Col2)
df$Col3 <- "General Asset"
df$Col3[grepl("^[[:alpha:]]+ #[[:digit:]]", df$Col2)] <- regmatches(df$Col2, pwrmatch)

Col3 now looks like: c("Wabamun #4", "Keephills #2", "Wabamun #1", "General Asset", "General Asset", "General Asset", "Keephills #2", "Wabamun #4", "Sundance #1", "Keephills #2", "General Asset")
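The code above evaluates the same pattern twice, once with regexpr() and once with grepl(). As a minimal sketch, assuming a shortened four-row copy of the question's Col2, the result of regexpr() can also serve as the row mask, because it returns -1 wherever the pattern does not match. The name has_plant is only illustrative.

```r
# Shortened sample data taken from the question (assumption: four rows only)
Col2 <- c("Wabamun #4 off line.",
          "T217s has blown CTs on 778L",
          "Keephills #2 on line",
          "Homeland security event lowered to yellow ( elevated)")

# regexpr() returns the match position, or -1 where there is no match
pwrmatch <- regexpr("^[[:alpha:]]+ #[[:digit:]]", Col2)
has_plant <- pwrmatch != -1  # illustrative name, not from the answer

# Start with the default label, then overwrite the rows that matched.
# regmatches() returns only the matched substrings, in row order,
# so its length equals sum(has_plant).
Col3 <- rep("General Asset", length(Col2))
Col3[has_plant] <- regmatches(Col2, pwrmatch)
Col3
```

Reusing the mask keeps the pattern in one place, so a later change to the regular expression cannot leave the two calls out of sync.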

The other column is a similar problem; it is just a matter of matching all of the on/off line cases.

linematch <- regexpr("(on|off) line", df$Col2)
df$Col4 <- df$Col2
df$Col4[grepl("(on|off) line", df$Col2)] <- regmatches(df$Col2, linematch)

Col4 now looks like: c("off line", "on line", "on line", "North Red Deer T217s bus lock out. Under investigation.", "T217s has blown CTs on 778L", "T217s North Red Deer bus back in service (778L out of service)", "off line", "on line", "off line", "on line", "Homeland security event lowered to yellow ( elevated)" )
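Note that the pattern "(on|off) line" does not capture the trailing period, so this Col4 differs slightly from the one requested in the question ("off line." versus "off line"). If the period should be kept, one possible variation is to allow an optional period and anchor the match at the end of the string. This is a sketch on a shortened three-row sample, and the end-of-string anchor is an added assumption about where the status always appears.

```r
# Shortened sample data taken from the question (assumption: three rows only)
Col2 <- c("Wabamun #4 off line.",
          "Keephills #2 on line",
          "T217s has blown CTs on 778L")

# "\\.?" allows an optional trailing period; "$" requires the status
# to sit at the end of the string (an assumption, not in the answer)
linematch <- regexpr("(on|off) line\\.?$", Col2)
matched <- linematch != -1  # illustrative name

# Keep the original text for rows that do not match
Col4 <- Col2
Col4[matched] <- regmatches(Col2, linematch)
Col4
```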

Answer 1: (score: 0)

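> # Pre-allocate the two new columns
> Col3 <- Col4 <- character(nrow(df))
> # Rows without "#" are not power plant entries
> index <- grep("#", Col2, invert = TRUE)
> # Split the plant rows on " o"; the odd elements are the plant names
> spl1 <- unlist(strsplit(Col2[-index], " o"))[c(TRUE, FALSE)]
> Col3[-index] <- spl1
> Col3[index] <- "General Asset"
> # The even elements are the statuses, minus the "o" consumed by the split
> spl2 <- unlist(strsplit(Col2[-index], " o"))[c(FALSE, TRUE)]
> Col4[-index] <- paste("o", spl2, sep="")
> Col4[index] <- Col2[index]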
> Col3
## [1] "Wabamun #4"    "Keephills #2"  "Wabamun #1"    "General Asset"
## [5] "General Asset" "General Asset" "Keephills #2"  "Wabamun #4"   
## [9] "Sundance #1"   "Keephills #2"  "General Asset"
> Col4
##  [1] "off line."                                                     
##  [2] "on line."                                                      
##  [3] "on line."                                                      
##  [4] "North Red Deer T217s bus lock out.  Under investigation."      
##  [5] "T217s has blown CTs on 778L"                                   
##  [6] "T217s North Red Deer bus back in service (778L out of service)"
##  [7] "off line."                                                     
##  [8] "on line."                                                      
##  [9] "off line."                                                     
## [10] "on line"                                                       
## [11] "Homeland security event lowered to yellow ( elevated)"
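The split on " o" works for this data because every status starts with "o" and "#" appears only in plant names. As a hedged alternative under the same assumptions about the data, both columns can be derived from one pattern with capture groups and sub(). The pattern and the name is_plant below are illustrative, and allowing more than one digit after "#" is an assumption beyond what the sample shows.

```r
# Shortened sample data taken from the question (assumption: four rows only)
Col2 <- c("Wabamun #4 off line.",
          "Keephills #2 on line",
          "T217s North Red Deer bus back in service (778L out of service)",
          "Homeland security event lowered to yellow ( elevated)")

# Group 1: plant name ("letters #digits"); group 2: the rest of the message
pattern <- "^([[:alpha:]]+ #[[:digit:]]+) +(.*)$"
is_plant <- grepl(pattern, Col2)  # illustrative name

# sub() leaves non-matching strings unchanged, and ifelse() picks the
# fallback value for those rows
Col3 <- ifelse(is_plant, sub(pattern, "\\1", Col2), "General Asset")
Col4 <- ifelse(is_plant, sub(pattern, "\\2", Col2), Col2)

data.frame(Col3, Col4, stringsAsFactors = FALSE)
```

Because nothing is split away and glued back on, the status text is returned exactly as it appears in Col2, including any trailing period.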