R regex / gsub: extracting part of a pattern

Time: 2013-02-05 05:32:42

Tags: regex r grep gsub

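I have a list of weather stations with their latitude and longitude. The formatting is inconsistent: some values have degrees (the Hr columns below) and minutes, while others have degrees, minutes, and seconds. I can match the pattern with a regular expression, but I can't work out how to extract the individual parts.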

Here is the data:

> head(wthrStat1 )
     Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W

I would like something like this:

     Station latHr latMin latSec latDir lonHr lonMin lonSec lonDir
1940    K01R    31     08     00      N   092     34     00      W
1941    K01T    28     08     00      N   094     24     00      W
1942    K03Y    48     47     00      N   096     57     00      W
1943    K04V    38     05     50      N   106     10     07      W
1944    K05F    31     25     16      N   097     47     49      W
1945    K06D    48     53     04      N   099     37     15      W

I can get a match with this regular expression:

data.format <- "\\d{1,3}-\\d{1,3}(?:-\\d{1,3})?[NSWE]{1}"
grep(data.format, wthrStat1$lat)

But I'm not sure how to split the individual parts into columns. I have tried things like:

wthrStat1$latHr <- ifelse(grepl(data.format, wthrStat1$lat), gsub(????), NA)

but with no luck.

Here is a dput():

> dput(wthrStat1[1:10,] )
structure(list(Station = c("K01R", "K01T", "K03Y", "K04V", "K05F", 
"K06D", "K07G", "K07S", "K08D", "K0B9"), lat = c("31-08N", "28-08N", 
"48-47N", "38-05-50N", "31-25-16N", "48-53-04N", "42-34-28N", 
"47-58-27N", "48-18-03N", "43-20N"), lon = c("092-34W", "094-24W", 
"096-57W", "106-10-07W", "097-47-49W", "099-37-15W", "084-48-41W", 
"117-25-42W", "102-24-23W", "070-24W")), .Names = c("Station", 
"lat", "lon"), row.names = 1940:1949, class = "data.frame")

Any suggestions?

3 Answers:

Answer 0 (score: 7)

strapplyc from the gsubfn package extracts each parenthesized group in the regular expression:

library(gsubfn)
data.format <- "(\\d{1,3})-(\\d{1,3})-?(\\d{1,3})?([NSWE]{1})"
parts <- strapplyc(wthrStat1$lat, data.format, simplify = rbind)
parts[parts == ""] <- "00"

which gives:

> parts
      [,1] [,2] [,3] [,4]
 [1,] "31" "08" "00" "N" 
 [2,] "28" "08" "00" "N" 
 [3,] "48" "47" "00" "N" 
 [4,] "38" "05" "50" "N" 
 [5,] "31" "25" "16" "N" 
 [6,] "48" "53" "04" "N" 
 [7,] "42" "34" "28" "N" 
 [8,] "47" "58" "27" "N" 
 [9,] "48" "18" "03" "N" 
[10,] "43" "20" "00" "N" 
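
If installing gsubfn is not an option, roughly the same matrix can be built with base R's regexec() and regmatches(). The following is only a sketch, not part of the original answer: the name dms_pattern and the shortened lat vector are illustrative assumptions, and the pattern is a variant in which the seconds group is \\d{0,3}, so it always takes part in the match and captures "" when the seconds are absent. It also assumes every input string matches; a non-matching string would yield an empty vector and break the rbind step.

```r
# Sample values copied from the question's dput() output.
lat <- c("31-08N", "28-08N", "38-05-50N", "43-20N")

# Assumed variant of the pattern: the seconds group may be empty,
# so all four groups are present for every matching string.
dms_pattern <- "^(\\d{1,3})-(\\d{1,3})-?(\\d{0,3})([NSEW])$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into one character vector per input string.
matches <- regmatches(lat, regexec(dms_pattern, lat))

# Bind the vectors into a matrix and drop column 1 (the full match).
parts <- do.call(rbind, matches)[, -1, drop = FALSE]
colnames(parts) <- c("latHr", "latMin", "latSec", "latDir")

# Missing seconds come back as "", so pad them as in the answer above.
parts[parts == ""] <- "00"
parts
```

The same call can be repeated on the lon column with lon-prefixed column names.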

Answer 1 (score: 6)

This is quite inefficient, and I hope someone else has a better solution:

dat <- read.table(text ='   Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W', header = TRUE)


pattern <- '([0-9]+)[-]([0-9]+)([-|A-Z]+)([0-9]*)([A-Z]*)'

dat$latHr <- gsub(pattern,'\\1',dat$lat)
dat$latMin    <- gsub(pattern,'\\2',dat$lat)

latSec    <- gsub(pattern,'\\4',dat$lat)
latSec[nchar(latSec)==0] <- '00'
dat$latSec <- latSec

# The direction is always the final letter, so take it directly instead of
# relying on group 5, which is empty when there are no seconds.
dat$latDir <- gsub('.*([NSEW])$', '\\1', dat$lat)

dat
     Station       lat        lon latHr latMin latSec latDir
1940    K01R    31-08N    092-34W    31     08     00      N
1941    K01T    28-08N    094-24W    28     08     00      N
1942    K03Y    48-47N    096-57W    48     47     00      N
1943    K04V 38-05-50N 106-10-07W    38     05     50      N
1944    K05F 31-25-16N 097-47-49W    31     25     16      N
1945    K06D 48-53-04N 099-37-15W    48     53     04      N
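
The same backreference idea can be wrapped in a small function and applied to both lat and lon, which gets closer to the layout requested in the question. This is a sketch under stated assumptions rather than the answerer's method: split_dms and dms_pattern are hypothetical names, the pattern is a stricter anchored variant whose seconds group may be empty, and the two-row data frame is a cut-down copy of the question's data. Strings that do not match the pattern yield NA in every column.

```r
# Two rows copied from the question, one without and one with seconds.
dat <- data.frame(
  Station = c("K01R", "K04V"),
  lat = c("31-08N", "38-05-50N"),
  lon = c("092-34W", "106-10-07W"),
  stringsAsFactors = FALSE
)

# Assumed pattern: group 3 (seconds) captures "" when seconds are absent.
dms_pattern <- "^(\\d{1,3})-(\\d{1,3})-?(\\d{0,3})([NSEW])$"

# Hypothetical helper: split one coordinate column into four columns
# named <prefix>Hr, <prefix>Min, <prefix>Sec and <prefix>Dir.
split_dms <- function(x, prefix) {
  x <- as.character(x)
  ok <- grepl(dms_pattern, x)
  # Return capture group i for matching strings and NA for the rest.
  group <- function(i) {
    ifelse(ok, sub(dms_pattern, paste0("\\", i), x), NA_character_)
  }
  sec <- group(3)
  sec[ok & sec == ""] <- "00"  # pad missing seconds
  out <- data.frame(group(1), group(2), sec, group(4),
                    stringsAsFactors = FALSE)
  names(out) <- paste0(prefix, c("Hr", "Min", "Sec", "Dir"))
  out
}

result <- cbind(dat["Station"],
                split_dms(dat$lat, "lat"),
                split_dms(dat$lon, "lon"))
result
```

The values stay as character strings so that leading zeros such as "092" are preserved, matching the requested output.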

Answer 2 (score: 2)

Another answer, using stringr:

# example data
data <-
"Station       lat        lon
1940    K01R    31-08N    092-34W
1941    K01T    28-08N    094-24W
1942    K03Y    48-47N    096-57W
1943    K04V 38-05-50N 106-10-07W
1944    K05F 31-25-16N 097-47-49W
1945    K06D 48-53-04N 099-37-15W"

## read string into a data.frame
df <- read.table(text = data, header = TRUE, stringsAsFactors = FALSE)

pattern <- "(\\d{1,3})-(\\d{1,3})(?:-(\\d{1,3}))?([NSWE]{1})"

library(stringr)
str_match(df$lat, pattern)

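This returns a character matrix with one column for the whole match and one column for each capture group. The output below is from the stringr version current when this was written; recent stringr releases return NA rather than "" for an optional group that does not take part in the match.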

     [,1]        [,2] [,3] [,4] [,5]
[1,] "31-08N"    "31" "08" ""   "N" 
[2,] "28-08N"    "28" "08" ""   "N" 
[3,] "48-47N"    "48" "47" ""   "N" 
[4,] "38-05-50N" "38" "05" "50" "N" 
[5,] "31-25-16N" "31" "25" "16" "N" 
[6,] "48-53-04N" "48" "53" "04" "N" 

R's string-processing capabilities have come a long way in the past few years.