Question

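I am trying to extract sets of coordinates from strings and change their format.

I have tried some functions from the stringr package without managing to extract anything with a pattern. This is my first time working with regular expressions, and building patterns is still a bit confusing.

I have a data frame in which one column contains one or more sets of coordinates. The only pattern (in most cases) is that a hyphen (-) separates Lat from Long, and a slash (/) separates one set of coordinates from the next.

Here is a sample of the data: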

ID  Coordinates
1   3438-5150
2   3346-5108/3352-5120 East island, South port
3   West coast (284312 472254)
4   28.39.97-47.05.62/29.09.13-47.44.03
5   2843-4722/3359-5122(1H-2H-3H-4F)

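Most of the data is in decimal degrees, e.g. (ID 1 is Lat 34.38 Lon 51.50), while some is in 00º00'00'' format, e.g. (ID 4 is Lat 28º39'97'' Lon 47º05'62'').

I need to complete several steps:

1 - Extract all coordinate sets, creating a new row for each set in a record;

2 - Extract a record's text labels into a new column, concatenating them;

3 - Convert coordinates from 00º00'00'' (28.39.97) to 00.0000º (28.6769, decimal degrees) so that all coordinates share the same format. If they were numeric, I could convert them easily.

4 - Add a point (.) to separate the decimal degree values (from 3438 to 34.38) and add a minus sign (-) to mark them as south-western hemisphere (-34.38). All values must carry the (-) sign.

I am aiming for something like this:

Steps 1 and 2 - extract the coordinate sets and names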

ID  x           y          label
1   3438        5150      
2   3346        5108      East island, South port
2   3352        5120      East island, South port
3   284312      472254    West coast
4   28.39.97    47.05.62    
4   29.09.13    47.44.03
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 3 - convert the coordinate format to decimal degrees (ID 4)

ID  x           y       label
1   3438        5150    
2   3346        5108    East island, South port
2   3352        5120    East island, South port
3   284312      472254  West coast
4   286769      471005  
4   291536      470675
5   2843        4722      1H-2H-3H-4F
5   3359        5122      1H-2H-3H-4F

Step 4 - change the display format

ID   x          y         label
1   -34.38      -51.50    
2   -33.46      -51.08    East island, South port
2   -33.52      -51.20    East island, South port
3   -28.43      -47.22    West coast
4   -28.6769    -47.1005    
4   -29.1536    -47.0675
5   -28.43      -47.22    1H-2H-3H-4F
5   -33.59      -51.22    1H-2H-3H-4F

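I have edited this question to clarify my problem and change some of my requirements. I realise it was hard to follow.

So, has anyone worked with something similar? Any other suggestions would be a great help.

Thanks again for taking the time to help.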

Answer 1

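Note: the first answer addresses the question as originally asked, and the last one addresses its current state. The data in data1 should be set appropriately for each solution.

Given the data you provided and the expected output, the following (using dplyr and tidyr) should solve your first problem.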

library(dplyr)
library(tidyr)

### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120", 
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA, 
-4L))

### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }

### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
    mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
##   ID   x.1   y.1   x.2   y.2
## 1  1 34.38 51.50    NA    NA
## 2  2 33.46 51.08 33.52 51.20
## 3  3 28.43 47.22 33.59 51.22
## 4  4 28.39 47.05 29.09 47.44

The call to mutate modifies Coordinates twice to make the replacements easier.
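To see what the two replacements do on their own, here is a small base R sketch that applies the same two gsub patterns to two of the sample strings. The variable names s, step1 and step2 are illustrative only and are not part of the answer's code.

```r
# Two sample strings taken from the data in the question
s <- c("2843-4722/3359-5122(1H-2H-3H-4F)",
       "28.39.97-47.05.62/29.09.13-47.44.03")

# First replacement: drop any parenthesised annotation such as "(1H-2H-3H-4F)"
step1 <- gsub('\\([^)]+\\)', '', s)

# Second replacement: truncate "12.34.56" to "12.34"
# (the third group is discarded here, not converted to decimal degrees)
step2 <- gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', step1)

print(step2)
# [1] "2843-4722/3359-5122"     "28.39-47.05/29.09-47.44"
```

After these two passes, every string contains only coordinate tokens separated by - and /, which is what lets separate split on '[-/]'.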

Edit

A variant that uses an additional regular-expression replacement instead of mutate_at.

data1 %>%
    mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
           Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
           Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
    separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)

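Edit 2: the following solution addresses the updated version of the question

The following solution applies a number of transformations to convert the data. They are kept separate to make it easier to reason about (relatively speaking).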

library(dplyr)
library(tidyr)
library(stringr)  # str_split() used below comes from stringr

data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port",
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03",
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))

### Function for converting to numeric values and
### handles case of "12.34.56" (hours/min/sec)
hms_convert <- function(llval) {
    nres <- rep(0, length(llval))
    coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
    nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3),
                                     function(x) { sum(as.numeric(x) / c(1,60,3600))})
    nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
    nres
}

### Each mutate works to transform the various data formats
### into a single format. The 'separate' commands then split
### the data into the appropriate columns. The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
    mutate(Coordinates_new = Coordinates) %>%
    mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
    mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
    mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
    mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
    separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
    mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
    separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
    mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
    mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))
##   ID                                 Coordinates       x.1       y.1       x.2       y.2                   label
## 1  1                                   3438-5150 -34.38000 -51.50000        NA        NA                    <NA>
## 2  2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3  3                  East coast (284312 472254) -28.72000 -47.38167        NA        NA              East coast
## 4  4         28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417                    <NA>
## 5  5            2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000                    <NA>
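The degrees/minutes/seconds arithmetic inside hms_convert can be checked in isolation. The following is a minimal base R sketch, assuming every input has exactly three dot-separated parts; dms_to_decimal is a hypothetical helper name introduced here for illustration and is not part of the answer above.

```r
# Hypothetical helper: convert "DD.MM.SS" strings to decimal degrees
# using degrees + minutes / 60 + seconds / 3600.
# Assumes each input string has exactly three dot-separated parts.
dms_to_decimal <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) / c(1, 60, 3600)), numeric(1))
}

# The first coordinate pair of ID 4 from the question
vals <- dms_to_decimal(c("28.39.97", "47.05.62"))
print(round(vals, 5))
# [1] 28.67694 47.10056
```

These values agree, up to sign, with the x.1 and y.1 entries for ID 4 in the output above.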

Answer 2

We can use stringi. We insert a . between the four digits with gsub, then use stri_extract_all_regex (from stringi) to extract two digits followed by a dot followed by two digits (\\d{2}\\.\\d{2}), which gives a list as output. Since the list elements have unequal lengths, we pad those shorter than the maximum length with NA at the end and convert to a matrix (using stri_list2matrix). After converting to a data.frame, we change the character columns to numeric and cbind with the "ID" column of the original dataset.

library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})",
          "\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
cbind(data1[1], d1)
#  ID   x.1   y.1   x.2   y.2
#1  1 34.38 51.50    NA    NA
#2  2 33.46 51.08 33.52 51.20
#3  3 28.43 47.22 33.59 51.22
#4  4 28.39 47.05 29.09 47.44

However, this can also be done with base R.
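The base R code itself is not shown here. One way it might look is sketched below, assuming the four-row data1 from Answer 1 and at most two coordinate sets per row; this is an illustration using regmatches and gregexpr, not the answerer's original code, and the names dotted, tokens, padded and res are made up for the example.

```r
# data1 as in the first version of Answer 1 (four rows)
data1 <- data.frame(
  ID = 1:4,
  Coordinates = c("3438-5150", "3346-5108/3352-5120",
                  "2843-4722/3359-5122(1H-2H-3H-4F)",
                  "28.39.97-47.05.62/29.09.13-47.44.03"),
  stringsAsFactors = FALSE
)

# Insert a dot into every run of four digits: "3438" -> "34.38"
dotted <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)

# Extract every "dd.dd" token; the result is a list with one character vector per row
tokens <- regmatches(dotted, gregexpr("\\d{2}\\.\\d{2}", dotted))

# Indexing with [1:4] pads short vectors with NA (assumes at most 4 tokens per row);
# vapply returns a 4 x n matrix, so transpose it to get one row per record
padded <- t(vapply(tokens, function(v) as.numeric(v)[1:4], numeric(4)))
colnames(padded) <- paste0(c("x.", "y."), rep(1:2, each = 2))

res <- cbind(data1["ID"], padded)
print(res)
```

As with the stringi version, values in 12.34.56 form are truncated to 12.34 rather than converted to decimal degrees.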
