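Conditionally split a data frame string column at the second space

Time: 2019-01-15 13:41:24

Tags: r string tidyr data-cleaning

I have a data frame, and I want to split the text string in the first column into two columns, but only after the second space in the string. Here is an example: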

                           test22 Ticker
1        Current SharePrice $6.57    MFM
2               Current NAV $7.11    MFM
3 Current Premium/Discount -7.59%    MFM
4        52WkAvg SharePrice $6.55    MFM
5               52WkAvg NAV $7.21    MFM
6 52WkAvg Premium/Discount -9.19%    MFM

Basically, the end result would be a data frame with three columns in total, with the price/% field as its own separate column. Thanks!

3 answers:

Answer 0: (score: 0)

One option in base R is to use sub to create a delimiter and then read the result with read.csv

out <- cbind(read.csv(text = sub(" (\\S+)$", ",\\1", df1$test22), 
       header = FALSE, stringsAsFactors = FALSE), df1[2])
out
#                        V1     V2 Ticker
#1       Current SharePrice  $6.57    MFM
#2              Current NAV  $7.11    MFM
#3 Current Premium/Discount -7.59%    MFM
#4       52WkAvg SharePrice  $6.55    MFM
#5              52WkAvg NAV  $7.21    MFM
#6 52WkAvg Premium/Discount -9.19%    MFM
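
To see why this works, it can help to look at the intermediate strings that sub hands to read.csv. The following is a minimal sketch on two of the sample strings; the names x, delimited and parsed are made up for this illustration. Note that the pattern anchors on the last space (via $), which coincides with the second space here only because every row has exactly three tokens.

```r
# Two of the sample strings from the question
x <- c("Current SharePrice $6.57", "52WkAvg Premium/Discount -9.19%")

# Replace the last space (the one before the final run of non-space
# characters) with a comma, so each string becomes a two-field CSV record
delimited <- sub(" (\\S+)$", ",\\1", x)
delimited

# read.csv then parses the records into two columns, V1 and V2
parsed <- read.csv(text = delimited, header = FALSE, stringsAsFactors = FALSE)
parsed
```

If the labels could themselves contain commas, a delimiter that cannot occur in the data would presumably be a safer choice than a comma.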

Or use extract from tidyr

library(tidyverse)
df1 %>% 
     extract(test22, into = c("V1", "V2"), "^(\\S+\\s+\\S+)\\s+(.*)")
#                        V1     V2 Ticker
#1       Current SharePrice  $6.57    MFM
#2              Current NAV  $7.11    MFM
#3 Current Premium/Discount -7.59%    MFM
#4       52WkAvg SharePrice  $6.55    MFM
#5              52WkAvg NAV  $7.21    MFM
#6 52WkAvg Premium/Discount -9.19%    MFM

Data

df1 <- structure(list(test22 = c("Current SharePrice $6.57", "Current NAV $7.11", 
  "Current Premium/Discount -7.59%", "52WkAvg SharePrice $6.55", 
 "52WkAvg NAV $7.21", "52WkAvg Premium/Discount -9.19%"), Ticker = c("MFM", 
 "MFM", "MFM", "MFM", "MFM", "MFM")), class = "data.frame", row.names = c("1", 
  "2", "3", "4", "5", "6"))

Answer 1: (score: 0)

Here is an option using strsplit

data.frame(do.call(rbind, strsplit(df$test22, '\\s(?!.*\\s)', perl = TRUE)), 
           Ticker=df$Ticker)
#                         X1     X2 Ticker
# 1       Current SharePrice  $6.57    MFM
# 2              Current NAV  $7.11    MFM
# 3 Current Premium/Discount -7.59%    MFM
# 4       52WkAvg SharePrice  $6.55    MFM
# 5              52WkAvg NAV  $7.21    MFM
# 6 52WkAvg Premium/Discount -9.19%    MFM

Or use gsub

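gsub('^.*?\\s.*?\\s(.*)', '\\1', df$test22, perl = TRUE)
# [1] "$6.57"  "$7.11"  "-7.59%" "$6.55"  "$7.21"  "-9.19%"
# or if factors
# gsub('^.*?\\s.*?\\s(.*)', '\\1', as.character(df$test22), perl = TRUE)

The advantage of the second option is that it really targets the second space character (as opposed to the last space). This relies on the leading ^.*? being anchored and lazy; with a greedy leading .* the pattern would keep only the text after the last space whenever a string contains more than two spaces.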
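
The difference between "the last space" and "the second space" only shows up when a string has more than three tokens. The sketch below contrasts the two on a made-up four-token string (it is not part of the question's data); the variable names are hypothetical and only base R is used.

```r
# A hypothetical string with more than three tokens
x <- "Current Premium/Discount -7.59% (est.)"

# Split at the last space: the lookahead asserts that no further
# whitespace follows the space being matched
last_split <- strsplit(x, "\\s(?!.*\\s)", perl = TRUE)[[1]]
last_split

# Keep everything after the second space: skip two runs of non-space
# characters, each followed by whitespace, then capture the remainder
after_second <- sub("^\\S+\\s+\\S+\\s+(.*)$", "\\1", x)
after_second

# Keep everything before the second space
before_second <- sub("^(\\S+\\s+\\S+)\\s+.*$", "\\1", x)
before_second
```

On the question's three-token rows both approaches give the same result, so the choice only matters if longer strings can occur.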

Answer 2: (score: 0)

Here is an option using dplyr and stringr:

library(dplyr)
library(stringr)

data <-
  tibble(test22 = c("Current SharePrice $6.57",
                    "Current NAV $7.11",
                    "Current Premium/Discount -7.59%",
                    "52WkAvg SharePrice $6.55",
                    "52WkAvg NAV $7.21",
                    "52WkAvg Premium/Discount -9.19%"),
         Ticker = "MFM")

data %>% 
  mutate(category = str_replace(test22, "^(.+ .+) (.+)$", "\\1"),
         price_pc = str_replace(test22, "^(.+ .+) (.+)$", "\\2"))


# A tibble: 6 x 4
test22                          Ticker category                 price_pc
<chr>                           <chr>  <chr>                    <chr>   
1 Current SharePrice $6.57        MFM    Current SharePrice       $6.57   
2 Current NAV $7.11               MFM    Current NAV              $7.11   
3 Current Premium/Discount -7.59% MFM    Current Premium/Discount -7.59%  
4 52WkAvg SharePrice $6.55        MFM    52WkAvg SharePrice       $6.55   
5 52WkAvg NAV $7.21               MFM    52WkAvg NAV              $7.21   
6 52WkAvg Premium/Discount -9.19% MFM    52WkAvg Premium/Discount -9.19% 

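Edit: explanation of the regular expression used

Ignore the parentheses for a second:

^ = start of the string

. = any character except a newline

+ = at least one of the preceding character (in this case, any character except a newline)

$ = end of the string

So "^(.+ .+) (.+)$" looks for a string that consists of: the start, some characters, then a space, then some characters, then a space, then some more characters, then the end.

The parentheses are added as "capture groups", which means the query "remembers" the parts of the string matched inside those parentheses, and they can be extracted by referring to the order of the parentheses. So "\\1" returns what the first pair of parentheses captured and "\\2" returns what the second pair captured.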
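
As a small illustration of capture groups outside of stringr, the sketch below applies the same pattern with base R functions to one of the sample strings. The variable names are made up for this example; regexec, regmatches and sub are base R functions.

```r
x <- "Current NAV $7.11"
pattern <- "^(.+ .+) (.+)$"

# regexec locates the full match and each capture group;
# regmatches extracts them in order: full match, group 1, group 2
parts <- regmatches(x, regexec(pattern, x))[[1]]
parts

# Back-references in the replacement pick out one group at a time
category <- sub(pattern, "\\1", x)
price_pc <- sub(pattern, "\\2", x)
category
price_pc
```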

Regexr is a great resource for learning regex.