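I have a data frame, and I want to split the text strings in the first column into two columns, but only after the second space in each string. Here is an example: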
test22 Ticker
1 Current SharePrice $6.57 MFM
2 Current NAV $7.11 MFM
3 Current Premium/Discount -7.59% MFM
4 52WkAvg SharePrice $6.55 MFM
5 52WkAvg NAV $7.21 MFM
6 52WkAvg Premium/Discount -9.19% MFM
Basically, the end result should be a data frame with three columns in total, with the price/% field as its own separate column. Thanks!
Answer 0: (score: 0)
One option in base R is to use sub to insert a delimiter and then read the result with read.csv:
out <- cbind(read.csv(text = sub(" (\\S+)$", ",\\1", df1$test22),
header = FALSE, stringsAsFactors = FALSE), df1[2])
out
# V1 V2 Ticker
#1 Current SharePrice $6.57 MFM
#2 Current NAV $7.11 MFM
#3 Current Premium/Discount -7.59% MFM
#4 52WkAvg SharePrice $6.55 MFM
#5 52WkAvg NAV $7.21 MFM
#6 52WkAvg Premium/Discount -9.19% MFM
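The intermediate step in the base R approach can be inspected on its own. The following is a minimal sketch using two strings from the question; the variable names x, delimited, and parsed are illustrative and not part of the original answer. Note that this pattern replaces the last space, which happens to coincide with the second space for this data.

```r
# Two sample strings taken from the question's data
x <- c("Current SharePrice $6.57", "52WkAvg Premium/Discount -9.19%")

# Replace the space before the final run of non-space characters with a
# comma, so each string becomes a two-field CSV record
delimited <- sub(" (\\S+)$", ",\\1", x)
delimited

# Parse the delimited strings as if they were lines of a CSV file
parsed <- read.csv(text = delimited, header = FALSE, stringsAsFactors = FALSE)
parsed
```

Printing delimited before calling read.csv is a quick way to check that the comma landed where expected. If the values could themselves contain commas, a different delimiter would likely be safer.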
Or use extract from tidyr:
library(tidyverse)
df1 %>%
extract(test22, into = c("V1", "V2"), "^(\\S+\\s+\\S+)\\s+(.*)")
# V1 V2 Ticker
#1 Current SharePrice $6.57 MFM
#2 Current NAV $7.11 MFM
#3 Current Premium/Discount -7.59% MFM
#4 52WkAvg SharePrice $6.55 MFM
#5 52WkAvg NAV $7.21 MFM
#6 52WkAvg Premium/Discount -9.19% MFM
Data:
df1 <- structure(list(test22 = c("Current SharePrice $6.57", "Current NAV $7.11",
"Current Premium/Discount -7.59%", "52WkAvg SharePrice $6.55",
"52WkAvg NAV $7.21", "52WkAvg Premium/Discount -9.19%"), Ticker = c("MFM",
"MFM", "MFM", "MFM", "MFM", "MFM")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Answer 1: (score: 0)
Here is an option using strsplit:
data.frame(do.call(rbind, strsplit(df$test22, '\\s(?!.*\\s)', perl = TRUE)),
Ticker=df$Ticker)
# X1 X2 Ticker
# 1 Current SharePrice $6.57 MFM
# 2 Current NAV $7.11 MFM
# 3 Current Premium/Discount -7.59% MFM
# 4 52WkAvg SharePrice $6.55 MFM
# 5 52WkAvg NAV $7.21 MFM
# 6 52WkAvg Premium/Discount -9.19% MFM
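Or using gsub:
gsub('^.*?\\s.*?\\s(.*)$', '\\1', df$test22, perl = TRUE)
# [1] "$6.57" "$7.11" "-7.59%" "$6.55" "$7.21" "-9.19%"
# or, if the column is a factor
# gsub('^.*?\\s.*?\\s(.*)$', '\\1', as.character(df$test22), perl = TRUE)
The advantage of the second approach is that it genuinely splits at the second space character (as opposed to the last space). For this to hold, the leading quantifier has to be lazy (.*?); a greedy .* would backtrack from the end of the string and capture only the text after the last space.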
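The difference between splitting at the second space and splitting at the last space only shows up when a string contains more than two spaces. Here is a small sketch on a hypothetical input (the trailing word "estimated" is invented for illustration and does not appear in the question's data); the two patterns are written with base R sub and are assumptions of this example rather than code from the answer.

```r
# Hypothetical input with three spaces (not from the question's data)
x <- "52WkAvg Premium/Discount -9.19% estimated"

# Split after the second space: two space-free tokens, one space, then
# everything that remains
second <- sub("^(\\S+\\s\\S+)\\s(.*)$", "\\2", x)

# Split after the last space: keep only the final space-free token
last <- sub("^.*\\s(\\S+)$", "\\1", x)

second  # "-9.19% estimated"
last    # "estimated"
```

For the six rows in the question both approaches return the same values, because every string has exactly two spaces.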
Answer 2: (score: 0)
Here is an option using dplyr and stringr:
library(dplyr)
library(stringr)
data <-
tibble(test22 = c("Current SharePrice $6.57",
"Current NAV $7.11",
"Current Premium/Discount -7.59%",
"52WkAvg SharePrice $6.55",
"52WkAvg NAV $7.21",
"52WkAvg Premium/Discount -9.19%"),
Ticker = "MFM")
data %>%
mutate(category = str_replace(test22, "^(.+ .+) (.+)$", "\\1"),
price_pc = str_replace(test22, "^(.+ .+) (.+)$", "\\2"))
# A tibble: 6 x 4
test22 Ticker category price_pc
<chr> <chr> <chr> <chr>
1 Current SharePrice $6.57 MFM Current SharePrice $6.57
2 Current NAV $7.11 MFM Current NAV $7.11
3 Current Premium/Discount -7.59% MFM Current Premium/Discount -7.59%
4 52WkAvg SharePrice $6.55 MFM 52WkAvg SharePrice $6.55
5 52WkAvg NAV $7.21 MFM 52WkAvg NAV $7.21
6 52WkAvg Premium/Discount -9.19% MFM 52WkAvg Premium/Discount -9.19%
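Edit: explanation of the regular expression used
Ignoring the parentheses for a moment:
^ = start of the string
. = any character except a newline
+ = one or more of the preceding character (in this case, any character except a newline)
$ = end of the string
So "^(.+ .+) (.+)$" looks for a string that, from its start, consists of: some characters, then a space, then some characters, then a space, then some more characters, and then the end.
The parentheses add "capture groups", meaning the pattern "remembers" the parts of the string matched inside each pair of parentheses, and those parts can be extracted by referring to the parentheses in order. So "\\1" returns what the first pair of parentheses captured, and "\\2" returns what the second pair captured.
Regexr is a good resource for learning regex.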
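To make the capture groups concrete, here is a minimal sketch on a single string from the question. It uses base R sub rather than stringr::str_replace, on the assumption that the two behave the same for a single match like this one; the variable names are illustrative only.

```r
x <- "Current NAV $7.11"
pattern <- "^(.+ .+) (.+)$"

# "\\1" keeps what the first pair of parentheses matched
category <- sub(pattern, "\\1", x)
# "\\2" keeps what the second pair of parentheses matched
price <- sub(pattern, "\\2", x)
# Groups can also be reordered or combined in the replacement
swapped <- sub(pattern, "\\2 | \\1", x)

category  # "Current NAV"
price     # "$7.11"
swapped   # "$7.11 | Current NAV"
```

Because the whole string is matched from ^ to $, the replacement text stands in for the entire input, which is why only the referenced group is returned.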