Splitting a column variable on the first instance of a match in R using regex and tidyr

Time: 2017-01-04 23:44:16

Tags: r regex tidyr

I'm trying to split a column in an R data frame whose values contain multiple spaces, but I only want to split on the first space. Example data frame:

df <- data.frame(game = c(1, 2, 3, 4, 5, 6), date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5", "Thursday Apr 6", "Friday Apr 7", "Saturday Apr 8"))

I'm trying to use tidyr to split the df 'date' column at the first space, so that the day is in its own column:

  game       day date
1    1    Monday  Apr 3
2    2   Tuesday  Apr 4
3    3 Wednesday  Apr 5
4    4  Thursday  Apr 6
5    5    Friday  Apr 7
6    6  Saturday  Apr 8

That is the problem. Below is what I tried and what went wrong.

According to the tidyr documentation, the default for 'sep' is 'a regular expression that matches any sequence of non-alphanumeric values.' So if I do this:

df %>% separate(date, c("day", "date"))

it splits on spaces, but it splits on both of them (the space after 'Monday' and the space after 'Apr' in 'Monday Apr 3'). The result is:

  game       day date
1    1    Monday  Apr
2    2   Tuesday  Apr
3    3 Wednesday  Apr
4    4  Thursday  Apr
5    5    Friday  Apr
6    6  Saturday  Apr
Warning message:
Too many values at 6 locations: 1, 2, 3, 4, 5, 6

I can add a regex that selects only the first space (I checked that this regex works in Sublime Text):

df %>% separate(date, c("day", "date"), sep='^[^\\s]*\\K\\s')

But this does not give me the split I want either.

So what is going wrong? How can I make this work? Or what am I misunderstanding?

3 answers:

Answer 0 (score: 9)

You need to set the extra argument to "merge":

library(tidyr)
df %>% separate(date, c("day", "date"), extra = "merge")

#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
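
As a small variation on the call above, here is a sketch that also sets sep to a literal space, so the split no longer relies on the default non-alphanumeric pattern. The explicit sep = " " and stringsAsFactors = FALSE are my additions for illustration, not part of the answer; extra = "merge" is what keeps everything after the first split in the last column.

```r
library(tidyr)

# Same sample data as in the question (stringsAsFactors = FALSE is an
# assumption added here so that 'date' is a character column on any R version)
df <- data.frame(
  game = c(1, 2, 3, 4, 5, 6),
  date = c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5",
           "Thursday Apr 6", "Friday Apr 7", "Saturday Apr 8"),
  stringsAsFactors = FALSE
)

# Split on a single space; extra = "merge" splits at most once for two
# target columns, so "Apr 3" stays together in 'date'
res <- separate(df, date, c("day", "date"), sep = " ", extra = "merge")
print(res)
```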

Answer 1 (score: 1)

Psidom has you covered on the first warning message about too many values. As for your second approach, you end up with too few values, in part because \\K does not work with stringi, which separate is using. You can see this for yourself with stringi::stri_split_regex(df$date, '^[^\\s]*\\K\\s'). So that regex does not split anything, and you end up with the warning message about too few values.

Some alternative regular expressions that you can specify as sep:

# a space not followed by a digit
df %>% separate(date, c("day", "date"), sep = "\\s(?!\\d)")
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8

You can't use \\K, but if you need a variable-length lookbehind, the quantifiers need to be bounded:

# a space preceded by 3 - 6 characters and "day".
# 3 - 6 characters allows "Monday" and "Wednesday"
"(?<=.{3,6}day)\\s"
# same idea
"(?<=\\S{3,6}day)\\s"
# same idea
"(?<=.?.?.?...day)\\s"
# same idea, but using ^ to anchor and not using "day"
"(?<=^\\S{0,9})\\s"
# space followed by some other characters, a space, digit(s) and the end of the line
"\\s(?=.+\\s\\d+$)"
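
To check the patterns themselves outside of separate, here is a minimal base R sketch. It uses base R's PCRE engine (perl = TRUE), not stringi, so it only shows that the \\K pattern from the question is valid PCRE and that the lookahead alternative splits once per string; the variable names are made up for illustration.

```r
dates <- c("Monday Apr 3", "Wednesday Apr 5")

# \\K is a PCRE feature, so base R needs perl = TRUE for it.
# Replacing the matched space with "|" shows which space the pattern selects.
marked <- sub("^\\S*\\K\\s", "|", dates, perl = TRUE)
print(marked)
# "Monday|Apr 3"    "Wednesday|Apr 5"

# The lookahead alternative: a space not followed by a digit.
pieces <- strsplit(dates, "\\s(?!\\d)", perl = TRUE)
print(pieces)
```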

Answer 2 (score: 1)

We can do this easily with base R:

cbind(df[1], read.csv(text=sub("\\s+", ",", df$date),
             header=FALSE, col.names = c("day", "date")))
#  game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
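
Another base R route, offered only as a sketch and not as part of the answer above: regexpr finds just the first match in each string, and regmatches with invert = TRUE returns the text before and after that match. The names first_space, parts, and out are hypothetical.

```r
dates <- c("Monday Apr 3", "Tuesday Apr 4", "Wednesday Apr 5")

# regexpr() locates only the first space in each string
first_space <- regexpr(" ", dates, fixed = TRUE)

# invert = TRUE returns the pieces around that first match
parts <- regmatches(dates, first_space, invert = TRUE)

# Bind the pieces into a two-column data frame
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("day", "date")
print(out)
```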

Or another option is extract from tidyr:

library(tidyr)
extract(df, date, into = c("day", "date"), "(\\S+)\\s+(.*)")
#   game       day  date
#1    1    Monday Apr 3
#2    2   Tuesday Apr 4
#3    3 Wednesday Apr 5
#4    4  Thursday Apr 6
#5    5    Friday Apr 7
#6    6  Saturday Apr 8
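
To see what the two capture groups in that pattern pick up, here is a small base R sketch using sub with backreferences. Anchoring with ^ and $ and passing perl = TRUE are my additions for the illustration; the pattern is otherwise the one passed to extract.

```r
dates <- c("Monday Apr 3", "Saturday Apr 8")

# Group 1: the leading run of non-space characters
# Group 2: everything after the first run of whitespace
pattern <- "^(\\S+)\\s+(.*)$"

day  <- sub(pattern, "\\1", dates, perl = TRUE)
rest <- sub(pattern, "\\2", dates, perl = TRUE)

print(day)   # "Monday"   "Saturday"
print(rest)  # "Apr 3"    "Apr 8"
```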