Incorporating a regex as the separator in tidyr

Time: 2017-04-08 13:29:46

Tags: r regex

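I'm trying to restructure my data, and it gets hung up on certain cell strings, so I want to adjust it so that the separator is only the following sequence:

-End of word
-";"
-1 space
-Capital letter

I'm new to regex, but this seems to capture what I'm looking for: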

";\s[A-Z]"

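However, it also includes the first letter of the second word, which I don't want to be part of the separator. I'm also not sure how to incorporate it into my "separate_rows" statement.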

    library(tidyverse)

    # Create test data
    mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2"))
    names(mydata) <- "TEST"
    mydata$TEST <- as.character(mydata$TEST)

    # Split TEST into one row per field, keeping a row counter
    mydata %>%
      mutate(row = row.names(mydata)) %>%
      separate_rows(TEST, sep = '; ')

Current output:

row|TEST
1|Column1 = answer1
1|Column2 = answer2
1|incorrectly formatted - should be connected with answer2
2|Column1 = answer1
2|Column2 = answer2
2|incorrectly formatted - should be connected with answer2
3|Column1 = answer1
3|Column2 = answer2
3|incorrectly formatted - should be connected with answer2

Output I'm looking for:

row|TEST
1|Column1 = answer1
1|Column2 = answer2;  incorrectly formatted - should be connected with answer2
2|Column1 = answer1
2|Column2 = answer2;  incorrectly formatted - should be connected with answer2
3|Column1 = answer1
3|Column2 = answer2;  incorrectly formatted - should be connected with answer2

Any help is greatly appreciated!

2 answers:

Answer 0: (score: 3)

You can use a positive lookaround (in your case, a lookahead) to solve your problem:

Read: http://www.regular-expressions.info/lookaround.html

    library(tidyverse)

    # Same test data as in the question
    mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2;  incorrectly formatted - should be connected with answer2"))
    names(mydata) <- "TEST"
    mydata$TEST <- as.character(mydata$TEST)
    # View(mydata)  # optional: inspect the data interactively

    # Split only on a ";" that is followed by one whitespace character and a capital letter
    mydata %>%
      mutate(row = row.names(mydata)) %>%
      separate_rows(TEST, sep = ';(?=\\s[A-Z])')

Output

    row
1   1
2   1
3   2
4   2
5   3
6   3
                                                                           TEST
1                                                             Column1 = answer1
2  Column2 = answer2;  incorrectly formatted - should be connected with answer2
3                                                             Column1 = answer1
4  Column2 = answer2;  incorrectly formatted - should be connected with answer2
5                                                             Column1 = answer1
6  Column2 = answer2;  incorrectly formatted - should be connected with answer2

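The expression inside the parentheses, (?=...), checks for the pattern but does not capture it. Hence the characters it inspects are never consumed as part of the match, so only the ";" is removed as the separator. Note that the space after the semicolon therefore stays at the start of the next piece.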
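To make the difference concrete, here is a minimal sketch that compares the asker's consuming pattern with the lookahead version. It uses base R's strsplit() with perl = TRUE instead of separate_rows() so that it runs without any packages; the shortened string x is a made-up stand-in for one cell of the test data, and the third pattern (which also consumes the single space) is a variation that does not appear in the thread.

```r
# Shortened, hypothetical stand-in for one cell of the TEST column
x <- "Column1 = answer1; Column2 = answer2;  incorrectly formatted"

# Consuming pattern from the question: the capital letter is part of the
# separator, so the "C" of "Column2" is lost.
consumed <- strsplit(x, ";\\s[A-Z]", perl = TRUE)[[1]]

# Lookahead pattern from this answer: only ";" is the separator, so the
# space and the capital letter stay in the second piece.
lookahead <- strsplit(x, ";(?=\\s[A-Z])", perl = TRUE)[[1]]

# Variation (an assumption, not from the thread): consume the space as well,
# and keep only the capital letter inside the lookahead.
trimmed <- strsplit(x, ";\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(lookahead)
print(trimmed)
```

In all three cases the ";" followed by two spaces and a lowercase letter is left alone, because none of the patterns match there. If the leading space matters, the third pattern is probably the one to try as the sep argument, though it has not been checked against separate_rows() here.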

Answer 1: (score: 1)

We can use mutate to swap in a different delimiter, and then run separate_rows:

    library(tidyverse)
    rownames_to_column(mydata, 'rn') %>%
      mutate(TEST = sub(";\\s+(?=Column)", ",", TEST, perl = TRUE)) %>%
      separate_rows(TEST, sep = ",")
    #  rn                                                                         TEST
    #1  1                                                            Column1 = answer1
    #2  1 Column2 = answer2;  incorrectly formatted - should be connected with answer2
    #3  2                                                            Column1 = answer1
    #4  2 Column2 = answer2;  incorrectly formatted - should be connected with answer2
    #5  3                                                            Column1 = answer1
    #6  3 Column2 = answer2;  incorrectly formatted - should be connected with answer2
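Two details of this approach are worth a quick sketch: sub() replaces only the first match in each string, and "," works as a temporary delimiter only if the data never contains a comma. The example below is a base R illustration under assumptions that are not in the thread: the input x has three fields and a comma inside an answer, and the "<SEP>" placeholder is an arbitrary choice.

```r
# Hypothetical cell with three well-formed fields, a comma in the data,
# and one stray ";  " fragment that should stay attached to Column2
x <- "Column1 = a, b; Column2 = c;  stray text; Column3 = d"

# Assumed placeholder: any string that cannot occur in the data
placeholder <- "<SEP>"

# gsub() marks every "; " that is followed by "Column";
# sub() would stop after the first one.
marked <- gsub(";\\s+(?=Column)", placeholder, x, perl = TRUE)

# Split on the placeholder literally (fixed = TRUE, so no regex is involved)
pieces <- strsplit(marked, placeholder, fixed = TRUE)[[1]]
print(pieces)
```

Since separate_rows() treats sep as a regular expression, a placeholder without regex metacharacters is the safer choice when carrying this over to the pipeline above.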