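I'm trying to restructure my data, and it gets hung up on certain cell strings, so I want to adjust it so that the delimiter is only the following: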
-End of word
-“;”
-1 space
-Capital letter
I'm new to regex, but this seems to capture what I'm looking for:
";\s[A-Z]"
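However, it also includes the first letter of the second word, which I don't want to be part of the delimiter. I'm also not sure how to incorporate it into my "separate_rows" statement.
library(tidyverse)  # provides %>%, mutate() and separate_rows()
# Create test data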
mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"))
names(mydata) <- "TEST"
mydata$TEST <- as.character(mydata$TEST)
# Convert to 2 columns with a row counter
mydata %>%
  mutate(row = row.names(mydata)) %>%
  separate_rows(TEST, sep = '; ')
Current output:
row|TEST
1|Column1 = answer1
1|Column2 = answer2
1|incorrectly formatted - should be connected with answer2
2|Column1 = answer1
2|Column2 = answer2
2|incorrectly formatted - should be connected with answer2
3|Column1 = answer1
3|Column2 = answer2
3|incorrectly formatted - should be connected with answer2
Output I'm looking for:
row|TEST
1|Column1 = answer1
1|Column2 = answer2; incorrectly formatted - should be connected with answer2
2|Column1 = answer1
2|Column2 = answer2; incorrectly formatted - should be connected with answer2
3|Column1 = answer1
3|Column2 = answer2; incorrectly formatted - should be connected with answer2
Any help is greatly appreciated!
Answer 0 (score: 3)
You can use a positive lookaround (in your case, a lookahead) to solve your problem.
Read: http://www.regular-expressions.info/lookaround.html
library(tidyverse)
mydata <- as.data.frame(c("Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2","Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"))
names(mydata) <- "TEST"
mydata$TEST <- as.character(mydata$TEST)
View(mydata)
mydata %>%
  mutate(row = row.names(mydata)) %>%
  separate_rows(TEST, sep = ';(?=\\s[A-Z])')
Output:
  row TEST
1   1 Column1 = answer1
2   1 Column2 = answer2; incorrectly formatted - should be connected with answer2
3   2 Column1 = answer1
4   2 Column2 = answer2; incorrectly formatted - should be connected with answer2
5   3 Column1 = answer1
6   3 Column2 = answer2; incorrectly formatted - should be connected with answer2
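The regex inside the parentheses, (?=\\s[A-Z]), checks that the pattern follows but does not consume it. Hence the space and the capital letter are never eaten up as part of the match; only the ";" is treated as the delimiter.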
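To make the difference concrete, here is a minimal base R sketch that applies both patterns to one of the test strings with strsplit(). The variable names are only illustrative, and the last variant (moving the space outside the lookahead) is an assumption about what the asker may prefer, not part of the answer above.

```r
# One of the test strings from the question
x <- "Column1 = answer1; Column2 = answer2; incorrectly formatted - should be connected with answer2"

# Consuming pattern from the question: the capital letter is part of
# the delimiter, so the "C" of "Column2" is lost
consumed <- strsplit(x, ";\\s[A-Z]", perl = TRUE)[[1]]

# Lookahead from the answer: only ";" is removed, so the second piece
# keeps its leading space and its capital letter
kept <- strsplit(x, ";(?=\\s[A-Z])", perl = TRUE)[[1]]

# Illustrative variant: consume the space as well, look ahead only
# for the capital letter
space_too <- strsplit(x, ";\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)
print(kept)
print(space_too)
```

With the answer's pattern, the second piece starts with a space; trimws() or the variant pattern would remove it if that matters for later steps.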
Answer 1 (score: 1)
We can use mutate to swap in a different delimiter at the position where the split should happen, and then call separate_rows:
library(tidyverse)
rownames_to_column(mydata, 'rn') %>%
  mutate(TEST = sub(";\\s+(?=Column)", ",", TEST, perl = TRUE)) %>%
  separate_rows(TEST, sep = ",")
# rn TEST
#1 1 Column1 = answer1
#2 1 Column2 = answer2; incorrectly formatted - should be connected with answer2
#3 2 Column1 = answer1
#4 2 Column2 = answer2; incorrectly formatted - should be connected with answer2
#5 3 Column1 = answer1
#6 3 Column2 = answer2; incorrectly formatted - should be connected with answer2
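One thing worth checking with this approach: sub() replaces only the first match in each string, which is enough for the test data because each cell has a single "; Column" boundary. The sketch below uses a hypothetical cell with three fields (not from the original data) to show how gsub() would behave instead. It also assumes that "," never occurs inside the values, since it is used as the temporary delimiter.

```r
# Hypothetical cell with three "ColumnN" fields (not from the question's data)
x <- "Column1 = a; Column2 = b; Column3 = c; stray note"

# sub() rewrites only the first "; " that precedes "Column"
first_only <- sub(";\\s+(?=Column)", ",", x, perl = TRUE)

# gsub() rewrites every "; " that precedes "Column"
all_matches <- gsub(";\\s+(?=Column)", ",", x, perl = TRUE)

# Splitting on the temporary delimiter then yields one piece per field
pieces <- strsplit(all_matches, ",", fixed = TRUE)[[1]]

print(first_only)
print(all_matches)
print(pieces)
```

In both cases the "; stray note" part stays attached to the preceding field, because it is not followed by "Column".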