I have read some good questions about splitting uppercase and lowercase letters (for example this and this), but I could not get them to work with my data.
# here is my data
data <- data.frame(text = c("SOME UPPERCASES And some Lower Cases",
                            "OTHER UPPER CASES And other words",
                            "Some lower cases AND UPPER CASES",
                            "ONLY UPPER CASES",
                            "Only lower cases, maybe",
                            "UPPER lower UPPER!"))
data
text
1 SOME UPPERCASES And some Lower Cases
2 OTHER UPPER CASES And other words
3 Some lower cases AND UPPER CASES
4 ONLY UPPER CASES
5 Only lower cases, maybe
6 UPPER lower UPPER!
The desired result should look like this:
                 V1                      V2
1   SOME UPPERCASES    And some Lower Cases
2 OTHER UPPER CASES         And other words
3   AND UPPER CASES        Some lower cases
4  ONLY UPPER CASES                      NA
5                NA Only lower cases, maybe
6      UPPER UPPER!                   lower
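In other words, separate every word that consists only of uppercase letters from all the other words.
As a test, I tried a few approaches on a single row, but none of them worked well: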
strsplit(x= data$text[1], split="[[:upper:]]") # error
gsub('([[:upper:]])', ' \\1', data$text[1]) # results are not good
library(reshape)
transform(data, FOO = colsplit(data$text[1], split = "[[:upper:]]", names = c('a', 'b'))) # no good results either
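A minimal sketch of why these attempts miss the goal, using only base R. It assumes the `# error` comes from `data$text` being a factor (the default for `data.frame()` before R 4.0), which `strsplit()` does not accept. The variable names below are made up for illustration.

```r
x <- "SOME UPPERCASES And some Lower Cases"

# Splitting on [[:upper:]] treats every single uppercase letter as a
# delimiter, so the letters are consumed and never appear in the pieces.
pieces <- strsplit(x, "[[:upper:]]")[[1]]
has_upper_left <- any(grepl("[[:upper:]]", pieces))

# strsplit() needs a character vector, so a factor is converted first.
# Splitting on whitespace keeps whole words intact.
f <- factor(x)
words <- strsplit(as.character(f), "\\s+")[[1]]

# A word counts as "uppercase" here if it contains no lowercase letter.
is_upper <- !grepl("[[:lower:]]", words)
upper_part <- paste(words[is_upper], collapse = " ")
other_part <- paste(words[!is_upper], collapse = " ")
```

The point is that the test has to be applied per word, not per character: the pattern decides which column a word belongs to rather than where to cut the string.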
Answer 0 (score: 1)
Data:
data <- data.frame(text = c("SOME UPPERCASES And some Lower Cases",
                            "OTHER UPPER CASES And other words",
                            "Some lower cases AND UPPER CASES",
                            "ONLY UPPER CASES",
                            "Only lower cases, maybe",
                            "UPPER lower UPPER!"))
Code:
library(magrittr)
UpperCol <- regmatches(data$text, gregexpr("\\b[A-Z]+\\b", data$text)) %>%
  lapply(paste, collapse = " ") %>% unlist
# the negative lookahead requires perl = TRUE
notUpperCol <- regmatches(data$text, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", data$text, perl = TRUE)) %>%
  lapply(paste, collapse = " ") %>% unlist
result <- data.frame(I(UpperCol), I(notUpperCol))
result[result == ""] <- NA
Result:
#           UpperCol            notUpperCol
#1   SOME UPPERCASES   And some Lower Cases
#2 OTHER UPPER CASES        And other words
#3   AND UPPER CASES       Some lower cases
#4  ONLY UPPER CASES                   <NA>
#5              <NA> Only lower cases maybe
#6       UPPER UPPER                  lower
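To see what the two patterns in this answer match, here is a small sketch on two of the rows. It uses `perl = TRUE` for both calls so that they go through the same regex engine; that is a choice made for this illustration, not something the answer requires for the first pattern.

```r
x <- c("UPPER lower UPPER!", "Only lower cases, maybe")

# Runs of A-Z that are delimited by word boundaries on both sides.
upper_words <- regmatches(x, gregexpr("\\b[A-Z]+\\b", x, perl = TRUE))

# The negative lookahead (?![A-Z]+\\b) rejects a word that is entirely
# uppercase; whatever remains is matched as a run of letters.
other_words <- regmatches(
  x, gregexpr("\\b(?![A-Z]+\\b)[a-zA-Z]+\\b", x, perl = TRUE)
)
```

Both patterns match letters only, so punctuation such as `,` and `!` is never part of a match. That is why rows 5 and 6 of the result differ from the desired output in the question.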
Answer 1 (score: 1)
An approach using the stringi package:
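library(stringi)
# all-uppercase words per row (NA where a row has none)
l1 <- stri_extract_all_regex(data$text, "\\b[A-Z]+\\b")
# the remaining words of each row; SIMPLIFY = FALSE always returns a list
l2 <- mapply(setdiff, stri_extract_all_words(data$text), l1, SIMPLIFY = FALSE)
res <- data.frame(all_upper = sapply(l1, paste, collapse = " "),
                  not_all_upper = sapply(l2, paste, collapse = " "),
                  stringsAsFactors = FALSE)
res[res == "NA"] <- NA  # rows without any all-uppercase word
res[res == ""] <- NA    # rows without any other word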
This gives:
> res
          all_upper          not_all_upper
1   SOME UPPERCASES   And some Lower Cases
2 OTHER UPPER CASES        And other words
3   AND UPPER CASES       Some lower cases
4  ONLY UPPER CASES                   <NA>
5              <NA> Only lower cases maybe
6       UPPER UPPER                  lower
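One thing to keep in mind with this approach: `setdiff()` treats its arguments as sets, so a word that occurs more than once in a row is kept only once. The sketch below shows this in base R with made-up tokens (the vectors `words` and `upper` are hypothetical stand-ins for one row of `stri_extract_all_words()` and `l1`), along with an `%in%` filter that keeps every occurrence.

```r
# Hypothetical tokens of one row, with a repeated lowercase word.
words <- c("UPPER", "lower", "UPPER", "lower")
upper <- c("UPPER", "UPPER")

# setdiff() removes duplicates, so the repeated "lower" collapses to one.
as_set <- setdiff(words, upper)

# Filtering with %in% keeps every occurrence in its original order.
kept <- words[!words %in% upper]
```

None of the six example rows repeats a word outside the uppercase column, so both variants give the same result for the data in the question.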
Answer 2 (score: 1)
separate <- function(x) {
  # split the string into words on whitespace
  x <- unlist(strsplit(as.character(x), "\\s+"))
  # TRUE for every word that contains at least one lowercase letter
  with_lower <- grepl("\\p{Ll}", x, perl = TRUE)
  list(paste(x[!with_lower], collapse = " "), paste(x[with_lower], collapse = " "))
}
do.call(rbind, lapply(data$text, separate))
[,1] [,2]
[1,] "SOME UPPERCASES" "And some Lower Cases"
[2,] "OTHER UPPER CASES" "And other words"
[3,] "AND UPPER CASES" "Some lower cases"
[4,] "ONLY UPPER CASES" ""
[5,] "" "Only lower cases, maybe"
[6,] "UPPER UPPER!" "lower"
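The matrix above uses empty strings where the question asks for `NA`. A possible way to get from this approach to the two-column data frame in the question is sketched below. The helper name `separate_chr` and the column names `V1` and `V2` are chosen here for illustration; the classification rule is the same as in `separate()`.

```r
data <- data.frame(text = c("SOME UPPERCASES And some Lower Cases",
                            "OTHER UPPER CASES And other words",
                            "Some lower cases AND UPPER CASES",
                            "ONLY UPPER CASES",
                            "Only lower cases, maybe",
                            "UPPER lower UPPER!"),
                   stringsAsFactors = FALSE)

# Same rule as separate(), but returns a named character vector of length 2.
separate_chr <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  with_lower <- grepl("\\p{Ll}", words, perl = TRUE)
  c(V1 = paste(words[!with_lower], collapse = " "),
    V2 = paste(words[with_lower], collapse = " "))
}

# vapply() yields a 2 x n matrix; transpose it to one row per input string.
parts <- t(vapply(data$text, separate_chr, character(2), USE.NAMES = FALSE))
res <- data.frame(V1 = parts[, "V1"], V2 = parts[, "V2"],
                  stringsAsFactors = FALSE)

# Empty strings mean "no word of that kind in this row".
res[res == ""] <- NA
res
```

Because words are split on whitespace only, punctuation stays attached to its word, so `UPPER!` and `cases,` come out as in the desired result. A token without any letters, such as a number, would land in `V1` under this rule.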