Given the following string:
"I may opt for a yam for Amy, May, and Tommy."
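How can I remove the non-letter characters, convert all letters to lowercase, and sort the letters within each word in R?
At the same time, I am trying to sort the words in the sentence and remove duplicates.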
Answer 0 (score: 5)
You can use stringi:
library(stringi)
txt <- "I may opt for a yam for Amy, May, and Tommy."
unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE))))
which gives:
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
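Update

As @DavidArenburg pointed out, I overlooked the "sort the letters within each word" part of the question. You did not provide the desired output, and no immediate application comes to mind, but assuming you want to identify which words have matching counterparts (a string distance of 0):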
library(stringdist)  # stringdistmatrix()
library(magrittr)    # %>%

unique(stri_sort(stri_trans_tolower(stri_extract_all_words(txt, simplify = TRUE)))) %>%
  stringdistmatrix(., ., useNames = "strings", method = "qgram") %>%

#       a amy and for i may opt tommy yam
# a     0   2   2   4 2   2   4     6   2
# amy   2   0   4   6 4   0   6     4   0
# and   2   4   0   6 4   4   6     8   4
# for   4   6   6   0 4   6   4     6   6
# i     2   4   4   4 0   4   4     6   4
# may   2   0   4   6 4   0   6     4   0
# opt   4   6   6   4 4   6   0     4   6
# tommy 6   4   8   6 6   4   4     0   4
# yam   2   0   4   6 4   0   6     4   0

  apply(., 1, function(x) sum(x == 0, na.rm=TRUE))

#     a   amy   and   for     i   may   opt tommy   yam
#     1     3     1     1     1     3     1     1     3
Words with more than one 0 per row ("amy", "may", "yam") have scrambled counterparts.
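The q-gram distance above is 0 when two words contain exactly the same letters, which is the same condition as having identical sorted letters. As a rough base-R sketch of that idea (not the stringdist approach itself), the words can be grouped by a sorted-letter key. The helper name letter_key is hypothetical and used only for illustration:

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Keep only letters and spaces, lowercase, split on spaces, drop duplicates
words <- unique(strsplit(tolower(gsub("[^A-Za-z ]", "", txt)), " +")[[1]])

# Hypothetical helper: the letters of a word in sorted order ("may" -> "amy")
letter_key <- function(word) paste(sort(strsplit(word, "")[[1]]), collapse = "")
keys <- vapply(words, letter_key, character(1))

# Group words that share a key and keep groups with more than one member
groups <- split(words, keys)
anagrams <- Filter(function(g) length(g) > 1, groups)
anagrams
```

With this input, the only group that survives the filter is the one keyed "amy", which holds "may", "yam" and "amy".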
Answer 1 (score: 4)
str <- "I may opt for a yam for Amy, May, and Tommy."
## Clean the words (just keep letters and convert to lowercase)
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", str)), " ")[[1]]
## split the words into characters and sort them
sortedWords <- sapply(words, function(word) sort(unlist(strsplit(word, ""))))
## Join the sorted letters back together
sapply(sortedWords, paste, collapse="")
#       i     may     opt     for       a     yam     for     amy     may     and
#     "i"   "amy"   "opt"   "for"     "a"   "amy"   "for"   "amy"   "amy"   "adn"
#   tommy
# "mmoty"
## If you want to convert result back to string
do.call(paste, lapply(sortedWords, paste, collapse=""))
# [1] "i amy opt for a amy for amy amy adn mmoty"
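One caveat with the sapply() steps above: sapply() returns a list here only because the words differ in length. If every word had the same length (for example "may yam"), it would simplify to a matrix, and the later paste step would then run on single letters. A possible variant, assuming the same cleaning rules, pastes inside a single vapply() call so the result is always a character vector. The function name sort_letters is hypothetical:

```r
txt <- "I may opt for a yam for Amy, May, and Tommy."

# Hypothetical helper: clean the string, then sort the letters within each word
sort_letters <- function(s) {
  # Keep letters and spaces only, lowercase, split on runs of spaces
  words <- strsplit(tolower(gsub("[^A-Za-z ]", "", s)), " +")[[1]]
  words <- words[nzchar(words)]  # drop empty tokens from leading spaces
  # Sort and re-join the letters of each word; always returns a character vector
  vapply(words,
         function(w) paste(sort(strsplit(w, "")[[1]]), collapse = ""),
         character(1), USE.NAMES = FALSE)
}

sorted <- sort_letters(txt)
paste(sorted, collapse = " ")
# [1] "i amy opt for a amy for amy amy adn mmoty"

sort_letters("may yam")
# [1] "amy" "amy"
```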
Answer 2 (score: 4)
stringr lets you work with all character sets in R at C speed, and magrittr lets you use a piping idiom that fits your needs:
library(stringr)
library(magrittr)
txt <- "I may opt for a yam for Amy, May, and Tommy."
txt %>%
str_to_lower %>% # lowercase
str_replace_all("[[:punct:][:digit:][:cntrl:]]", "") %>% # only alpha
str_replace_all("[[:space:]]+", " ") %>% # single spaces
str_split(" ") %>% # tokenize
extract2(1) %>% # str_split returns a list
sort %>% # sort
unique # unique words
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"
Answer 3 (score: 4)
The qdap package that I maintain has the bag_o_words function, which works well for this:
txt <- "I may opt for a yam for Amy, May, and Tommy."
library(qdap)
unique(sort(bag_o_words(txt)))
## [1] "a" "amy" "and" "for" "i" "may" "opt" "tommy" "yam"