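I am trying to write a program that uses regular expressions to clean up some data. Suppose I have room names that contain a letter and a number. In the final output, I need each room name to follow the pattern "full string (excluding letter and number) + letter + number", as in the example below. However, the regular expressions I have written so far give very messy results, shown at the bottom of this post. For some reason they put letters and numbers on some lines even when the input data has none. Thanks.
Edit: I have edited the input data. I want to generalize the code to handle any number of words in the room name, not just the word "ROOM".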
# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2
# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
" ATLANTA ROOM 3",
"NEW YORK A ROOM 2",
"4 ROOM A",
"THE BIG AWESOME ROOM B",
" ROOM 4 B",
"GEORGETOWN B 2 ROOM ",
" C NEW YORK ROOM 2",
"NEW YORK ROOM C",
"LOS ANGELES ROOM 2 E")
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
(dd2 <- paste(gsub("( +)", " ",
gsub("(^ +)|( +$)", "",
gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))
# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4",
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3",
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"
Answer 0 (score: 4)
Here is an attempt:
sub(' $', '', # clean up spaces at the end
gsub(' +', ' ', # clean up double spaces
# rearrange letter and numbers
sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
gsub(' |ROOM', '', dd) # remove spaces and ROOM
)
)
)
#[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C" "ROOM E 2"
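To see what each layer of the nested call contributes, the steps can be unrolled into named intermediates. This is only a sketch: the vector `rooms` and the intermediate names are made up for illustration, and the inputs contain only the word ROOM, which is what this first version assumes.

```r
# Hypothetical inputs: only the word ROOM plus an optional letter and number
rooms <- c("ROOM ", " ROOM 3", "A ROOM 2", "4 ROOM A")

# 1. drop spaces and the word ROOM, leaving just the letter and/or number
stripped <- gsub(" |ROOM", "", rooms)

# 2. the letter may sit before or after the number; emit letter, then number
rearranged <- sub("^([A-Z]?)([0-9]*)([A-Z]?)$", "ROOM \\1\\3 \\2", stripped)

# 3. collapse repeated spaces, then drop a trailing space
cleaned <- sub(" $", "", gsub(" +", " ", rearranged))

print(stripped)  # "" "3" "A2" "4A"
print(cleaned)   # "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4"
```

Because step 1 deletes the literal word ROOM, any other word in the input (such as ATLANTA) would survive into step 2 and the rearranging pattern would no longer match.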
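Here is the same logic adapted to the edited OP and the comments (assuming the room name consists of words with at least 3 letters and the room letter has at most 2 letters):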
gsub('(^ | $)', '', # clean up spaces in front or end
gsub(' +', ' ', # clean up double spaces
# extract room name and put it in front of the letter and number
paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
gsub(' |\\w\\w\\w+', '', dd) # remove spaces and words
)
)
)
)
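The same assumption (name words have at least 3 letters, the room letter has at most 2) can also be expressed without nested substitutions, by splitting each string into tokens and sorting them into three groups. The function name `clean_room` is hypothetical, and this is a sketch of an alternative approach rather than the answer's own method.

```r
dd <- c("ATLANTA ROOM ", " ATLANTA ROOM 3", "NEW YORK A ROOM 2", "4 ROOM A",
        "THE BIG AWESOME ROOM B", " ROOM 4 B", "GEORGETOWN B 2 ROOM ",
        " C NEW YORK ROOM 2", "NEW YORK ROOM C", "LOS ANGELES ROOM 2 E")

# Hypothetical helper: reorder tokens as name words, then letter, then number
clean_room <- function(x) {
  tokens <- strsplit(trimws(x), " +")   # split on one or more spaces
  vapply(tokens, function(tok) {
    is_num    <- grepl("^[0-9]+$", tok)      # purely numeric token
    is_letter <- grepl("^[A-Z]{1,2}$", tok)  # assumed: room letter is 1-2 letters
    paste(c(tok[!is_num & !is_letter], tok[is_letter], tok[is_num]),
          collapse = " ")
  }, character(1))
}

result <- clean_room(dd)
print(result)
```

A name word of only one or two letters would be treated as a room letter here, so the sketch shares the limitation stated above.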
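Answer 1 (score: 2)
What is happening is that your program finds only 8 letters: regmatches() returns just the matches, so instead of getting "" or NA for the rows without a match, paste() recycles the shorter vector.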
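A small illustration of that recycling, using made-up inputs (the vector `x` and the labels `r1` to `r3` are hypothetical):

```r
x <- c("ROOM A 2", "ROOM 3", "ROOM C")   # the second row has no letter
m_char <- regexpr("\\<[A-E]\\>", x)

# regmatches() drops the rows without a match, so only 2 values come back
hits <- regmatches(x, m_char)

# paste() silently reuses the 2 values to fill 3 slots, misaligning the rows
recycled <- paste(c("r1", "r2", "r3"), hits)

print(hits)      # "A" "C"
print(recycled)  # "r1 A" "r2 C" "r3 A"
```

Row 2 receives the letter that belongs to row 3, and row 3 wraps around to the first letter again, which is how letters end up on lines that never had one.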
Here is a fix:
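# Character classes keep the word boundaries on every alternative; in
# "(\\<A|B|C|D|E\\>)" the \\< applies only to A and the \\> only to E
m_char_num <- regexpr("\\<[A-E1-4]\\>", dd)
m_char <- regexpr("\\<[A-E]\\>", dd)
m_num <- regexpr("\\<[1-4]\\>", dd)
# Start from empty strings so rows without a match stay aligned with dd
numbers <- rep("", length(dd))
numbers[m_num > 0] <- regmatches(dd, m_num)
letters <- rep("", length(dd))
letters[m_char > 0] <- regmatches(dd, m_char)
# trimws() is base R; also collapse the double space left when there is no letter
output <- trimws(gsub(" +", " ", paste("ROOM", letters, numbers)))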
[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
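Separately from the recycling, the pattern in the question, `(\\<A|B|C|D|E|1|2|3|4\\>)`, does not mean "a standalone A to E or 1 to 4". Alternation has the lowest precedence, so `\\<` belongs only to `A` and `\\>` only to `4`, while `B`, `C`, `D`, `E`, `1`, `2` and `3` match anywhere, including inside words. The sketch below compares it with a character-class version; the names `buggy`, `fixed` and `s` are only for illustration.

```r
buggy <- "(\\<A|B|C|D|E|1|2|3|4\\>)"  # pattern from the question
fixed <- "\\<[A-E1-4]\\>"             # boundaries apply to the whole class

s <- "THE BIG AWESOME ROOM B"
buggy_out <- gsub(buggy, "", s)
fixed_out <- gsub(fixed, "", s)

print(buggy_out)  # "TH IG WSOM ROOM "
print(fixed_out)  # "THE BIG AWESOME ROOM "
```

This accounts for results such as "TH IG WSOM" and "TLANTA" in the question's actual output.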
Answer 2 (score: 0)
Try this:
library(gsubfn)
# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-D]|$"), paste, collapse = "")
# put back together and sort
out <- sort(paste("ROOM", char, num))
# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))
> out
[1] "ROOM" "ROOM 2" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B"
[7] "ROOM B 2" "ROOM B 4" "ROOM C" "ROOM C 2"
Update: minor improvement
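For readers who prefer not to depend on gsubfn, a roughly equivalent extraction can be sketched in base R with gregexpr() and regmatches(). The helper name `extract_all` and the sample vector `rooms` are hypothetical, and the letter pattern uses word boundaries, which is an assumption that goes beyond the `[A-D]` pattern above.

```r
rooms <- c("ROOM A 2", "ROOM 3", "ROOM")   # hypothetical inputs

# Hypothetical helper: concatenate every match per string; "" when none
extract_all <- function(x, pattern) {
  vapply(regmatches(x, gregexpr(pattern, x)), paste, character(1),
         collapse = "")
}

num  <- extract_all(rooms, "[0-9]")        # digits
char <- extract_all(rooms, "\\<[A-E]\\>")  # standalone room letter

# put back together, then trim and collapse spaces
out <- gsub(" +", " ", trimws(paste("ROOM", char, num)))
print(out)  # "ROOM A 2" "ROOM 3" "ROOM"
```

Since strings without a match yield an empty string here, the result keeps one element per input and paste() has nothing to recycle.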