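I have a text file that is a few hundred lines long. I am trying to strip all punctuation characters except the "/" character. I am currently using the strip function from the qdap package.
Here is a sample dataset: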
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
"{fonttblf0fswissfcharset0 helvetica",
"margl1440margr1440vieww9000viewh8400viewkind0")
Here is the code:
strip(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)
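The only problem with this otherwise lovely function is that it removes the "/" character anyway. For comparison, here is the same call asking it to strip everything except the "{" character: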
strip(htxt, char.keep = "{", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)
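Has anyone run into the same problem?
Answer 0 (score: 1)
For whatever reason, it seems that qdap:::strip always strips "/" from the character vector. This line sits near the end of the function's source code: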
x <- clean(gsub("/", " ", gsub("-", " ", x)))
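It runs before the inner helper that actually does the stripping, which is defined in the body of strip, so the "/" is already gone by the time char.keep is applied.
So just replace the function with your own version: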
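The effect of that pre-processing line can be seen in isolation with base R alone. This is a minimal sketch: it reproduces only the two nested gsub calls quoted above and leaves out clean(), and the variable names x and pre are made up for illustration.

```r
# One element in the style of the sample data, ending in "/"
x <- "cocoasubrtf360/"

# The same two substitutions quoted from the strip source:
# "-" and "/" are each replaced by a space
pre <- gsub("/", " ", gsub("-", " ", x))

print(pre)                      # the slash has already become a space
grepl("/", pre, fixed = TRUE)   # FALSE: no "/" is left for char.keep to keep
```

Because the substitution happens before any punctuation is stripped, no value of char.keep can bring the "/" back.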
library(qdap)  # provides Trim(), which the function below relies on

strip.new <- function(x, char.keep = "~~", digit.remove = TRUE,
                      apostrophe.remove = TRUE, lower.case = TRUE) {
  # Inner helper: strips punctuation from a single string
  strp <- function(x, digit.remove, apostrophe.remove, char.keep,
                   lower.case) {
    if (!is.null(char.keep)) {
      # Keep apostrophes, the characters in char.keep, and all non-punctuation
      x2 <- Trim(gsub(paste0(".*?($|'|", paste(paste0("\\",
        char.keep), collapse = "|"), "|[^[:punct:]]).*?"),
        "\\1", as.character(x)))
    } else {
      x2 <- Trim(gsub(".*?($|'|[^[:punct:]]).*?", "\\1",
        as.character(x)))
    }
    if (lower.case) {
      x2 <- tolower(x2)
    }
    if (apostrophe.remove) {
      x2 <- gsub("'", "", x2)
    }
    ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
  }
  # Unlike qdap's strip, "/" and "-" are not turned into spaces beforehand
  unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
    apostrophe.remove = apostrophe.remove, char.keep = char.keep,
    lower.case = lower.case))))
}
strip.new(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)
#[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
#[2] "fonttblf0fswissfcharset0 helvetica"
#[3] "margl1440margr1440vieww9000viewh8400viewkind0"
The package author is very active on this site, so he can probably clear up why strip does this by default.
Answer 1 (score: 1)
Why not:
> gsub("[^/]", "", htxt)
[1] "/" "" ""
Given @SimonO101's clarification, a regex approach might be:
gsub("[]!\"#$%&'()*+,.:;<=>?@[^_`{|}~-]", "", htxt)
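Note that the first item in that sequence is "]" and the last is "-", and that the double quote needs to be escaped. This is the set that [:punct:] targets with "/" left out (the backslash is not listed either). To do this programmatically, you can use: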
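rem.some.punct <- function(txt, notpunct = NULL) {
  punctstr <- "[]!\"#$%&'()*/+,.:;<=>?@[^_`{|}~-]"
  # Drop the character to keep from the class; fixed = TRUE treats it
  # literally, so values such as "." or "*" are not read as regex syntax
  rempunct <- if (is.null(notpunct)) punctstr else
    gsub(notpunct, "", punctstr, fixed = TRUE)
  gsub(rempunct, "", txt)
}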
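As an alternative to editing the character class by hand, a Perl-style negative lookahead can exclude "/" from [[:punct:]]. This is a sketch of a different technique, not what either answer above uses; it assumes perl = TRUE (PCRE) and re-declares the question's sample data so that it runs on its own. The name res is made up for illustration.

```r
# Sample data from the question
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
          "{fonttblf0fswissfcharset0 helvetica",
          "margl1440margr1440vieww9000viewh8400viewkind0")

# (?!/) refuses to match at a "/", so [[:punct:]] removes every
# punctuation character except the slash
res <- gsub("(?!/)[[:punct:]]", "", htxt, perl = TRUE)
print(res)
```

On this sample the "{" characters are removed and the trailing "/" in the first element is kept. Unlike strip, this does not change case or trim whitespace.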