Question

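I have a text file that is a few hundred lines long. I am trying to remove all [edit: add] punctuation characters except the "/" character. I am currently using the strip function from the qdap package.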

Here is a sample dataset:

htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/", 
        "{fonttblf0fswissfcharset0 helvetica",
        "margl1440margr1440vieww9000viewh8400viewkind0")

Here is the code:

strip(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

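The only problem with this otherwise nice function is that it removes the "/" character even though I asked to keep it. If I instead try to remove everything except the "{" character: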

strip(htxt, char.keep = "{", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

Has anyone run into the same problem?

Answer 1

For whatever reason, it seems that qdap:::strip always strips "/" from character vectors. This line is in the source code at the end of the function:

x <- clean(gsub("/", " ", gsub("-", " ", x)))

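It runs before the inner function that actually does the stripping, which is defined in the body of strip as strp, so any "/" has already been turned into a space by the time char.keep is applied.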
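The effect of that line can be reproduced in isolation with base R. This is only a minimal sketch: it leaves out qdap's clean() wrapper, and the sample string is made up for illustration.

```r
# Hypothetical input, not taken from the question
x <- "path/to-file"

# Same nested gsub() calls as in the quoted line, without clean():
# hyphens and slashes are both replaced by spaces
res <- gsub("/", " ", gsub("-", " ", x))
print(res)
# [1] "path to file"
```

Since the slash is gone before the punctuation filter runs, no char.keep setting can bring it back.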

So simply replace the function with your own version:

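# Copy of qdap's strip() without the line that replaces "/" and "-".
# Trim() is a qdap helper, so qdap must be loaded: library(qdap)
strip.new <- function (x, char.keep = "~~", digit.remove = TRUE, apostrophe.remove = TRUE, 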
    lower.case = TRUE) 
{
    strp <- function(x, digit.remove, apostrophe.remove, char.keep, 
        lower.case) {
        if (!is.null(char.keep)) {
            x2 <- Trim(gsub(paste0(".*?($|'|", paste(paste0("\\", 
                char.keep), collapse = "|"), "|[^[:punct:]]).*?"), 
                "\\1", as.character(x)))
        }
        else {
            x2 <- Trim(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", 
                as.character(x)))
        }
        if (lower.case) {
            x2 <- tolower(x2)
        }
        if (apostrophe.remove) {
            x2 <- gsub("'", "", x2)
        }
        ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", 
            x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove, char.keep = char.keep, 
        lower.case = lower.case))))
}

strip.new(htxt, char.keep = "/", digit.remove = F, apostrophe.remove = TRUE, lower.case = TRUE)

#[1] "rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/"
#[2] "fonttblf0fswissfcharset0 helvetica"            
#[3] "margl1440margr1440vieww9000viewh8400viewkind0"

The package author is quite active on this site, so he may be able to clear up why strip does this by default.

Answer 2

Why not:

> gsub("[^/]", "", htxt)
[1] "/" ""  ""

Given @SimonO101's clarification, a regex approach might be:

gsub("[]!\"#$%&'()*+,.:;<=>?@[^_`{|}~-]", "", htxt)

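Note that the first item in that sequence is "]", the last item is "-", and the double quote needs to be escaped. This is the [:punct:] set with "/" removed (the backslash is left out as well). To do this programmatically, you could use: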

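rem.some.punct <- function(txt, notpunct = NULL) {
    # Bracket expression listing the punctuation characters to strip
    punctstr <- "[]!\"#$%&'()*/+,.:;<=>?@[^_`{|}~-]"
    # Drop the character to keep from the list. notpunct is used as a regex here,
    # so pass a plain literal such as "/" (avoid "]", "-" and regex metacharacters)
    rempunct <- gsub(paste0("", notpunct), "", punctstr)
    gsub(rempunct, "", txt)
}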
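As an alternative sketch (not part of the original answers), a negative lookahead can exclude "/" from the [[:punct:]] class, which avoids spelling out every punctuation character by hand. It assumes perl = TRUE so that lookaheads are available, and it reuses the htxt vector from the question.

```r
htxt <- c("{rtf1ansiansicpg1252cocoartf1038cocoasubrtf360/",
          "{fonttblf0fswissfcharset0 helvetica",
          "margl1440margr1440vieww9000viewh8400viewkind0")

# (?!/) rejects a match when the next character is "/",
# so every other punctuation character is removed
keep_slash <- gsub("(?!/)[[:punct:]]", "", htxt, perl = TRUE)
print(keep_slash)
```

Unlike strip, this does not lower-case the text or remove digits; those steps would have to be added separately with tolower() and another gsub() call.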

strip function in R's qdap package incorrectly removes the slash