Question

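I am trying to modify a stemming function so that it 1) removes the hyphens in http links (which appear in the corpus) while 2) preserving the hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive, etc.). I asked a similar question a few months ago in another question thread, and the code looks like this: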

# load stringr to use str_replace_all
require(stringr)

clean.text = function(x)
{
  # remove rt
  x = gsub("rt ", "", x)
  # remove at
  x = gsub("@\\w+", "", x)
  x = gsub("[[:punct:]]", "", x)
  x = gsub("[[:digit:]]", "", x)
  # remove http
  x = gsub("http\\w+", "", x)
  x = gsub("[ |\t]{2,}", "", x)
  x = gsub("^ ", "", x)
  x = gsub(" $", "", x)
  x = str_replace_all(x, "[^[:alnum:][:space:]'-]", " ")
  #return(x)
}

# example
my_text <- "accident-prone"
new_text <- clean.text(my_text)
new_text
[1] "accidentprone"

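I did not get a satisfactory answer, and I then shifted my attention to other projects until I resumed work on this one. It appears that the "[^[:alnum:][:space:]'-]" in the last line of the code block is the culprit that also removes - from the non-http part of the corpus.

I cannot figure out how to achieve the desired output, and I would greatly appreciate it if someone could offer some insight.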

Answer 1

The actual culprit is the [[:punct:]] removal pattern, because it matches - anywhere in the string.
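A quick check on the question's own sample string suggests which step drops the hyphen. This is only an illustrative sketch; the variable names after_punct and after_final are made up for the example.

```r
x <- "accident-prone"

# [[:punct:]] includes "-", so this step strips the hyphen.
after_punct <- gsub("[[:punct:]]", "", x)

# The final pattern lists "-" among the characters to keep,
# so on its own it leaves the hyphen untouched.
after_final <- gsub("[^[:alnum:][:space:]'-]", " ", x)

print(after_punct)  # "accidentprone"
print(after_final)  # "accident-prone"
```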

You may use

clean.text <- function(x)
{
  # remove rt
  x <- gsub("rt\\s", "", x)
  # remove at
  x <- gsub("@\\w+", "", x)
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE)
  x <- gsub("[[:digit:]]+", "", x)
  # remove http
  x <- gsub("http\\w+", "", x)
  x <- gsub("\\h{2,}", "", x, perl=TRUE)
  x <- trimws(x)
  x <- gsub("[^[:alnum:][:space:]'-]", " ", x)
  return(x)
}

Then

my_text <- "  accident-prone  http://www.some.com  rt "
new_text <- clean.text(my_text)
new_text 
## => [1] "accident-prone"

See the R demo.
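One case the sample string does not exercise is a URL that itself contains a hyphen between word characters, such as http://my-site.com. The hyphen-preserving step would keep that hyphen, and http\\w+ would then stop at it and probably leave a fragment behind. The sketch below is an assumption-laden variant, not the function from this answer: the name clean_text_urls_first is hypothetical, it drops whole URLs with http\\S+ before any punctuation is touched, it collapses whitespace runs to a single space instead of deleting them, and it leaves out the rt, @ and digit steps to keep the example short.

```r
# Hypothetical variant: remove URLs first, then strip punctuation.
clean_text_urls_first <- function(x) {
  # Drop each URL as a whole while "://", "." and "-" are still in place.
  x <- gsub("http\\S+", "", x, perl = TRUE)
  # Remove punctuation, but skip hyphens that sit between word characters.
  x <- gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
  # Collapse runs of horizontal whitespace to one space (a deliberate
  # change from the answer, which replaces them with an empty string).
  x <- gsub("\\h{2,}", " ", x, perl = TRUE)
  trimws(x)
}

clean_text_urls_first("time-consuming http://my-site.com/a-b done")
## => [1] "time-consuming done"
```

Whether URLs should be removed first depends on the corpus; if the links never contain hyphens, the original order gives the same result.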

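Notes:

x = gsub("^ ", "", x) and x = gsub(" $", "", x) can be replaced with trimws(x).
gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl=TRUE) removes all punctuation except hyphens that sit between word characters (you can adjust this further in the part before (*SKIP)(*F)).
gsub("[^[:alnum:][:space:]'-]", " ", x) is the base R equivalent of str_replace_all(x, "[^[:alnum:][:space:]'-]", " ").
gsub("\\h{2,}", "", x, perl=TRUE) removes any run of 2 or more horizontal whitespace characters. If by "[ |\t]{2,}" you meant to match any 2 or more whitespace characters, use \\s instead of \\h here.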
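The second and fourth notes can be checked in isolation. The following is a minimal sketch; the helper keep_inner_hyphens and the test strings are made up for illustration and are not part of the answer's function.

```r
# Hypothetical helper wrapping only the punctuation step.
keep_inner_hyphens <- function(x) {
  # "\\b-\\b" matches a hyphen with a word character on each side;
  # (*SKIP)(*F) discards that match, so only the remaining
  # punctuation is matched by [[:punct:]] and removed.
  gsub("\\b-\\b(*SKIP)(*F)|[[:punct:]]", "", x, perl = TRUE)
}

inner <- keep_inner_hyphens("cost-prohibitive, really!")
outer <- keep_inner_hyphens("- leading and trailing -")
print(inner)  # "cost-prohibitive really"
print(outer)  # " leading and trailing " (free-standing hyphens are removed)

# \\h matches only horizontal whitespace (spaces, tabs);
# \\s also matches line breaks.
s <- "a \t b\n\nc"
h_only <- gsub("\\h{2,}", "", s, perl = TRUE)
s_all  <- gsub("\\s{2,}", "", s, perl = TRUE)
print(h_only)  # "ab\n\nc"
print(s_all)   # "abc"
```

Note that replacing whitespace runs with an empty string joins the neighbouring words, as the last two results show; use " " as the replacement if the words should stay separate.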
