Text analysis in R using custom keywords

Date: 2015-04-09 14:21:43

Tags: r corpus text-analysis

I am trying to use R's tm package to analyze my text data.

Right now my corpus of data takes the following form:

1. The sports team practiced today
2. The soccer team went took the day off

The data would then be vectorized as:

<the, sports, team, practiced, today, soccer, went, took, off>
1.  <1, 1, 1, 1, 1, 0, 0, 0, 0>
2.  <1, 0, 1, 0, 0, 1, 1, 1, 1>

I would prefer to use a custom set of phrases for my vectors, for example:

<sports team, soccer team, practiced today, day off>
1. <1, 0, 1, 0>
2. <0, 1, 0, 1>

Is there a package or function in R that can do this? Or is there another open-source resource with similar functionality? Thanks.

2 Answers:

Answer 0 (score: 2)

You asked about other text packages, so you are welcome to try quanteda, developed by Paul Nulty.

In the code below, you first define the multi-word phrases you want as a named list, typed into a quanteda "dictionary" class object using the dictionary() constructor. You then use phrasetotoken() to convert those phrases in your texts into single "tokens", each consisting of the phrase words joined by an underscore. The tokenizer ignores underscores, so your phrases are treated as single-word tokens.

dfm() is the constructor for a document-feature matrix; it can take a regular expression defining which features to keep, here any phrase containing the underscore character (the regular expression could of course be improved, but I have kept it deliberately simple here). dfm() has many options; see ?dfm.

install.packages("quanteda")
library(quanteda)

mytext <- c("The sports team practiced today",
            "The soccer team went took the day off")
myphrases <- dictionary(list(myphrases = c("sports team", "soccer team",
                                           "practiced today", "day off")))
mytext2 <- phrasetotoken(mytext, myphrases)
mytext2
## [1] "The sports_team practiced_today"       "The soccer_team went took the day_off"

# keptFeatures is a regular expression: keep only the phrases
mydfm <- dfm(mytext2, keptFeatures = "_", verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 4 features.
## 2 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    day_off practiced_today soccer_team sports_team
##   text1       0               1           0           1
##   text2       1               0           1           0

Happy to solve any quanteda-related problems for you, including feature requests if you can suggest improvements to the phrase handling.

Answer 1 (score: 0)

How about something like this?

library(tm)

text <- c("The sports team practiced today", "The soccer team went took the day off")

corpus <- Corpus(VectorSource(text))

tokenizing.phrases <- c("sports team", "soccer team", "practiced today", "day off")  

phraseTokenizer <- function(x) {
  require(stringr)

  x <- as.character(x) # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")
  # case-insensitive literal match of each custom phrase against the text
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once on the first hit, so we don't have to worry about
    # multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    # recursive: phraseTokenizer() calls itself on the text on either side of the phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)
  }

  # drop any extraneous empty strings, which can occur if a phrase
  # appears just before punctuation
  out[out != ""]
}


tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))

> Terms(tdm)
[1] "day off"         "practiced today" "soccer team"     "sports team"     "the"             "took"           
[7] "went"