How to write a custom control function for text mining in R

Asked: 2015-04-16 11:57:42

Tags: r nlp tm

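I have a text data set covering different topics. It is badly written, with many words stuck together. I found a way to break these "long terms" into proper words, and I created a data table like this (showing only the rows useful for this example):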

>wordMap[1:4]
                longTerm                    words
1:        linedconcealed          lined,concealed
2:     jacketcontrasting       jacket,contrasting
3:     spandexspecialist       spandex,specialist
4: inchescmmidweightsemi inchescm,mid,weight,semi

Keep in mind that each entry of the words column is itself a list of words:

>class(wordMap$words)
"list"
>wordMap$words[1]
[[1]]
[1] "lined"     "concealed"
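
To make the structure concrete, a table of this shape could be rebuilt as follows. This is only a minimal sketch: it uses a base R data.frame instead of the data table shown above (an assumption, made so the snippet needs no packages), and the four rows are copied from the printout.

```r
# Lookup table with one character column and one list column.
# Assumption: a plain data.frame stands in for the original data table.
wordMap <- data.frame(
  longTerm = c("linedconcealed", "jacketcontrasting",
               "spandexspecialist", "inchescmmidweightsemi"),
  stringsAsFactors = FALSE
)

# Each element of the list column holds the words for one long term.
wordMap$words <- list(
  c("lined", "concealed"),
  c("jacket", "contrasting"),
  c("spandex", "specialist"),
  c("inchescm", "mid", "weight", "semi")
)

class(wordMap$words)   # "list"
wordMap$words[[1]]     # "lined" "concealed"
```

The `$` and `[[` accessors used in the rest of the question behave the same way on this stand-in.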

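Using this data table, I want to create a control function to pass to the tm_map function in the tm package. Based on the post "custom removePunctuation", I built this function: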

>mapWords <- function(x) UseMethod("mapWords", x)
>mapWords.PlainTextDocument <- mapWords.character <- function(x) {

    if (x %in% wordMap$longTerm) {
        i <- which(x == wordMap$longTerm)
        x=wordMap$words[[i]]
    } 

    return(x)
}

I tried this function on the following corpus:

>testC[[1]]
<<PlainTextDocument (metadata: 7)>>
alice + olivia black wool blend jacketcontrasting weave, padded shoulders, leather trims, fully linedconcealed hook fastenings through frontfabric % wool, % cotton; fabric % leather; fabric % rayon, % nylon, % elastane; lining % polyester, % spandexspecialist cleanlength shoulder to hem inchescmmidweightsemi fitted stylethis style runs true to sizemodel is '"cm and wears a size small

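It contains four "longTerms", which are the four examples included in the data table above (so the problem is reproducible). When I run tm_map, I get the following warning: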

>tm_map(testC, mapWords) 

Warning message:
In if (x %in% wordMap$longTerm) { :
  the condition has length > 1 and only the first element will be used

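That is clear enough: the condition has more than one element. However, I thought the map function worked on a "token basis", passing tokens to the control function one by one. My custom control functions usually use gsub or something similar, so they are coded differently from the mapWords function above. Here is an example of a control function that works: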

rmRepeatLetters <- function(str) gsub('([[:alpha:]])\\1{2,}', '\\1', str)
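
The difference between the two functions seems to come down to vectorization: gsub accepts a whole character vector and returns one result per element, while if() expects a single TRUE or FALSE. The sketch below illustrates this with a hypothetical two-element input (`doc` and its contents are made up for the example; tm_map applies the function to each whole document, not to one token at a time).

```r
# Two of the long terms from the table in the question.
longTerms <- c("linedconcealed", "jacketcontrasting")

# Hypothetical input with more than one element.
doc <- c("fully linedconcealed hook", "jacketcontrasting")

# %in% returns one logical per element of its left-hand side,
# so the result cannot be used directly as an if() condition.
cond <- doc %in% longTerms
length(cond)   # 2
cond           # FALSE TRUE

# gsub-based functions cope with the same input,
# because gsub processes every element of the vector.
rmRepeatLetters <- function(str) gsub('([[:alpha:]])\\1{2,}', '\\1', str)
cleaned <- rmRepeatLetters(c("sooooft wool", "greeeat fit"))
length(cleaned)  # 2, one string per input element
```

Note that in R 4.2.0 and later, an if() condition of length greater than one is an error rather than a warning.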

In short, I would like to know how to code the mapWords function so that it can be passed to tm_map.

Note that I can do this exercise directly on the text:

>testT <- as.character(testC[[1]])
>terms <- paste(as.character(unlist(sapply(unlist(strsplit(testT, " ")), function(x) mapWords(x)))), collapse = " ")
>terms

"alice + olivia black wool blend jacket contrasting weave, padded shoulders, leather trims, fully lined concealed hook fastenings through frontfabric % wool, % cotton; fabric % leather; fabric % rayon, % nylon, % elastane; lining % polyester, % spandex specialist cleanlength shoulder to hem inchescm mid weight semi fitted stylethis style runs true to sizemodel is '\"cm and wears a size small"

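However, I would like to know whether there is a way to build a custom control function that breaks up the long terms within the corpus itself.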
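
One direction that might work is a gsub-based function that operates on the whole text at once, in the style of rmRepeatLetters. The following is a minimal sketch, not a tested tm solution: `splitLongTerms` is a hypothetical name, the lookup table is a base R data.frame standing in for wordMap, and the long terms are assumed to contain only letters (so they need no regex escaping).

```r
# Stand-in for wordMap (assumption: a data.frame with a list column).
wordMap <- data.frame(
  longTerm = c("linedconcealed", "jacketcontrasting",
               "spandexspecialist", "inchescmmidweightsemi"),
  stringsAsFactors = FALSE
)
wordMap$words <- list(
  c("lined", "concealed"),
  c("jacket", "contrasting"),
  c("spandex", "specialist"),
  c("inchescm", "mid", "weight", "semi")
)

# Replace every long term in a character vector with its split form.
# Word boundaries (\\b) keep the match from firing inside longer words.
splitLongTerms <- function(x, map = wordMap) {
  for (i in seq_len(nrow(map))) {
    pattern <- paste0("\\b", map$longTerm[i], "\\b")
    replacement <- paste(map$words[[i]], collapse = " ")
    x <- gsub(pattern, replacement, x, perl = TRUE)
  }
  x
}

splitLongTerms("fully linedconcealed hook fastenings")
# "fully lined concealed hook fastenings"
```

With tm, a function like this would typically be wrapped as tm_map(testC, content_transformer(splitLongTerms)). That call assumes a tm version that provides content_transformer() and has not been run against the corpus above.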

Thanks in advance for your time!

0 answers:

No answers yet