R error caused by PCRE configuration: Unicode properties

Asked: 2017-02-16 15:01:16

Tags: r regex unicode pcre tm

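I am using removeWords together with tm_map() from the tm package to process some text data. My understanding is that it simply calls gsub() with a Perl-style regular expression to do the work.

However, when I run my code I get a strange error. I am using R 3.3.2.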

docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)

I get:

Error in gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  : 
  invalid regular expression '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be|at|as|aren't|are|any|and|an|am|all|against|again|after|above|about|a
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'this version of PCRE is not compiled with Unicode property support'
    at '(*UCP)\b(yourselves|yourself|yours|your|you've|you're|you'll|you'd|you|wouldn't|would|won't|with|why's|why|whom|who's|who|while|which|where's|where|when's|when|what's|what|weren't|were|we've|we're|we'll|we'd|we|wasn't|was|very|up|until|under|too|to|through|those|this|they've|they're|they'll|they'd|they|these|there's|there|then|themselves|them|theirs|their|the|that's|that|than|such|some|so|shouldn't|should|she's|she'll|she'd|she|shan't|same|own|over|out|ourselves|ours|our|ought|other|or|only|once|on|off|of|not|nor|no|myself|my|mustn't|most|more|me|let's|itself|its|it's|it|isn't|is|into|in|if|i've|i'm|i'll|i'd|i|how's|how|his|himself|him|herself|hers|here's|here|her|he's|he'll|he'd|he|having|haven't|have|hasn't|has|hadn't|had|further|from|for|few|each|during|down|don't|doing|doesn't|does|do|didn't|did|couldn't|could|cannot|can't|by|but|both|between|below|being|before|been|because|be| [... truncated]

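As far as I can tell, the important part is "this version of PCRE is not compiled with Unicode property support". Any ideas on how to fix this? I ran pcre_config() in R and got the following: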

     UTF-8 Unicode properties                JIT 
      TRUE              FALSE              FALSE 
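The flag shown above can also be checked in code before the regular expression is built. The following is a minimal sketch, not tm's implementation: `remove_words_safe` is a hypothetical helper, and the pattern is reconstructed from the error message (the `collapse = "|"` and `perl = TRUE` arguments are assumptions inferred from it). It drops the `(*UCP)` prefix when Unicode property support is missing, so `\b` falls back to ASCII word boundaries.

```r
# Hypothetical helper (not part of tm): removes whole words from x,
# using (*UCP) only when R's PCRE build reports Unicode property support.
remove_words_safe <- function(x, words) {
  # Longest-first ordering mirrors the sort() seen in the error message.
  alternation <- paste(sort(words, decreasing = TRUE), collapse = "|")
  has_ucp <- isTRUE(pcre_config()[["Unicode properties"]])
  prefix <- if (has_ucp) "(*UCP)" else ""
  pattern <- sprintf("%s\\b(%s)\\b", prefix, alternation)
  gsub(pattern, "", x, perl = TRUE)
}

cleaned <- remove_words_safe("this is a test of the tool",
                             c("this", "is", "a", "of", "the"))
# Removed words leave their surrounding spaces behind, so squeeze them.
cleaned <- gsub("\\s+", " ", trimws(cleaned))
print(cleaned)
```

Like the pattern in the error message, this sketch does not escape regex metacharacters in `words`, and without `(*UCP)` non-ASCII letters may not be treated as word characters.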

Outside of R, I ran pcretest -C and got the following:

PCRE version 7.8 2008-09-05
Compiled with
  UTF-8 support
  Unicode properties support
  Newline sequence is LF
  \R matches all Unicode newlines
  Internal link size = 2
  POSIX malloc threshold = 10
  Default match limit = 10000000
  Default recursion depth limit = 10000000
  Match recursion uses stack
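The two reports need not describe the same library: R may be linked against a different PCRE build than the system one that pcretest reports on. A small sketch for comparing them from inside R, assuming a version of R that provides extSoftVersion() (the variable names are illustrative only):

```r
# Version string of the PCRE library this R session is actually using.
r_pcre <- extSoftVersion()[["PCRE"]]

# Capabilities of that same library, as a named logical vector.
cfg <- pcre_config()

cat("PCRE used by R:", r_pcre, "\n")
cat("Unicode properties supported:", cfg[["Unicode properties"]], "\n")
```

If the version printed here differs from the one pcretest reports, the system library's Unicode property support says nothing about the build R uses.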

Any feedback is greatly appreciated.

1 Answer:

Answer 0 (score: 1)

RickyB

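I ran into the same problem while trying to build a word cloud tool. For some reason the "stopwords" removal did not work properly.

I found a solution here: Manual removal of stopwords

After making a few changes to the code from the link above, this is my code: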

docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores=1)

Manually removing the stop words:

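library(tm)  # provides stopwords()
# Download the SMART stop list (kept from the original; the result is not used below).
r <- read.table(url("http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/"), fill = TRUE)
stopWords <- r
vstop <- as.vector(stopWords)
# tm also ships the SMART list, and this is what the filter below actually uses.
stpWrd <- stopwords("SMART")
# Keep only the tokens that are not stop words; `text` is assumed to hold one word per element.
text <- unlist(text)[!(unlist(text) %in% stpWrd)]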
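
The same idea can be tried without tm or a download. This is only a sketch under stated assumptions: `drop_stop_words` is a hypothetical helper, the stop list is a tiny illustrative one, and tokens are split on whitespace only, so punctuation stays attached to words. Because it filters tokens with %in% instead of a Perl regular expression, it does not depend on how PCRE was compiled.

```r
# Hypothetical helper: lower-cases each document, splits it on whitespace,
# and drops every token found in stop_words.
drop_stop_words <- function(text, stop_words) {
  tokens <- strsplit(tolower(text), "[[:space:]]+")
  vapply(tokens, function(tok) {
    keep <- nzchar(tok) & !(tok %in% stop_words)
    paste(tok[keep], collapse = " ")
  }, character(1))
}

# Small illustrative list; stopwords("english") from tm could be passed instead.
stop_words <- c("this", "is", "a", "of", "the")
result <- drop_stop_words(c("This is a test of the tool", "The end"), stop_words)
print(result)
```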

Hope this helps.