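Remove unicode <+f0b7> from Corpus text

Asked: 2014-06-10 17:56:53

Tags: r tm

I have a very stubborn problem... I can't seem to remove the <+f0b7> and <+f0a0> strings from Corpora that have been loaded into R from *.txt files:

Update: here is a link to a sample .txt file: https://db.tt/qTRKpJYK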

Corpus(DirSource("./SomeDirectory/txt/"), readerControl = list(reader = readPlain))

title
 professional staff - contract - permanent position
software c microfocus cobol unix btrieve ibm vm-cms vsam cics jcl
accomplishments
 <+f0b7>
<+f0a0>
responsible maintaining billing system interfaced cellular switching system <+f0b7>
<+f0a0>
developed unix interface ibm mainframe ericsson motorola att cellular switches

I have tried adding them to my list of words to remove:

badWords <- unique(c(stopwords("en"), 
          stopwords("SMART")[stopwords("SMART") != "c"],
          as.character(1970:2050),
          "<U+F0B7>", "<+f0b7>",
          "<U+F0A0>", "<+f0a0>",
          "january",  "jan",
          "february",   "feb",
          "march",  "mar",
          "april",  "apr",
          "may",    "may",
          "june",   "jun",
          "july",   "jul",
          "august", "aug",
          "september",  "sep",
          "october",    "oct",
          "november",   "nov",
          "december",   "dec"))

and then using:

tm_map(candidates.Corpus, removeWords, badWords)

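But somehow that does not work. I have also tried to weed the strings out with things like gsub("<+f0a0>", "", tmp, perl = FALSE). That works on strings created in R, but the characters still show up when I read the .txt files.

Is there something unique about these characters? How can I get rid of them?

1 answer:

Answer 0: (score: 1)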

OK. The problem is that your data contains an unusual unicode character. In R, we typically escape this character as "\uf0b7". But when inspect() prints its data, it encodes it as "<U+F0B7>". Observe

sample<-c("Crazy \uf0b7 Character")
cp<-Corpus(VectorSource(sample))
inspect(DocumentTermMatrix(cp))

# A document-term matrix (1 documents, 3 terms)
# 
# Non-/sparse entries: 3/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs <U+F0B7> character crazy
#    1        1         1     1

(Actually, I had to create this output on a Windows machine running R 3.0.2; it displayed fine on a Mac running R 3.1.0.)
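As for what is unusual about these characters: a quick check of their code points helps. The following is a minimal base R sketch, not part of the original answer. It shows that both characters fall in Unicode's Private Use Area (U+E000 to U+F8FF), a range with no standard meaning. Such characters often come from symbol-font bullets in word-processor documents, though that is only an assumption about these particular files.

```r
chars <- c("\uf0b7", "\uf0a0")

# utf8ToInt() returns the Unicode code point of a single character.
code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
sprintf("U+%04X", code_points)

# The Private Use Area of the Basic Multilingual Plane spans U+E000 to U+F8FF.
in_private_use <- code_points >= 0xE000 & code_points <= 0xF8FF
in_private_use
```

Because such characters are neither letters nor digits, regular expressions do not treat them as word characters.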

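Unfortunately, you cannot remove it with removeWords, because the regular expression used in that function requires a word boundary on both sides of the "word", and this character does not seem to be one that is recognized as forming a boundary. See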

gsub("\uf0b7","",sample)
# [1] "Crazy  Character"
gsub("\\b\uf0b7\\b","",sample)
# [1] "Crazy <U+F0B7> Character"
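To make the boundary issue concrete, here is a small base R sketch (the variable names are illustrative only). It relies on \b matching only between a word character and a non-word character. U+F0B7 is not a word character and it sits between two spaces, so no boundary exists on either side of it.

```r
x <- "Crazy \uf0b7 Character"

# With \b anchors nothing matches: both neighbours of U+F0B7 are spaces,
# and U+F0B7 itself is not a word character.
with_boundary <- gsub("\\b\uf0b7\\b", "", x, perl = TRUE)
still_present <- grepl("\uf0b7", with_boundary, fixed = TRUE)
still_present

# Without the anchors the character is matched and removed.
without_boundary <- gsub("\uf0b7", "", x, perl = TRUE)
without_boundary

# An ordinary word does have boundaries on both sides, so it is removed.
word_removed <- gsub("\\bCrazy\\b", "", x, perl = TRUE)
grepl("Crazy", word_removed, fixed = TRUE)
```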

So we can write our own function to use with tm_map. Consider

# Remove every occurrence of the given characters, with no \b anchors.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

which is basically the removeWords function without the boundary conditions. Then we can run

cp2 <- tm_map(cp, removeCharacters, c("\uf0b7","\uf0a0"))
inspect(DocumentTermMatrix(cp2))

# A document-term matrix (1 documents, 2 terms)
# 
# Non-/sparse entries: 2/0
# Sparsity           : 0%
# Maximal term length: 9 
# Weighting          : term frequency (tf)
# 
#     Terms
# Docs character crazy
#    1         1     1

and we see that those unicode characters are no longer there.
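One caveat for newer versions of tm (0.6 and later, as far as I know): tm_map expects a custom function to be wrapped in content_transformer(), otherwise the corpus can end up holding bare strings instead of documents. The following is a sketch under that assumption, with tm installed. It redefines the same helper so that it runs on its own, and uses VCorpus and content() rather than the calls shown in the original output.

```r
library(tm)

# Same helper as in the answer: removeWords without the \b anchors.
removeCharacters <- function(x, characters) {
  gsub(sprintf("(*UCP)(%s)", paste(characters, collapse = "|")), "", x, perl = TRUE)
}

cp <- VCorpus(VectorSource("Crazy \uf0b7 Character"))

# content_transformer() applies the function to each document's text
# and returns the document object rather than a bare string.
cp2 <- tm_map(cp, content_transformer(removeCharacters), c("\uf0b7", "\uf0a0"))

cleaned <- content(cp2[[1]])
cleaned
```

The extra arguments after the function are passed through by tm_map, so the list of characters to strip can be changed without touching the helper.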